具身智能之心
NeurIPS'25! AutoSeg3D: online segmentation of arbitrary 3D instances, on just one RTX 4090
具身智能之心· 2025-12-12 01:22
Click the card below to follow the "具身智能之心" official account. Editor: 具身智能之心. This article is shared for academic purposes only; in case of infringement, contact us for removal. >> Click here to join the 具身智能之心 technical discussion group. For more resources, join China's first full-stack embodied-intelligence learning community: the 具身智能之心 Knowledge Planet, which has everything you need.

Foreword: In the large-model era everyone is competing on scaling, and for tasks like embodied autonomous driving people seem to want at least 8 GPUs just to train one model. Taking this opportunity, I would like to recommend a direction where a single RTX 4090 is enough to publish at a top venue: point-cloud instance segmentation for embodied scenes, the topic of this paper. To be clear, this is not a recipe for churning out low-effort papers on a small budget. I steered a student toward this direction because I believed it was a technology that could genuinely be deployed, and, not entirely surprisingly, this paper is already undergoing technology transfer at two companies. For embodied AI, VLA and the various so-called world models are certainly fancy, but many less-fancy directions can both produce papers and land in real products, and I hope to see more foundational techniques researched and optimized to support true industrialization. You are also welcome to intern at the 无界-AutoLab joint laboratory (Shanghai) and build interesting technical directions with us :) -- Dylan

Paper summary: (1) The authors observe that existing online VFM-assisted methods typically first use VFMs such as SAM to predict 2D ...
AAAI 2026 Oral | Robots can "learn by watching people"? A single demonstration is enough to learn a new task!
具身智能之心· 2025-12-12 01:22
Core Insights
- The article discusses a novel approach to robot learning through human demonstration, emphasizing the importance of fine-grained action alignment between human and robot movements [3][4][8].
- The proposed method, Human2Robot, utilizes a new dataset (H&R) and a two-stage framework to enhance robot learning capabilities, enabling one-shot generalization to new tasks [3][4][9].

Summary by Sections

Introduction
- Existing methods rely on coarse alignment of human-robot video pairs, which often leaves models without the fine-grained action understanding necessary for task generalization [3][8].

Methodology
- A new dataset, H&R, consisting of 2,600 synchronized human and robot action videos, is introduced to facilitate better learning [9].
- The Human2Robot framework consists of two main stages: a Video Prediction Model (VPM) and an Action Decoder [12][16].

Video Prediction Model (VPM)
- The VPM generates robot action videos from human demonstrations, allowing the model to learn detailed action dynamics [13][14].
- The model captures key information about the robot's shape and human hand movements through a Spatial UNet and a Spatial-Temporal UNet [15].

Action Decoder
- The Action Decoder translates the generated video features into specific robot movements, enabling real-time task execution without requiring continuous video input [16][20].

Experimental Results
- Human2Robot outperforms existing baselines with success-rate improvements of 10-20% across various tasks, demonstrating the effectiveness of conditioning on detailed human video [20][27].
- A KNN-based variant of Human2Robot performs well even without direct demonstration input, indicating robust task execution capabilities [20][27].

Generalization Capability
- Human2Robot exhibits strong generalization across tasks, including new positions and object instances, thanks to the clear action correspondences established by the H&R dataset [27].

Ablation Studies
- Experiments show that relying solely on human video input leads to poor performance, validating the necessity of the video generation stage for reliable action mapping [25][26].
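The two-stage pipeline summarized above (VPM, then Action Decoder) can be sketched as a minimal interface. This is purely illustrative: the function names, toy linear stand-ins, and the 7-DoF action shape are assumptions, not the paper's architecture.

```python
import numpy as np

def video_prediction_model(human_frames, robot_first_frame):
    """Stage 1 stand-in for the VPM: predict a robot action video
    conditioned on the human demonstration. Here a naive blend is used
    purely to fix the interface shapes (T, H, W, C)."""
    return 0.5 * human_frames + 0.5 * np.broadcast_to(
        robot_first_frame, human_frames.shape)

def action_decoder(robot_frames):
    """Stage 2 stand-in: map predicted frames to a per-frame 7-DoF action
    (e.g. 6 pose deltas + gripper) via mean-pool + a fixed linear head."""
    feats = robot_frames.reshape(robot_frames.shape[0], -1).mean(
        axis=1, keepdims=True)           # (T, 1) pooled features
    W = np.ones((1, 7)) * 0.01           # stand-in for learned weights
    return feats @ W                     # (T, 7) action sequence

human = np.random.rand(8, 16, 16, 3)     # 8-frame human demo
robot0 = np.random.rand(16, 16, 3)       # robot's current view
actions = action_decoder(video_prediction_model(human, robot0))
assert actions.shape == (8, 7)
```

The point of the two-stage split is that the decoder consumes the *generated* robot video, so at deployment time no continuous human video stream is needed.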
The 具身智能之心 job-search and referral service is now open to the public!
具身智能之心· 2025-12-11 09:33
The 具身智能之心 job-referral service is officially open! Nearly 50 mainstream embodied-AI companies; campus recruiting, experienced hires, and internships are all covered. Be the first to land reliable, well-paid positions; resumes welcome~ We will answer questions like these one by one: each company's salary structure √ technical roadmap and promotion path √ the industry's future outlook √ whether the job really suits your personality √ ......
Can the SO-100 alone reproduce the results of π0 and π0.5?
具身智能之心· 2025-12-11 09:33
Core Viewpoint
- The article discusses the challenges and complexities beginners face when implementing VLA (Vision-Language-Action) models, emphasizing the need for practical experience and effective training methods to achieve successful deployment in real-world applications [2][4].

Group 1: Challenges in VLA Implementation
- Many students report difficulty achieving effective results with open-source models like GR00T and PI0, despite low training loss in simulation [2][4].
- The transition from simulation to the real world (sim2real) poses significant challenges, particularly in data collection and model training [6][7].
- Beginners often struggle with the intricacies of data collection, model training, and deployment, leading to frustration and stalled progress [4][10].

Group 2: VLA Model Components
- Data collection for VLA primarily relies on imitation learning and reinforcement learning, with a focus on acquiring high-quality data [6].
- Training VLA models typically requires simulation debugging and fine-tuning, especially when real-world data is limited [7].
- Deploying VLA models requires optimization techniques such as model compression to ensure efficient performance on edge devices [9].

Group 3: Educational Initiatives
- The article introduces a practical course aimed at helping students learn VLA effectively, covering hardware, data collection, algorithms, and real-world experiments [10][12].
- The course is designed for people entering the field of embodied intelligence, providing hands-on experience and project support [22][25].
- The course begins on December 30, 2025, with a comprehensive curriculum to build participants' VLA skills [23][26].
World's first! Psi-SynEngine, a real-world embodied data-collection engine for dexterous hands, has arrived
具身智能之心· 2025-12-11 04:02
This article originates from 灵初智能, by PsiBot. 灵初智能 advances RL-based robot skill training, scenario-based data generation and collection, and end-to-end solution development and deployment, building industry-leading general-purpose manipulation agents. Author: PsiBot. Editor: 机器之心.

灵初智能 has released Psi-SynEngine, the world's first embodied-native human data-collection solution. Developed fully in-house, it comprises a portable exoskeleton tactile-glove capture kit, a large-scale in-the-wild data-collection pipeline, and a cross-embodiment data-transfer model based on world models and reinforcement learning; the collected human data has already been applied in real scenarios such as logistics. 灵初智能 is simultaneously releasing Psi-SynNet-v0, a large-scale real-world multimodal dataset covering vision, language, touch, and action. This breakthrough marks the official launch of 灵初智能's fully in-house real-world embodied data engine. Compared with large models and autonomous driving, data has long been the pain point of the entire embodied-intelligence field. Relative to the industry's existing data-collection solutions, 灵初智能's Psi-SynEngine fundamentally breaks through ...
Building the scenes too? The company making brains for 智元机器人 raised 300 million yuan in 4 months
具身智能之心· 2025-12-11 04:02
What exactly does this newcomer do, founded less than half a year ago yet already backed by a 200-million-yuan angel round? Recently, 星源智机器人, the company supplying brain products to 智元机器人, closed an angel+ round of over 100 million yuan. The proceeds will fund R&D on the professional embodied-brain product RoboBrain Pro, expansion into vertical-industry solutions, and recruitment of top talent.

星源智 was registered in Beijing Yizhuang in the second half of the year, incubated by 北京智源研究院, with the goal of "letting robots understand the physical world and act autonomously." CEO 刘东 was formerly general manager of JD's intelligent driving unit, leading the nationwide rollout of JD's unmanned delivery vehicles; co-founder 穆亚东 is a Peking University researcher and 智源 scholar who has published 30+ top-venue embodied-intelligence papers over the past five years.

On the product side, 星源智 offers a "general-purpose brain" plus a "compute bomb." The cross-embodiment RoboBrain: a single AI system that plugs into robot arms, AGVs, and humanoid robots without retraining. 智元's newly released industrial-grade interactive embodied robot "精灵G2" already carries 星源智's embodied-brain product (智元, a partner of 星源智, invested as early as the angel round). A 2070-TOPS edge compute platform: the 星源智 T5 platform is built on the NVIDIA Jetson Thor processor, with powerful Tr ...
From video generation to robot manipulation: VideoVLA opens a new paradigm for general-purpose robots
具身智能之心· 2025-12-11 04:02
Authors: Yichao Shen et al. In robot manipulation, vision-language-action (VLA) models can already execute tasks driven by language instructions, but generalization remains a bottleneck: handling unfamiliar objects and transferring skills across robots. Existing approaches mostly rely on pretrained vision-language understanding models and struggle to move beyond the scenarios covered by their training data.

The VideoVLA framework, jointly proposed by Xi'an Jiaotong University, Microsoft Research Asia, and other institutions, innovatively converts a large-scale video generation model into an end-to-end VLA system. Through a dual-objective strategy of "action prediction + visual imagination," it achieves, for the first time, robust generalization in unseen scenarios, offering a new technical path toward general-purpose robot manipulation.

Paper title: VideoVLA: Video Generators Can Be Generalizable Robot Manipulators. Core contributions: the first work to turn a video generation model into a general robot manipulation system; by jointly predicting action sequences and future visual outcomes, it unlocks cross-object, cross-ski ...
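The "action prediction + visual imagination" dual objective amounts to jointly supervising predicted actions and predicted future frames. A minimal sketch, assuming simple L2 terms and a hypothetical weighting `lam` (the paper's exact loss forms may differ):

```python
import numpy as np

def joint_loss(pred_actions, gt_actions, pred_frames, gt_frames, lam=0.1):
    """Sketch of a dual-objective VLA loss: one term supervises the
    action sequence, the other supervises the 'imagined' future frames.
    Both terms are L2 here; `lam` balances them (an assumption)."""
    action_loss = np.mean((pred_actions - gt_actions) ** 2)
    video_loss = np.mean((pred_frames - gt_frames) ** 2)
    return action_loss + lam * video_loss
```

The intuition behind the second term: forcing the model to imagine visual outcomes keeps the video generator's scene understanding alive while it learns to act, rather than discarding it during action fine-tuning.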
Goodbye expert dependence: robots learn self-reference, with performance soaring to 99.2% in just 200 steps
具身智能之心· 2025-12-11 02:01
Core Insights
- The article discusses the Self-Referential Policy Optimization (SRPO) framework, which addresses the limitations of existing vision-language-action (VLA) models in robotic tasks by enabling robots to learn from their own experience without relying on external expert data [3][10][56].

Motivation and Contribution
- SRPO aims to overcome the challenge of sparse reward signals in reinforcement learning, particularly in the VLA domain, by using self-generated successful trajectories to provide progressive rewards for failed attempts [6][10].
- The framework eliminates the need for costly expert demonstrations and task-specific reward engineering, improving the efficiency of the learning process [10][12].

Technical Approach
- SRPO collects trajectories generated during policy inference and categorizes them into successful and failed attempts, using a latent world representation to model behavioral similarity [16][17].
- A progressive reward mechanism scores each failed trajectory by its distance to the representations of successful trajectories, allowing a more nuanced evaluation of task progress [22][24].

Experimental Results
- SRPO achieved a 99.2% success rate on the LIBERO benchmark with only 200 reinforcement-learning steps, significantly outperforming traditional methods that rely on sparse rewards [29][30].
- In the LIBERO-Plus generalization tests, SRPO delivered a 167% performance improvement without any additional training data, demonstrating robust generalization capabilities [31][32].

Efficiency and Real-World Application
- SRPO improves success rates on long-horizon tasks from 17.3% to 98.6% with minimal training steps, outperforming other models in training efficiency [36][39].
- The framework has been tested in real-world scenarios, showing significant gains in success rate over supervised fine-tuning baselines [39][41].

Conclusion
- SRPO represents a significant advance in robot learning: by letting robots learn from their own successes and failures, it enables autonomous exploration and paves the way for a new approach to VLA reinforcement learning [56].
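The progressive-reward idea above can be sketched in a few lines: a failed rollout earns a dense reward based on how close its latent trajectory representation is to the nearest self-generated successful trajectory. The embedding source, kernel choice, and bandwidth `sigma` are assumptions for illustration, not SRPO's exact formulation.

```python
import numpy as np

def progressive_reward(failed_traj_emb, success_embs, sigma=1.0):
    """Self-referential reward sketch: distance from a failed
    trajectory's embedding to its nearest successful-trajectory
    embedding, squashed by an RBF kernel so that identical behavior
    scores 1.0 and distant behavior decays toward 0."""
    dists = np.linalg.norm(success_embs - failed_traj_emb, axis=1)
    return float(np.exp(-dists.min() ** 2 / (2 * sigma ** 2)))

# toy usage: two successful embeddings, one near-miss rollout
success = np.array([[1.0, 0.0], [0.0, 1.0]])
near_miss = np.array([0.9, 0.1])
r = progressive_reward(near_miss, success)
assert 0.0 < r <= 1.0
```

Because the reference set grows from the policy's own successes, no expert demonstrations or hand-designed task rewards are needed; failed attempts that get "almost there" still receive learning signal.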
A Shenzhen University team enables precise robot navigation! Success rate up to 72.5%, inference efficiency +40%
具身智能之心· 2025-12-11 02:01
Core Insights
- The article introduces the UNeMo framework for vision-language navigation (VLN), which significantly improves navigation success rates while reducing resource consumption compared to mainstream methods [4][10][33].

Group 1: UNeMo Framework Overview
- UNeMo integrates a multi-modal world model (MWM) and a hierarchical predictive feedback navigator (HPFN) to address the disconnect between reasoning and decision-making in existing VLN methods [10][33].
- The framework lets navigation agents predict future visual states from current visual features and language instructions, enhancing decision-making capabilities [11][12].

Group 2: Performance Metrics
- On the R2R dataset, UNeMo achieved a navigation success rate (SR) of 72.5% in unseen environments, surpassing NavGPT2's 71% by 1.5 percentage points [25].
- UNeMo uses only 30% of NavGPT2's model parameters, yielding a 56% reduction in GPU memory during training and a 40% increase in inference speed [23][24].

Group 3: Robustness in Complex Scenarios
- UNeMo showed a 5.6% SR gain on long-path navigation (length ≥ 7), versus only a 1.2% gain on short paths (length < 7), indicating its effectiveness at mitigating cumulative errors in long-distance tasks [28][29].

Group 4: Cross-Scenario Adaptability
- Tested across various navigation baselines and datasets, the framework improved SR and remote goal success (RGS) in unseen scenarios, confirming its adaptability beyond LLM-based systems [31][32].

Group 5: Conclusion
- UNeMo addresses the high resource consumption and reasoning-decision disconnect of traditional VLN methods, offering a lightweight yet high-performance solution for practical service-robot applications and advancing the VLN field [33].
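The predict-then-decide loop described above can be sketched as a minimal interface: a world model imagines the next visual state, and the navigator conditions its action on both the current and the imagined features. All weights and function names here are toy stand-ins, not UNeMo's learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 32, 4                                   # feature dim, action count
W_world = rng.normal(size=(2 * D, D)) * 0.1    # stand-in MWM weights
W_policy = rng.normal(size=(2 * D, A)) * 0.1   # stand-in navigator head

def imagine_next(visual, instr):
    """Multi-modal world model (MWM) stand-in: predict the next visual
    feature from the current view plus the instruction embedding."""
    return np.tanh(np.concatenate([visual, instr]) @ W_world)

def choose_action(visual, instr):
    """Navigator stand-in: condition on both the current and the
    imagined next observation, closing the predict->decide loop."""
    nxt = imagine_next(visual, instr)
    logits = np.concatenate([visual, nxt]) @ W_policy
    return int(np.argmax(logits))

a = choose_action(rng.normal(size=D), rng.normal(size=D))
assert 0 <= a < A
```

Feeding the imagined next state back into the policy is what couples "reasoning" (prediction) with "decision-making" (action selection) in a single forward pass.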
Surpassing π0 and π0.5 across the board! Lumo-1, an end-to-end whole-body VLA model: entering the era of the reasoning-action closed loop
具身智能之心· 2025-12-11 02:01
Let the robot "heat the bread": quickly gathering stationery from a cluttered desk, and delicately handling items of different shapes, materials, and sizes ⚡️ "Put the cola on the blue plate": it even reasons that the left arm should go first, but switches to the right hand when blocked.

From walking and dancing to backflips, motion imitation has taught robots "how to move." But for complex operations like carrying plates, sorting fruit, or heating food, a robot cannot merely imitate; it must perceive complex environments, understand the "why" behind the task intent, and then convert it into coherent "do it this way" operations.

Human action generally rests on context and intent, with reasoning at its core. For robots, although large-scale internet data has given AI systems like GPT and DeepSeek solid reasoning ability, making AI "act accurately" through reasoning in the real physical world, especially on multi-step long-horizon tasks, ambiguous instructions, and unseen situations, remains highly challenging.

Even without having seen this particular bread, the robot identifies it by reasoning, infers that heating means using the microwave, and then opens the door, picks up the bread, places it inside, closes the door, turns the knob, waits, and takes it out... no programming, completed end to end through reasoning! "Tidying stationery ...
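The "heat the bread" walkthrough above is essentially intent-to-steps decomposition followed by execution. A toy sketch of that reasoning stage, using a hard-coded step table purely for illustration (Lumo-1 does this with a learned VLA model, not a lookup; all names here are hypothetical):

```python
def plan(goal, can_execute):
    """Toy stand-in for the reasoning stage: map a high-level intent to
    an ordered step sequence, then keep only steps the robot can
    currently execute (affordance check supplied by the caller)."""
    library = {
        "heat bread": ["open microwave", "pick up bread", "place inside",
                       "close door", "set knob", "wait", "take out"],
    }
    steps = library.get(goal, [])
    return [s for s in steps if can_execute(s)]

steps = plan("heat bread", lambda s: True)
assert steps[0] == "open microwave" and steps[-1] == "take out"
```

The closed loop comes from re-running this reasoning whenever execution fails (e.g. the left arm is blocked), so the plan adapts instead of replaying a fixed script.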