具身智能之心
Diffusion World Model LaDi-WM Substantially Improves Robot Manipulation Success Rates and Cross-Scene Generalization
具身智能之心· 2025-08-18 00:07
In robot manipulation tasks, predictive policies have recently attracted broad attention in embodied AI because they can use predicted states to improve manipulation performance. However, getting a world model to predict the precise future states of robot-object interaction remains a recognized challenge, especially when generating high-quality pixel-level representations. To address this, a team from the National University of Defense Technology, Peking University, and Shenzhen University proposes LaDi-WM (Latent Diffusion-based World Models), a latent-diffusion world model that predicts future states in latent space. Specifically, LaDi-WM builds its latent representation with pretrained vision foundation models, combining geometric features (constructed from DINOv2) and semantic features (constructed from SigLIP). This representation is broadly general-purpose, which benefits policy learning for robot manipulation and cross-task generalization.
Editor: 机器之心
Based on LaDi-WM, the team designs a diffusion policy that integrates the predicted states generated by the world model to ...
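As a rough illustration of the kind of joint latent state described above, the following is a minimal sketch, not the authors' code: two frozen placeholder encoders stand in for DINOv2-style geometric and SigLIP-style semantic backbones, their pooled features are concatenated into one latent, and a small placeholder network stands in for the latent-diffusion denoiser conditioned on the current latent and the robot action. All modules, dimensions, and the 7-DoF action size are assumptions for illustration only.

```python
# Minimal sketch (not LaDi-WM's implementation): a joint geometric + semantic latent state
# and a toy denoiser conditioned on it. Encoders and dimensions are illustrative stand-ins.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen vision foundation model (e.g. a DINOv2- or SigLIP-style backbone)."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.backbone = nn.Conv2d(3, out_dim, kernel_size=16, stride=16)  # placeholder patchifier
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)        # (B, D, H/16, W/16)
        return feat.flatten(2).mean(-1)  # (B, D) pooled latent

geom_enc = FrozenEncoder(out_dim=384)  # geometric branch (DINOv2-like)
sem_enc = FrozenEncoder(out_dim=768)   # semantic branch (SigLIP-like)

img = torch.randn(1, 3, 224, 224)
latent = torch.cat([geom_enc(img), sem_enc(img)], dim=-1)  # joint latent state, shape (1, 1152)

# A latent-diffusion world model would be trained to denoise a noised *future* latent,
# conditioned on the current latent and the action; here a tiny MLP stands in for that denoiser.
denoiser = nn.Sequential(nn.Linear(1152 + 1152 + 7, 512), nn.SiLU(), nn.Linear(512, 1152))
noisy_future, action = torch.randn(1, 1152), torch.randn(1, 7)
pred_noise = denoiser(torch.cat([noisy_future, latent, action], dim=-1))
print(pred_noise.shape)  # predicted noise over the future latent
```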
Sun Yat-sen & Tsinghua: A Survey of Embodied Intelligence Systems Based on Large Models
具身智能之心· 2025-08-16 16:03
Core Viewpoint - The article provides a comprehensive overview of embodied intelligence systems based on large models, highlighting their applications, challenges, and future directions in various domains such as home services, healthcare, education, and industry [6][39].

Summary by Sections

Perception and Understanding
- Embodied intelligence systems utilize sensors like cameras and microphones to receive raw data and interpret it to form environmental awareness. Large models excel in processing multimodal input data, effectively integrating text, images, and audio to capture relationships and extract high-dimensional features for understanding the world [5][6].
- Multimodal models, such as GPT-4V, enhance the understanding of environments by encoding images and text into a shared vector space, facilitating perception and comprehension of user instructions [9].

Control Levels
- The control levels of embodied intelligence systems are categorized into demand level, task level, planning level, and action level, each with representative works that demonstrate the application of large models [6][11].

System Architecture
- The architecture of embodied intelligence systems includes end-to-end Transformer architectures and combinations of frozen-parameter large models with foundational models, allowing for flexible optimization without sacrificing generalization [21][29].

Data Sources
- Data sources for training embodied intelligence systems include simulators, imitation learning, and video learning, with simulators providing a controlled environment for rapid data collection and testing [31][32].

Challenges
- Key challenges faced by embodied intelligence systems include the scarcity of real-world data, slow inference speeds, and the need for multi-agent collaboration in complex tasks [39][40].

Future Development Directions
- Future directions for embodied intelligence systems involve improving data collection methods, optimizing large models for faster inference, enhancing multi-agent collaboration, and expanding applications across various fields [41][44].
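To make the "shared vector space" idea concrete, here is a small hedged sketch using the publicly available CLIP model as a stand-in for the multimodal encoders the survey discusses (the survey cites GPT-4V; CLIP is chosen here only because it is a small, open example of image and text embeddings living in one space). The instruction strings and the blank placeholder image are illustrative assumptions.

```python
# Sketch of encoding an image and candidate instructions into one shared embedding space,
# then scoring their relevance with cosine similarity. Not taken from the survey itself.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; in practice a robot camera frame
texts = ["pick up the red cup", "open the drawer"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Both modalities now live in the same space, so instruction-image relevance is a cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print((txt_emb @ img_emb.T).squeeze())  # one similarity score per instruction
```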
Still Can't Get into Embodied AI? Others Are Already Overtaking on the Curve...
具身智能之心· 2025-08-16 16:03
Yesterday afternoon a student came to 峰哥 to vent: he had just joined an embodied-AI company, his lead asked him to debug a robot, and he had no idea how to do data collection and debugging; there were too many degrees of freedom. He was also at a loss for how to analyze problems. Running demos at school was fine, but on a real robot the pitfalls are everywhere. We have run into this kind of question many times in our embodied community: how to use the hardware, how to collect data effectively, how to deploy VLA models, and whether the problem is an overly complex capture background or simply dirty data. We quickly gave him an answer that he could put straight into his project. A community that can solve problems at the moment people need help most is clearly valuable. The 具身智能之心知识星球 (the first full-stack embodied-AI technical community in China) has already closed the loop across industry, academia, job hunting, and Q&A: whatever problem comes up, a solution gets shared; whichever research direction is hottest, we keep supplying ideas; and job openings are passed on to members first. Beyond the questions above, we have also organized plenty of other material:
Which platforms exist for robot simulation and data collection?
How do humanoid robots do imitation learning? Why is VLA hard to get right?
How is VLA used in robot grasping and planning tasks?
How is VLA+RL done, and why does it work?
......
Even better: inside the community we have organized nearly 30+ technical roadmaps, whether you are looking for benchmarks, surveys, or beginner learning paths ...
ICCV 2025 | HERMES: The First World Model to Unify 3D Scene Understanding and Generation
具身智能之心· 2025-08-16 16:03
Editor: 机器之心
The first author, 周鑫, and co-first author, 梁定康, are both PhD students at Huazhong University of Science and Technology, advised by Professor 白翔. Collaborators include 涂思凡 (HUST), 丁宜康 (旷视科技), 陈习武 and 谭飞杨 (迈驰智行), and assistant professor 赵恒爽 (The University of Hong Kong).
In complex urban scenes, HERMES not only accurately predicts vehicle and environment dynamics over the next three seconds (such as the truck marked by the red circle), but also performs in-depth understanding of and question answering about the current scene (such as correctly recognizing the "Starbucks" and describing road conditions).
Paper title: HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Paper link: https://arxiv.org/abs/2501.14729
Background and motivation: autonomous driving has made remarkable progress in recent years. For an intelligent vehicle to drive safely and efficiently on complex real roads, it must have two core capabilities: a deep understanding of the current environment (for example ...
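The following is a schematic sketch, not HERMES itself, of the "unified understanding + generation" idea the summary describes: one shared scene representation feeding both a question-answering head and a future-scene rollout head. Every module, dimension, and the three-step horizon are assumptions chosen only to mirror the "next three seconds" framing above.

```python
# Toy unified driving world model: shared scene encoder, QA head (understanding),
# and recurrent rollout head (generation). Placeholder architecture for illustration only.
import torch
import torch.nn as nn

class UnifiedDrivingWorldModel(nn.Module):
    def __init__(self, scene_dim=256, vocab_size=1000, future_dim=256):
        super().__init__()
        self.scene_encoder = nn.Linear(1024, scene_dim)        # stand-in for a BEV/3D scene encoder
        self.qa_head = nn.Linear(scene_dim, vocab_size)        # understanding: token logits for scene QA
        self.future_head = nn.GRU(scene_dim, future_dim, batch_first=True)  # generation: future scene latents

    def forward(self, scene_feats: torch.Tensor, horizon: int = 3):
        z = self.scene_encoder(scene_feats)                                   # shared representation (B, scene_dim)
        answer_logits = self.qa_head(z)                                       # e.g. "what is ahead of the car?"
        rollout, _ = self.future_head(z.unsqueeze(1).repeat(1, horizon, 1))   # (B, horizon, future_dim)
        return answer_logits, rollout                                         # 3 steps ~ "next three seconds"

model = UnifiedDrivingWorldModel()
logits, future = model(torch.randn(2, 1024))
print(logits.shape, future.shape)
```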
Evaluating the Performance and Limits of General-Purpose Policies like π0 in Complex Real-World Scenes
具身智能之心· 2025-08-16 16:03
Author: Jie Wang et al. Editor: 具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.
Blog: https://penn-pal-lab.github.io/Pi0-Experiment-in-the-Wild/
This is work from the GRASP Lab that evaluates PI0-FAST-DROID in the wild (in complex real-world scenes), which gives a more intuitive picture of the current performance and limits of general-purpose policies like PI0, and of the directions worth exploring next. There is of course an even newer PI0.5 version, but it has not been open-sourced yet.
Related resources: DROID dataset: https://droid-dataset.github.io/
Introduction: robot manipulation has long lacked pretrained models that can handle new objects, new positions, and new tasks "out of the box". Roboticists have often gone through the frustrating process of doing tedious engineering and data collection just to obtain a robot policy, only to ...
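For readers unfamiliar with how such "in the wild" evaluations are typically run, here is a minimal sketch of an evaluation loop: run a pretrained generalist policy on a set of language-specified tasks for a few trials each and tally success rates. The Pi0Policy class, its infer() method, and the task list are hypothetical placeholders, not the released π0 API or the GRASP Lab's protocol.

```python
# Toy in-the-wild evaluation loop for a generalist manipulation policy.
import random
from collections import defaultdict

class Pi0Policy:                                     # placeholder for a PI0-FAST-DROID-style policy
    def infer(self, observation: dict, instruction: str) -> list:
        return [0.0] * 7                             # fake 7-DoF action chunk

def run_episode(policy, instruction: str, max_steps: int = 100) -> bool:
    obs = {"image": None, "proprio": None}           # would come from the robot's cameras and encoders
    for _ in range(max_steps):
        action = policy.infer(obs, instruction)
        # ... send the action to the robot and refresh obs in a real setup ...
    return random.random() < 0.5                     # stand-in for a human success judgment

tasks = ["pick up the marker", "wipe the table", "put the cup in the drawer"]
scores = defaultdict(list)
policy = Pi0Policy()
for task in tasks:
    for _ in range(5):                               # a few trials per task, as in typical evaluations
        scores[task].append(run_episode(policy, task))
for task, results in scores.items():
    print(task, f"{100 * sum(results) / len(results):.0f}% success")
```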
Dexterous Hand Design and Its Challenges: Why Is It the Key Technology for Closing the Hand-Eye-Brain Perception Loop?
具身智能之心· 2025-08-15 16:03
As humanoid robots move toward real-world interaction, end effectors, and five-fingered dexterous hands in particular, are moving from imitating form into the deep water of imitating function. A dexterous hand with genuine research value and industrial potential is far more than a five-fingered shape: it should have at least three core characteristics, namely high physical dexterity (IOD), multimodal sensing (IOS), and intelligent decision-making potential (IOI). Current dexterous-hand transmission schemes roughly split three ways. Linkage transmission offers high structural rigidity and positioning accuracy, suiting repetitive tasks such as industrial grippers, but makes high-DoF integration difficult. Gear transmission is compact and controllable and is common in three-finger underactuated hands, but is limited in force-transmission efficiency and passive compliance. Tendon (cable) drive, adopted by Tesla's Optimus and the Shadow Hand, is seen as the most human-like, first-principles path: it is lightweight, transmits force over distance, and offers natural passive compliance, fitting the prediction-driven, dynamically adjusted control paradigm of the embodied-intelligence era. However, tendon drive also faces three major engineering problems: friction losses, pretension decay, and complex system integration. Dozens of high-strength braided tendons must be routed precisely within the small volume of the palm, and any single breakage can require system-level maintenance, which pushes material selection, pulley craftsmanship, and tension compensation mechanisms to their limits ...
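To make two of those engineering problems concrete, here is a minimal sketch, not from the article: the capstan relation as a simple model of friction loss along a routed tendon, and a small PI loop that compensates pretension decay by winding the drive spool until a measured tension setpoint is restored. All constants, gains, and the tension/spool interfaces are hypothetical.

```python
# Toy models of tendon friction loss and pretension compensation in a cable-driven hand.
import math

def capstan_output_tension(input_tension: float, mu: float, wrap_angle_rad: float) -> float:
    """Tension remaining after the tendon wraps its guides, per the capstan relation (friction coefficient mu)."""
    return input_tension * math.exp(-mu * wrap_angle_rad)

class PretensionCompensator:
    """PI controller that trims spool angle to hold a target resting tendon tension."""
    def __init__(self, target_n: float, kp: float = 0.002, ki: float = 0.0005):
        self.target_n, self.kp, self.ki, self.integral = target_n, kp, ki, 0.0

    def update(self, measured_n: float, dt: float) -> float:
        err = self.target_n - measured_n            # positive when the tendon has gone slack
        self.integral += err * dt
        return self.kp * err + self.ki * self.integral  # spool angle increment (rad) to rewind

# Example: 4 N at the motor, mu = 0.15, 270 degrees of cumulative wrap inside the palm.
print(capstan_output_tension(4.0, 0.15, math.radians(270)))  # tension left at the fingertip
comp = PretensionCompensator(target_n=4.0)
print(comp.update(measured_n=3.2, dt=0.01))                  # spool trim for a slackened tendon
```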
Latest from Tianjin University & Tsinghua! GeoVLA: Enhancing 3D Feature Extraction in VLA Models with Clear Robustness Gains (SOTA)
具身智能之心· 2025-08-15 00:05
Core Insights
- The article introduces GeoVLA, a novel framework that integrates 3D information into Vision-Language-Action (VLA) models, enhancing robots' spatial perception and adaptability [3][9][10].

Group 1: Background and Motivation
- The advancement of robotic operations requires intelligent interaction and precise physical control in real-world environments. Recent VLA models have gained attention for their ability to follow instructions and execute actions [7].
- Current VLA models primarily rely on 2D visual inputs, neglecting the rich geometric information inherent in the 3D physical world, which limits their spatial perception capabilities [8].

Group 2: GeoVLA Framework
- GeoVLA employs a visual-language model (VLM) to process images and language instructions, extracting fused visual-language embeddings. It converts depth maps into point clouds and uses a custom point embedding network to generate 3D geometric embeddings [3][10][12].
- The framework consists of three key components: the VLM for general understanding, a point embedding network (PEN) for extracting fine-grained 3D features, and a 3D enhanced action expert (3DAE) for generating action sequences [12][13].

Group 3: Performance Evaluation
- GeoVLA was evaluated on the LIBERO and ManiSkill2 benchmarks, achieving state-of-the-art results. It demonstrated significant robustness in real-world tasks requiring high adaptability and spatial awareness [15][27].
- In LIBERO, GeoVLA achieved an average success rate of 97.7%, outperforming other models like CogACT (93.2%) and OpenVLA-OFT (95.3%) [27].
- In the ManiSkill2 benchmark, GeoVLA achieved a success rate of 77%, surpassing CogACT (69%) and Dita (66%) [27].

Group 4: Ablation Studies
- Ablation studies indicated that the PEN encoder outperformed traditional encoders, achieving a success rate of 97.7% compared to 95.8% for MLP and 95.2% for PointNet [30].
- The use of static routing in the MoE architecture improved performance, demonstrating the effectiveness of the design in leveraging multimodal information [30][20].

Group 5: Real-World Experiments
- Real-world experiments showcased GeoVLA's robustness and generalization capabilities across various 3D manipulation tasks, maintaining high performance despite changes in camera perspective, height, and object size [36][34].
- GeoVLA achieved an average success rate of 86.3% across basic and 3D perception tasks, outperforming other models by significant margins [36].
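The depth-to-point-cloud step mentioned in the framework description is standard pinhole back-projection; the sketch below shows that step only, not GeoVLA's actual code, and the camera intrinsics and fake depth frame are made-up values.

```python
# Back-project a metric depth map into a camera-frame point cloud (pinhole model).
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert an HxW metric depth map into an (N, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels

depth = np.random.uniform(0.3, 1.5, size=(480, 640)).astype(np.float32)  # fake depth frame, metres
cloud = depth_to_point_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)   # the point set a PEN-style point embedding network would consume
```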
Figure's Humanoid Robot Debuts Dexterous-Hand Clothes Folding, Achieved Just by Adding to the Dataset
具身智能之心· 2025-08-15 00:05
Core Viewpoint - Figure's humanoid robot has successfully learned to fold clothes using an end-to-end approach without any architectural changes, showcasing its adaptability and advanced capabilities in handling complex tasks [2][21][28].

Group 1: Robot Capabilities
- The humanoid robot demonstrated its ability to fold towels smoothly, employing precise finger control and real-time adjustments during the process [7][18].
- This task is considered one of the most challenging dexterous operations for humanoid robots due to the variability and unpredictability of clothing shapes [15][16].
- The robot's performance in folding clothes was achieved using the same model and architecture as its previous task of package sorting, with the only change being the dataset used for training [14][28].

Group 2: Helix Architecture
- The Helix architecture, developed after Figure's split from OpenAI, is a unified "visual-language-action" model that allows the robot to perceive, understand, and act like a human [21][22].
- Helix consists of two systems that communicate with each other, enabling the robot to perform various tasks with a single set of neural network weights [22].
- Key components of Helix include visual memory, state history, and force feedback, which enhance the robot's ability to adapt and respond to its environment [23][29].

Group 3: Future Plans
- Figure plans to continue improving the robot's flexibility, speed, and generalization capabilities based on the expansion of real-world data [20].
- The company aims to develop the robot's ability to perform a complete set of household tasks, including washing, folding, and potentially hanging clothes [38].
What Is an Agent? Seeking the True Meaning of "Useful" in Thought, Academia, and Engineering
具身智能之心· 2025-08-15 00:05
When you open your laptop or phone intending to run through a string of tasks, what comes to mind first? AI agents are surely in the top three. An AI agent's capability comes from the coordinated operation of a large model (the brain), memory (a vector database), planning (goal decomposition), and tools (API calls). The idea of the agent lets AI move beyond being a single tool toward a collection of intelligent capabilities with a degree of autonomy. It is not only a product of technology, but also part of humanity's exploration of the nature of intelligence. That exploration brings plenty of surprises: previously, one instruction would get a travel assistant to do exactly one thing, producing an itinerary; now, upgraded into an AI-agent travel butler, it can also book flights, set reminders, and recommend food along the way. Alongside the surprises come shocks, such as booking the wrong flight, or planning a route so convoluted that the agent simply gives up... So in this roundtable we will both review the delights the ever-evolving AI agent brings us and speak frankly about the many pitfalls and frustrations of building agents by hand. This session, "What Is an Agent? Seeking the True Meaning of 'Useful' in Thought, Academia, and Engineering," is another major effort from 具身智能之心! It will begin by ...
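The "large model + memory + planning + tools" loop described above can be sketched in a few lines. In this hedged example, call_llm(), the canned plan string, and both tool functions are hypothetical placeholders rather than any real API; the point is only how planning output is turned into tool calls whose results feed the memory.

```python
# Minimal agent loop: plan with a (placeholder) LLM, execute tools, store results in memory.
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whatever large model the agent uses."""
    return "plan: search_flights -> set_reminder"          # canned response for illustration

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_flights": lambda q: f"3 flights found for '{q}'",
    "set_reminder":   lambda q: f"reminder set: {q}",
}

class TinyAgent:
    def __init__(self):
        self.memory: List[str] = []                          # stand-in for a vector database

    def run(self, goal: str) -> List[str]:
        # Planning: ask the model to decompose the goal into a sequence of tool calls.
        plan = call_llm(f"Decompose into tool calls from {list(TOOLS)}: {goal}")
        steps = [s.strip() for s in plan.split(":")[1].split("->")]
        results = []
        for step in steps:                                   # Acting: run each tool in order
            out = TOOLS[step](goal)
            self.memory.append(out)                          # Memory: keep observations for later turns
            results.append(out)
        return results

print(TinyAgent().run("weekend trip to Chengdu"))
```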
Say Goodbye to Unproductive Research! 1-on-1 Mentoring in Embodied Intelligence Is Open: 3 Mentors to Help You Sprint for Top Conferences
具身智能之心· 2025-08-15 00:05
具身智能之心's 1-on-1 paper mentoring is here! Three slots are currently open in the VLA, reinforcement learning, and sim2real directions, mainly targeting A-tier and B-tier conferences.
Main venues: CVPR, ICCV, ECCV, ICLR, CoRL, ICML, ICRA, etc.
Mentors: active in the embodied-AI research community, with ideas.
Interested students can add WeChat oooops-life to inquire, or scan the QR code directly and note "embodied paper mentoring inquiry". ...