具身智能之心
A Roundup of Embodied Object-Goal Navigation, Vision-and-Language Navigation, and Point-Goal Navigation Work!
具身智能之心· 2025-08-12 07:04
Recently, several readers have asked us about work on embodied navigation, so today we lay out the research lines and methodologies developed over the past few years — worth bookmarking.
Point-Goal Navigation Work
Comparison of Model-Free and Model-Based Learning-Informed Planning for PointGoal Navigation
RobustNav: Towards Benchmarking Robustness in Embodied Navigation
Venue/Year: CoRL, 2022
Paper: https://openreview.net/pdf?id=2s92OhjT4L
Code: https://github.com/yimengli46/bellman_point_goal
Project page: ht ...
New from CMU! Cross-Embodiment World Models Boost Few-Shot Robot Learning
具身智能之心· 2025-08-12 00:03
Training visuomotor policies by imitation learning has proven effective in many robotics domains. However, the performance of these policies depends heavily on the number of training demonstrations, which requires costly real-world data collection. The goal of this work is to reduce the data-collection effort when training visuomotor robot policies by leveraging off-the-shelf or low-cost data from a variety of embodiments, such as public robot datasets and datasets of humans manipulating objects. The approach builds on two key insights:
- Embodiment-agnostic world-model pretraining: optic flow is used as an embodiment-agnostic action representation to pretrain a world model (WM) on datasets spanning multiple embodiments; the model is then finetuned with only a small amount of robot data from the target embodiment.
- Latent Policy Steering (LPS): a method named Latent Policy Steering (LPS) is proposed, which steers the policy by ... in the wor ...
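To make the two-stage recipe concrete, here is a minimal, hypothetical PyTorch sketch of embodiment-agnostic world-model pretraining: optic flow stands in for actions, and the same objective is reused when finetuning on a small target-embodiment dataset. The module layout, dimensions, and flow source are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch (not the paper's code): pretrain a world model that predicts the next
# visual latent conditioned on optic flow (an embodiment-agnostic "action"), then
# finetune the same objective on a small target-embodiment robot dataset.
import torch
import torch.nn as nn

class FlowConditionedWorldModel(nn.Module):
    def __init__(self, latent_dim: int = 256, flow_dim: int = 128):
        super().__init__()
        self.obs_encoder = nn.Sequential(            # image -> latent
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.flow_encoder = nn.Sequential(            # 2-channel optic flow -> embedding
            nn.Conv2d(2, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, flow_dim),
        )
        self.dynamics = nn.Sequential(                # (latent, flow) -> next latent
            nn.Linear(latent_dim + flow_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, obs, flow):
        z = self.obs_encoder(obs)
        a = self.flow_encoder(flow)
        return self.dynamics(torch.cat([z, a], dim=-1))

def world_model_loss(model, obs_t, flow_t, obs_t1):
    """Predict the next frame's latent from the current frame and its optic flow."""
    with torch.no_grad():
        target = model.obs_encoder(obs_t1)            # stop-gradient target latent
    pred = model(obs_t, flow_t)
    return nn.functional.mse_loss(pred, target)

# Usage: pretrain on mixed-embodiment clips, then finetune the same loss on a small
# target-robot dataset (typically with a lower learning rate).
model = FlowConditionedWorldModel()
obs_t, obs_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
flow_t = torch.randn(8, 2, 64, 64)                    # e.g. precomputed flow, resized
loss = world_model_loss(model, obs_t, flow_t, obs_t1)
loss.backward()
```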
Probing the Root Cause of Embodied Robots' Limited Generalization! Augmentation Strategies Remain Effective
具身智能之心· 2025-08-12 00:03
Authors: Youguang Xing et al.
Research background and core problem
In recent years, with the growth of large-scale robot datasets (such as Open X-Embodiment/OXE) and high-capacity models, generalist robot policies have shown strong capabilities across many tasks. However, these policies still generalize poorly to scenarios outside the training-data distribution. Interestingly, this limitation cannot simply be attributed to insufficient data: OXE contains over one million trajectories, far exceeding the scale of typical vision-language model training sets. The researchers find that shortcut learning — models relying on task-irrelevant features rather than genuine causal relationships — is a key factor limiting generalization. As shown in Figure 1, in the SIMPLER environment, several generalist robot policies trained on OXE, when asked to "put the spoon on the towel," consistently execute "pick up the coke can," a task that appears only in the RT-1 sub-dataset. This suggests the models have learned spurious corr ...
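As a concrete illustration of the kind of augmentation strategy the headline alludes to, the sketch below randomizes observation appearance (framing, color, blur) while leaving instructions and actions untouched, so that viewpoint and background cues become less predictive of the action. The specific transforms and magnitudes are assumptions, not the authors' exact recipe.

```python
# Hedged sketch (not the paper's recipe): weaken shortcut cues such as viewpoint,
# background, and color statistics with aggressive appearance augmentation of the
# observations; language instructions and action labels are left unchanged.
import torch
from torchvision import transforms

shortcut_breaking_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # vary apparent framing/viewpoint
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),            # decouple color/background statistics
    transforms.RandomGrayscale(p=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

def augment_batch(obs: torch.Tensor) -> torch.Tensor:
    """obs: (B, 3, H, W) float images in [0, 1]; returns augmented observations."""
    return torch.stack([shortcut_breaking_aug(img) for img in obs])

obs = torch.rand(4, 3, 256, 256)
aug_obs = augment_batch(obs)   # feed to the policy in place of the raw observations
```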
Robot Context Protocol Open-Sourced for the First Time: Alibaba DAMO Academy Releases an Embodied-AI "Trio" in One Go
具身智能之心· 2025-08-12 00:03
Editor: 机器之心
On August 11, at the World Robot Conference, Alibaba DAMO Academy announced the open-sourcing of its self-developed VLA model RynnVLA-001-7B, the world-understanding model RynnEC, and the Robot Context Protocol RynnRCP, promoting compatibility among data, models, and robots and opening up the full embodied-AI development pipeline.
Open-source links:
- Robot Context Protocol RynnRCP: https://github.com/alibaba-damo-academy/RynnRCP
- Vision-Language-Action model RynnVLA-001: https://github.com/alibaba-damo-academy/RynnVLA-001
- World-understanding model RynnEC: https://github.com/alibaba-damo-academy/RynnEC
Embodied AI is advancing rapidly, but it still faces major challenges such as a fragmented development pipeline and the difficulty of adapting data and models to robot hardware. DAMO Academy takes MCP (Model Context Pr ...
The 具身智能之心 technical discussion group has been created!
具身智能之心· 2025-08-11 06:01
The 具身智能之心 technical discussion group has been created! It mainly covers VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, object-goal navigation, mapping and localization, navigation, and related directions. Interested readers can add the assistant on WeChat (AIDriver005) to be invited into the community. Note: include the remark "institution/school + name + research direction" to get in quickly! ...
Looking for a few data-collection experts to build something together......
具身智能之心· 2025-08-11 06:01
具身智能之心 is recruiting 3 data-collection experts in China or abroad, with main research directions in teleoperation, AR, and full-body motion capture. At least 1 year in a related research direction is required; candidates working at embodied-AI companies, and those with a doctorate or above (including current PhD students), are preferred.
Responsibilities
Work with us on embodied data-collection project development, course development, and more.
Contact us
For details on compensation and the work itself, add the lead's WeChat: oooops-life.
Requirements ...
China's first full-stack embodied-AI learning community is here!
具身智能之心· 2025-08-11 06:01
Core Insights
- The article emphasizes the value of a community that provides solutions to problems in the field of embodied intelligence, highlighting the importance of knowledge sharing and collaboration among members [3][16].
Group 1: Community and Resources
- The community has established a closed loop across various fields including industry, academia, job seeking, and Q&A exchanges, providing timely solutions and job opportunities [3][4].
- A comprehensive list of over 30 technical routes has been compiled, aiding members in finding benchmarks, reviews, and learning paths efficiently [4][16].
- The community invites industry experts to answer questions and share insights, enhancing the learning experience for members [4][17].
Group 2: Educational Support
- Resources for beginners include curated technical stacks and learning paths to facilitate entry into the field of embodied intelligence [11][16].
- For those already engaged in research, valuable industry frameworks and project proposals are provided to support further development [13][16].
Group 3: Job Opportunities and Networking
- The community has established a job referral mechanism with multiple companies in the embodied intelligence sector, ensuring members can connect with potential employers [10][17].
- Members are encouraged to engage in discussions about career choices and research directions, fostering a supportive environment for professional growth [79][83].
Group 4: Research and Development
- The community has compiled a wealth of resources including open-source projects, datasets, and simulation platforms relevant to embodied intelligence, facilitating research and development efforts [16][30][36].
- A focus on various research directions such as visual language navigation, reinforcement learning, and multi-modal models is evident, indicating the community's commitment to staying at the forefront of technological advancements [20][58][70].
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article discusses the development of Genie Envisioner, a unified world foundation platform for robotic manipulation, which integrates strategy learning, evaluation, and simulation through a single video generation framework [3][27].
Group 1: Platform Overview
- Genie Envisioner is built on a core component called GE-Base, which captures the spatial, temporal, and semantic dynamics of robot interactions [5][27].
- The platform includes GE-Act, a world action model that enables instruction-conditioned strategy reasoning, and GE-Sim, a video world simulator that supports closed-loop execution [6][21].
Group 2: Key Components
- GE-Base is a large-scale video diffusion model that accurately captures real-world robot interaction features in a structured latent space [3][27].
- GE-Act utilizes a lightweight decoder with 160 million parameters to provide real-time control capabilities, achieving less than 10ms latency for diverse robotic tasks [15][27].
- GE-Sim constructs a high-fidelity environment for closed-loop strategy development, enhancing the framework's capabilities [21][27].
Group 3: Evaluation Framework
- EWMBench is introduced as a standardized evaluation suite to assess the fidelity and utility of video-based world models in real-world robotic operations [23][27].
- The evaluation focuses on visual scene consistency, motion correctness, and semantic alignment, ensuring rigorous assessment of task-oriented scenarios [23][27].
Group 4: Training and Adaptation
- The training process for GE-Base involves a large dataset with 1 million instruction-aligned video sequences, enabling robust model performance [11][27].
- GE-Act employs a three-phase training strategy to derive action strategies from the GE-Base model, optimizing for specific tasks and environments [17][19][27].
Group 5: Performance and Contributions
- The integration of GE-Base, GE-Act, and GE-Sim has demonstrated superior performance in complex tasks such as fabric folding and packing, showcasing strong generalization capabilities [27].
- The platform establishes a powerful foundation for building general-purpose, instruction-driven embodied intelligence systems [27].
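A rough, hypothetical sketch of the GE-Act/GE-Sim interplay described above: a small action decoder reads latents from a (frozen) video world model and emits an action chunk, and the simulator steps the latent state to close the loop. Class names, dimensions, and the simulator interface are illustrative assumptions, not the released code.

```python
# Hedged sketch (not Genie Envisioner's code): lightweight action decoder on frozen
# world-model latents, rolled out in a closed loop against a learned simulator.
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Small head: (world-model latent, instruction embedding) -> chunk of future actions."""
    def __init__(self, latent_dim=1024, text_dim=512, action_dim=7, chunk=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim * chunk),
        )
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, latent, text_emb):
        out = self.net(torch.cat([latent, text_emb], dim=-1))
        return out.view(-1, self.chunk, self.action_dim)

def closed_loop_rollout(world_model, decoder, text_emb, latent, steps=4):
    """Alternate decoding an action chunk and stepping the learned simulator."""
    trajectory = []
    for _ in range(steps):
        actions = decoder(latent, text_emb)        # (B, chunk, action_dim)
        latent = world_model(latent, actions)      # simulator predicts the next latent state
        trajectory.append(actions)
    return torch.cat(trajectory, dim=1)

# Stand-in simulator for illustration only: a toy latent-dynamics function.
world_model = lambda z, a: z + 0.01 * torch.tanh(a.mean(dim=1) @ torch.randn(7, 1024))
decoder = ActionDecoder()
z0, instr = torch.randn(2, 1024), torch.randn(2, 512)
traj = closed_loop_rollout(world_model, decoder, instr, z0)
print(traj.shape)   # torch.Size([2, 64, 7])
```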
China's First Full-Stack Hands-On Tutorial on Embodied "Brain + Cerebellum" Algorithms
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The exploration towards Artificial General Intelligence (AGI) highlights embodied intelligence as a key direction, focusing on the interaction and adaptation of intelligent agents within physical environments [1][6].
Industry Analysis
- In the past two years, numerous star teams in the field of embodied intelligence have emerged, establishing valuable companies such as Xinghaitu, Galaxy General, and Zhujidongli, driving advancements in embodied brain and cerebellum technologies [3].
- Major domestic companies like Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build an ecosystem for embodied intelligence, while international firms like Tesla and investment institutions in the U.S. are supporting companies like Wayve and Apptronik in autonomous driving and warehouse robotics [5].
Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp pose detection, which struggled with complex tasks due to a lack of context modeling [6].
  - The second stage involved behavior cloning, allowing robots to learn from expert demonstrations but revealing weaknesses in generalization and performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, enhancing stability and generalization in task execution through sequence modeling [7].
  - The fourth stage, emerging in 2025, explores the integration of VLA models with reinforcement learning and tactile sensing, aiming to overcome limitations in feedback and future prediction capabilities [8].
Product and Market Development
- The evolution from grasp pose detection to behavior cloning and advanced VLA models signifies a shift towards intelligent agents capable of performing complex tasks in open environments, leading to a surge in product development across various sectors such as industrial, home, dining, and healthcare [9].
- The demand for engineering and system capabilities is increasing as the industry transitions from research to deployment, necessitating higher engineering standards [12].
Educational Initiatives
- A comprehensive curriculum has been developed to assist learners in mastering the full spectrum of embodied intelligence algorithms, covering topics from basic tasks to advanced models like VLA and its integrations [9][12].
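For readers unfamiliar with the third-stage methods mentioned above, here is a minimal, generic Diffusion Policy-style training step: noise is added to expert action chunks and a conditional network is trained to predict that noise. Dimensions and the noise schedule are assumptions for illustration, not any particular course's or paper's implementation.

```python
# Hedged sketch: one denoising-training step of a Diffusion Policy-style action head.
import torch
import torch.nn as nn

T = 100                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    def __init__(self, obs_dim=256, action_dim=7, chunk=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim * chunk + 1, 512), nn.Mish(),
            nn.Linear(512, action_dim * chunk),
        )
        self.shape = (chunk, action_dim)

    def forward(self, noisy_actions, obs_emb, t):
        x = torch.cat([noisy_actions.flatten(1), obs_emb, t.float().unsqueeze(-1) / T], dim=-1)
        return self.net(x).view(-1, *self.shape)

model = NoisePredictor()
obs_emb = torch.randn(32, 256)             # output of a visual/state encoder (assumed)
actions = torch.randn(32, 8, 7)            # expert action chunks from demonstrations

t = torch.randint(0, T, (32,))
noise = torch.randn_like(actions)
a_bar = alphas_cumprod[t].view(-1, 1, 1)
noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise   # forward diffusion
loss = nn.functional.mse_loss(model(noisy, obs_emb, t), noise)
loss.backward()
```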
A Look at DreamVLA: Letting Robots Look First, Think, Then Act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].
Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models directly map visual inputs and language commands to actions, which can lead to interference from irrelevant information in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic areas, depth information, and semantic features before planning actions [5][27].
Group 2: Model Architecture and Functionality
- DreamVLA operates on a "perception-prediction-action" cycle, treating the task as an inverse dynamics problem to derive necessary actions from predicted future states [7][27].
- The model processes three types of inputs: visual images, language commands, and the robot's own state, using dedicated encoders for each [10][14].
Group 3: World Knowledge Prediction
- DreamVLA predicts world knowledge, which includes dynamic areas, depth maps, and semantic features, rather than directly predicting actions [11][18].
- Dynamic area prediction utilizes CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist in obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM models to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].
Group 4: Action Generation
- The action generation component uses a diffusion Transformer to produce future action sequences based on the latent action embedding derived from multi-modal inputs [23][27].
- A structured attention mechanism is implemented to ensure coherent multi-step action reasoning and prevent cross-modal knowledge leakage [19][31].
Group 5: Performance and Validation
- DreamVLA achieved an average task completion length of 4.44 in the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, with a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of various components, demonstrating the model's robustness and generalization capabilities [25][31].
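As a concrete reading of the inverse-dynamics framing in Group 2, the hypothetical sketch below fuses predicted dynamic-region, depth, and semantic embeddings into a single "world embedding" and then infers the action that links the current state to that predicted future. All module names and dimensions are assumptions, not DreamVLA's actual architecture.

```python
# Hedged sketch (not the DreamVLA code): inverse-dynamics head over predicted world knowledge.
import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    def __init__(self, dim=512, action_dim=7):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)           # dynamic + depth + semantic -> world embedding
        self.head = nn.Sequential(
            nn.Linear(2 * dim, 512), nn.GELU(),       # (current state, predicted future)
            nn.Linear(512, action_dim),
        )

    def forward(self, state_emb, dyn_emb, depth_emb, sem_emb):
        world_emb = self.fuse(torch.cat([dyn_emb, depth_emb, sem_emb], dim=-1))
        return self.head(torch.cat([state_emb, world_emb], dim=-1))

head = InverseDynamicsHead()
state = torch.randn(4, 512)                                   # current observation/proprioception embedding
dyn, depth, sem = (torch.randn(4, 512) for _ in range(3))     # predicted world-knowledge embeddings
action = head(state, dyn, depth, sem)                         # (4, 7): action implied by the prediction
```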