具身智能之心
原力灵机 Proposes ManiAgent: It Can Act, Think, and Collect Data!
具身智能之心· 2025-10-20 10:00
Core Insights
- The article introduces ManiAgent, an innovative agentic framework designed for general robotic manipulation tasks, addressing limitations of existing Vision-Language-Action (VLA) models in complex reasoning and long-term task planning [1][2][26].

Group 1: Framework Overview
- ManiAgent consists of multiple agents that collaboratively handle environment perception, sub-task decomposition, and action generation, enabling efficient responses to complex operational scenarios [2][10].
- The framework employs four key technologies: tool invocation, context engineering, real-time optimization, and automated data collection, creating a complete technical pipeline from perception to action execution [8][12].

Group 2: Performance Metrics
- In the SimplerEnv benchmark tests, ManiAgent achieved a task success rate of 86.8%, while in real-world pick-and-place tasks the success rate reached 95.8% [2][10][28].
- The high success rates indicate that ManiAgent can serve as an effective automated data collection tool, generating training data that can match the performance of models trained on manually annotated datasets [2][10].

Group 3: Methodology
- The framework includes four types of agents:
  1. Scene perception agent, which generates task-relevant scene descriptions using vision-language models [11].
  2. Reasoning agent, which evaluates task states and proposes achievable sub-tasks using large language models [11].
  3. Object-level perception agent, which identifies target objects and extracts detailed information for action generation [11].
  4. Controller agent, which generates executable action sequences based on sub-task descriptions and object details [11].

Group 4: Data Collection and Optimization
- The automated data collection system is designed to operate with minimal human intervention, significantly reducing labor costs while ensuring high-quality data for VLA model training [12][21].
- The framework incorporates a context processing mechanism to enhance task relevance and information effectiveness, alongside a caching mechanism to reduce action-generation delays [12][17].

Group 5: Experimental Results
- In the SimplerEnv simulation environment, tasks demonstrated an average success rate of 86.8%, with specific tasks achieving rates as high as 95.8% [22][28].
- Real-world experiments with the WidowX 250S robotic arm covered a range of tasks and success rates, indicating the framework's versatility across different operational contexts [25][28].
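The four-agent division of labor and the caching mechanism described above can be sketched as a simple pipeline. Everything below (the Blackboard structure, the agent function names, and the toy cup-picking actions) is an illustrative assumption, not ManiAgent's actual interface:

```python
# Hypothetical sketch of a four-agent manipulation pipeline in the spirit of
# ManiAgent; all names, data structures, and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    """Shared context passed between agents, with a cache to cut latency."""
    image: str
    instruction: str
    scene_description: str = ""
    subtask: str = ""
    object_info: dict = field(default_factory=dict)
    action_cache: dict = field(default_factory=dict)

def scene_perception(bb: Blackboard) -> None:
    # A VLM would summarize only the task-relevant parts of the scene.
    bb.scene_description = f"objects visible for task: {bb.instruction}"

def reasoning(bb: Blackboard) -> None:
    # An LLM evaluates the task state and proposes the next achievable sub-task.
    bb.subtask = f"sub-task derived from '{bb.instruction}'"

def object_perception(bb: Blackboard) -> None:
    # Extracts target object details (identity, pose) for action generation.
    bb.object_info = {"target": "cup", "pose": (0.4, 0.1, 0.02)}

def controller(bb: Blackboard) -> list:
    # Cached sub-tasks skip regeneration, mirroring the caching mechanism
    # the article says reduces action-generation delay.
    if bb.subtask in bb.action_cache:
        return bb.action_cache[bb.subtask]
    actions = [("move_to", bb.object_info["pose"]), ("grasp", "target")]
    bb.action_cache[bb.subtask] = actions
    return actions

def run_pipeline(image: str, instruction: str) -> list:
    bb = Blackboard(image=image, instruction=instruction)
    for agent in (scene_perception, reasoning, object_perception):
        agent(bb)
    return controller(bb)
```

The blackboard pattern keeps each agent stateless and swappable, which is one plausible way a perception-reasoning-control chain like this could be wired together.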
The 具身智能之心 Discussion Groups Are Live! Covering VLA, RL, Navigation, Data Collection, and More
具身智能之心· 2025-10-20 10:00
Group 1
- The establishment of a technical exchange group focused on embodied intelligence has been announced, inviting participation from various stakeholders in the field [1]
- The group encompasses nearly 20 sub-directions, indicating a broad scope of interest and expertise within the embodied intelligence domain [1]
- Participants are encouraged to engage in discussions related to humanoid robots, quadrupeds, robotic arms, and various advanced technologies such as VLA, large models, VLN, reinforcement learning, mobile operations, multi-modal perception, simulation, and data collection [1]
Our Embodied Intelligence Community Has Recently Added Many New Modules
具身智能之心· 2025-10-20 03:29
With many new modules added, our embodied intelligence community is now even more complete! Throughout September and October we have been expanding the community's sections, focusing on new coverage of VLA, real2sim2real, mobile manipulation, world models, and domain adaptation, along with many high-quality live streams. Beyond that, we are currently publishing open-source solutions and hardware for members, and we plan to share builds based on them so that every student can complete their own project. After nearly a year of development, the community has built out sections for technical roadmaps, live streams, Q&A, job hunting, and competitions, closing the loop across industry, academia, careers, and peer discussion.
1) Ongoing live streams: the community hosts many round-table forums and live sessions, covering everything from robot hardware and data to algorithms, to share what is actually happening in the embodied intelligence industry and which problems remain open.
2) Complete technical roadmaps: for newcomers, we have compiled many beginner-friendly tech stacks and learning paths.
3) Industry and project solutions: for those already doing related research, we also provide many valuable industrial frameworks and project solutions.
4) Referrals and job hunting: the community has established referral channels with multiple embodied intelligence companies; feel free to tag us at any time and we will deliver your resume to your target company right away.
Even better: whether you are looking for benchmarks, surveys, or beginner learning paths, the community greatly shortens your search time. ...
The MuJoCo Tutorial Is Here! From Zero Basics to Reinforcement Learning to sim2real
具身智能之心· 2025-10-20 00:03
Core Insights
- The article emphasizes that the field of AI is at a pivotal moment, transitioning from early symbolic reasoning to deep learning breakthroughs and now to the rise of embodied intelligence, which is redefining human-machine relationships [1][3].

Group 1: Embodied Intelligence
- Embodied intelligence is characterized by machines that can understand language commands, navigate complex environments, and make intelligent decisions in real time, moving beyond the realm of virtual space [1].
- Major tech companies like Tesla, Boston Dynamics, OpenAI, and Google are actively developing technologies in this disruptive field, indicating a competitive landscape [1][3].
- The potential impact of embodied intelligence spans various industries, including manufacturing, healthcare, and space exploration, suggesting a transformative effect on the economy and society [1].

Group 2: Technical Challenges and Solutions
- Achieving true embodied intelligence presents unprecedented technical challenges, requiring advancements in algorithms, physical simulation, robot control, and perception fusion [3].
- MuJoCo (Multi-Joint dynamics with Contact) is highlighted as a critical technology for embodied intelligence, serving as a high-fidelity simulation engine that connects virtual and real-world environments [4][6].
- MuJoCo allows researchers to conduct millions of trials in a simulated environment, significantly accelerating the learning process while minimizing risks associated with physical hardware [6][8].

Group 3: MuJoCo's Advantages
- MuJoCo's advanced contact dynamics algorithms enable precise simulation of complex interactions between robots and their environments, making it a standard tool in both academia and industry [4][8].
- The engine supports high parallelization, allowing thousands of simulations to run simultaneously, which enhances efficiency in training AI systems [4][6].
- The technology's stability and numerical accuracy ensure reliable long-term simulations, making it a preferred choice for leading tech companies [4][6].

Group 4: Educational Initiatives
- A comprehensive MuJoCo development tutorial has been created, focusing on practical applications and theoretical foundations within the context of embodied intelligence [9][11].
- The course is structured into six modules, each with specific learning objectives and practical projects, ensuring a thorough understanding of the technology stack [15][17].
- Participants will engage in hands-on projects that cover a range of applications, from basic robotic arm control to complex multi-agent systems, fostering both theoretical knowledge and practical skills [19][29].

Group 5: Target Audience and Outcomes
- The course is designed for individuals with programming or algorithm backgrounds looking to enter the field of embodied robotics, as well as students and professionals seeking to enhance their practical capabilities [32][33].
- Upon completion, participants will possess a complete skill set in embodied intelligence, including proficiency in MuJoCo, reinforcement learning, and real-world application of simulation techniques [32][33].
- The program aims to cultivate a combination of technical, engineering, and innovative skills, preparing participants to tackle complex problems in the field [33].
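For readers new to the subject of the tutorial above, a MuJoCo simulation starts from a model described in MJCF XML. The fragment below is a minimal, illustrative single-pendulum model (the names and values are arbitrary examples, not material from the course):

```xml
<mujoco model="pendulum">
  <!-- Integrator step size and gravity for the whole scene -->
  <option timestep="0.002" gravity="0 0 -9.81"/>
  <worldbody>
    <body name="pole" pos="0 0 1">
      <!-- A single hinge joint about the y-axis -->
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <!-- A capsule geom hanging below the joint -->
      <geom type="capsule" fromto="0 0 0  0 0 -0.5" size="0.02" mass="1"/>
    </body>
  </worldbody>
  <actuator>
    <!-- A direct torque motor on the hinge -->
    <motor joint="hinge" gear="1"/>
  </actuator>
</mujoco>
```

Stepping such a model in a tight loop, thousands of instances in parallel, is what makes the trial volumes described above practical.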
Stable Training, High Data Efficiency: Tsinghua University Proposes SAC Flow, a New Reinforcement Learning Method for Flow-Based Policies
具身智能之心· 2025-10-20 00:03
Core Viewpoint
- The article introduces a new approach called SAC Flow, which utilizes a high data efficiency reinforcement learning algorithm to train flow-based policies end-to-end without the need for alternative objectives or policy distillation. The method achieves high data efficiency and state-of-the-art performance on various benchmarks [1][4][20].

Group 1: Research Background
- Flow-based policies are gaining popularity in robotic learning due to their ability to model multi-modal action distributions and their simplicity compared to diffusion strategies. They are widely used in advanced VLA models [4].
- Previous attempts to train flow policies using off-policy reinforcement learning (RL) often faced issues such as gradient explosion due to the multi-step sampling process inherent in flow policies [4][5].

Group 2: Methodology
- The proposed SAC Flow treats flow policies as sequential models, allowing the use of modern recurrent structures like GRU and Transformer to stabilize training and optimize flow policies directly within an off-policy framework [7][10].
- SAC Flow incorporates Gaussian noise and drift correction in each rollout to ensure the final action distribution remains unchanged, allowing the actor/critic loss to be expressed using the log-likelihood of multi-step sampling from the flow policy [14].

Group 3: Training Paradigms
- Two training paradigms are supported:
  - From-scratch training for dense-reward tasks, where SAC Flow can be trained directly [18].
  - Offline-to-online training for sparse-reward tasks, where pre-training on a dataset is followed by online fine-tuning [18][20].

Group 4: Experimental Results
- SAC Flow-T and Flow-G demonstrated stable and faster convergence in environments like Hopper, Walker2D, and Ant, achieving state-of-the-art performance [20][21].
- The offline-to-online training results showed that SAC Flow maintains stable gradients and prevents gradient explosion, leading to superior performance compared to naive SAC training [24][26].

Group 5: Comparison with Similar Works
- SAC Flow outperforms existing methods like FlowRL and diffusion strategies in terms of convergence speed and efficiency, particularly in challenging sparse-reward tasks [30][31].
- The method retains the modeling capabilities of flow policies without the need for distillation into single-step models, which is a common approach in other methods [31].

Group 6: Key Takeaways
- The key attributes of SAC Flow are serialization, stable training, and data efficiency, enabling the direct use of off-policy RL algorithms to train flow policies effectively [32].
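The core idea of treating a multi-step flow rollout as a sequential model can be sketched with a toy velocity field. The field, step count, and noise scale below are illustrative assumptions, not the paper's parameterization:

```python
import math
import random

def velocity(action, state, t):
    """Toy linear velocity field v(a, s, t) pulling the action toward the
    state; a real flow policy would use a learned network (SAC Flow views
    it as a GRU/Transformer cell) here."""
    return [(s - a) * (1.0 - t) for a, s in zip(action, state)]

def flow_rollout(state, steps=8, noise_std=0.1, seed=0):
    """Euler-integrate the flow ODE over `steps` sub-steps, injecting
    Gaussian noise at each sub-step, echoing the noise-augmented rollout
    the article says keeps the multi-step log-likelihood tractable."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    action = [rng.gauss(0.0, 1.0) for _ in state]  # a_0 ~ N(0, I)
    for k in range(steps):
        t = k * dt
        v = velocity(action, state, t)
        action = [a + v_i * dt + noise_std * math.sqrt(dt) * rng.gauss(0, 1)
                  for a, v_i in zip(action, v)]
    return action
```

Each sub-step is a recurrence a_{k+1} = f(a_k, s, t_k), which is why sequence-model machinery helps stabilize gradients backpropagated through the unrolled chain.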
Open-Source Hardware and Solutions for Mobile Manipulation and Dual-Arm Manipulation
具身智能之心· 2025-10-20 00:03
Core Viewpoint
- The article emphasizes the importance of open-source projects in advancing mobile and dual-arm robotic operations, highlighting their role in breaking down technical barriers and accelerating innovation in various applications, from household robots to industrial automation [3].

Group 1: Open-Source Projects Overview
- XLeRobot, developed by Nanyang Technological University, focuses on flexible movement and precise operation in complex environments, providing a reference framework for mobile and dual-arm control [4].
- AhaRobot from Tianjin University emphasizes autonomy and environmental adaptability in dual-arm operations, integrating perception, planning, and control modules for service robots [6].
- ManiGaussian++, released by Tsinghua University, optimizes dual-arm operation accuracy using Gaussian models, particularly in 3D environment perception and motion planning [8].
- H-RDT, a collaboration between Tsinghua University and Horizon Robotics, aims at efficient decision-making and real-time operations for mobile robots in various settings [11].
- RoboTwin 2.0, developed by Shanghai Jiao Tong University and the University of Hong Kong, integrates simulation and physical platforms for mobile and dual-arm operations [14].
- Open X-Embodiment, from Arizona State University, focuses on a generalized learning framework for robotic operations, supporting cross-scenario skill transfer [16].
- 3D FlowMatch Actor, a joint project by Carnegie Mellon University and NVIDIA, enhances dynamic adaptability in 3D space for mobile and dual-arm operations [19].
- OmniH2O, developed by Carnegie Mellon University, focuses on human-robot action mapping and humanoid operation, facilitating remote control and action teaching [24].
- TidyBot++, a collaboration between Princeton University and Stanford University, targets household organization tasks, integrating object recognition and dual-arm collaboration algorithms [27].
- robosuite, from the University of California, Berkeley, is a mature simulation platform for robotic operations, providing standardized tasks and evaluation tools [29].
- SO-ARM100, a standardized dual-arm operation hardware and software solution, aims to lower development barriers for educational and research purposes [32].
- GOAT, developed by UIUC and CMU, focuses on goal-directed movement and operation for robots, emphasizing robustness and versatility [34].
- Mobile ALOHA, from Stanford University, combines a mobile chassis with dual-arm operations for low-cost, easily deployable service robots [35].
Flexibly Handle Diverse Objects with Only a Few Demonstrations! Feng Qian's Team at 阿米奥 Presents a Low-Cost, Precise Dexterous Manipulation Solution at IROS!
具身智能之心· 2025-10-20 00:03
Let's start with a video.
★ The first author of this work is Feng Qian, co-founder and head of technology at 阿米奥; he completed both his master's and PhD at the Technical University of Munich under robotics luminary Alois Knoll, and was an early employee and research scientist at 思灵机器人 (Agile Robots). At IROS 2025, Dr. Feng will present this work at the Deep Learning in Grasping and Manipulation workshop.
Research progress in robotic dexterous manipulation
Pain points: robotic dexterous manipulation (such as multi-finger grasping) is key to realizing human-like robots, but existing solutions suffer from three core problems:
When a robot faces an unfamiliar object, how can it grasp precisely from only a few demonstrations and a single-view observation? LensDFF, from 阿米奥 robotics, offers a disruptive answer: it departs from the conventional reliance on multi-view data and extra alignment-network training, instead using language features as "semantic anchors" to align CLIP's 2D visual features into 3D space via a dynamic projection formula, resolving cross-view feature inconsistency at its root, with no fine-tuning required.
More importantly, it folds five grasp primitives (pinch / hook / tripod, etc.) into few-shot demonstrations, and combines normal-vector-guided initialization with low-dimensional eigengrasp optimization, enabling the DLR-HIT dexterous hand to ...
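Aligning 2D visual features into 3D space can be illustrated with standard pinhole back-projection. This is not LensDFF's actual dynamic projection formula (the article does not spell it out); it is only the generic lifting step such methods build on, with all parameter values hypothetical:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth into a 3D camera-frame point
    using pinhole intrinsics (fx, fy, cx, cy). Shown only to illustrate
    how per-pixel 2D features (e.g., from CLIP) can be anchored to 3D
    points; the paper's dynamic projection may differ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_features(pixels, feats, depths, intrinsics):
    """Attach each pixel's feature vector to its back-projected 3D point,
    yielding a sparse 3D feature field from a single view."""
    fx, fy, cx, cy = intrinsics
    return [(backproject(u, v, d, fx, fy, cx, cy), f)
            for (u, v), f, d in zip(pixels, feats, depths)]
```

Because the lift is purely geometric, no alignment network has to be trained, which matches the "no fine-tuning" property claimed above.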
An End-to-End Foundation Model! VCoT-Grasp: A Large Robotic Grasp Detection Model Enhanced with Visual Chain-of-Thought
具身智能之心· 2025-10-19 13:50
Edited by 具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.
Chain-of-Thought (CoT) is a method that enhances the reasoning ability of large language models through intermediate thinking steps. Visual Chain-of-Thought (VCoT) extends CoT from the text modality to the image modality, using images as intermediate thinking steps to improve the reasoning ability of multimodal large models.
(a) methods based on multimodal fusion; (b) modular methods that use an LLM/VLM to provide guidance; (c) end-to-end multimodal large-model methods with language reasoning; (d) our method, which introduces visual reasoning and uses the target's bounding-box image as the thinking step.
VCoT-Grasp builds an end-to-end foundation model and introduces visual chain-of-thought to enhance visual understanding. At inference time, the model uses the target object's bounding-box image as an intermediate thinking step: it first predicts the target's bounding box as a coarse location, then the image of the target region is cropped and fed into the model to provide fine-grained ...
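The coarse-to-fine loop described above (predict a bounding box, crop it, then refine on the crop) can be sketched as follows; the nested-list image representation and the two stand-in prediction functions are illustrative assumptions, not the model's actual interface:

```python
def crop(image, bbox):
    """Crop a nested-list image to bbox = (x0, y0, x1, y1), ends exclusive.
    Stands in for the 'crop the predicted box and feed it back' step; a
    real pipeline would operate on image tensors."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]

def vcot_grasp_step(image, predict_bbox, refine):
    """Coarse-to-fine pass: the bounding-box prediction is the visual
    'thought', and the cropped region supplies fine-grained detail for
    the final grasp output. `predict_bbox` and `refine` are hypothetical
    stand-ins for the model's two passes."""
    bbox = predict_bbox(image)   # coarse localization
    region = crop(image, bbox)   # intermediate visual thinking step
    return refine(region)        # fine-grained result from the crop
```

The crop acts like an attention bottleneck: the second pass only sees pixels the first pass judged relevant.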
ROSCon China 2025 Conference Agenda Announced!
具身智能之心· 2025-10-18 16:03
The following article is from 古月居 (Guyuehome), a professional ROS robotics knowledge community and industry service platform. Author: 古月居. Edited by 古月居.
The robotics "technology feast" of 2025 has a date: ROSCon China 2025 opens October 31 to November 1 at the Sofitel Shanghai Hongqiao (上海虹桥新华联索菲特大酒店), and the long-awaited full agenda has now been officially released!
Whether you want to follow the ROS technology frontier or solve engineering deployment problems, this conference has you covered: over two days, core developers, industry leaders, and senior engineering teams will gather on site, delivering content spanning technical depth to practical deployment.
Who should attend? These groups should grab a seat:
• Robot developers: get the latest ROS technology updates and resolve bottleneck problems in development;
• Corporate technical leads: connect with industrial deployment cases and find technical solutions suited to your own business;
• University researchers: link up with industry resources so research results reach real applications faster;
• Robotics enthusiasts: get close to cutting-edge technology and broaden your view of the industry.
The agenda is now public and seats are limited ...
HKUST(GZ) and Tsinghua Jointly Propose Spatial Forcing: Implicit Spatial Alignment That Outperforms Mainstream 2D/3D VLA Models
具身智能之心· 2025-10-18 16:03
Core Insights
- The article discusses the limitations of current Vision-Language-Action (VLA) models that primarily rely on 2D visual data, lacking a deep understanding of real 3D space, which hampers their ability to perform tasks in the physical world [2][4]
- The proposed method, Spatial Forcing (SF), allows VLA models to develop spatial understanding without explicit 3D input by aligning visual features with a powerful 3D geometric representation generated by an external model [2][10]

Methodology
- The SF method employs an implicit spatial alignment strategy, enabling the model to autonomously acquire spatial understanding during training without the need for additional 3D sensors [2][13]
- A depth probing experiment was conducted to verify the presence of 3D information in the original VLA's visual features, revealing that without 3D input, the model cannot form accurate spatial perceptions [11][13]
- The training process involves aligning the VLA model's visual tokens with pixel-level spatial representations extracted from a pre-trained 3D model, optimizing both a spatial alignment loss and an action generation loss [16]

Performance Results
- The SF method significantly outperforms existing 2D and 3D VLA models in various tasks, achieving a training efficiency improvement of up to 3.8 times and a data utilization efficiency increase of up to 5.9 times [14]
- In experiments, the Spatial Forcing model achieved a success rate of 99.4% in spatial tasks, 99.6% in object tasks, and 98.8% in goal tasks, demonstrating its superior performance compared to other models [18]
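The joint objective described above, an action generation loss plus a spatial alignment loss over visual tokens, can be sketched with the alignment term written as mean cosine distance. The exact loss is not given in the article, so this form and the weighting are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def alignment_loss(visual_tokens, geo_features):
    """Mean (1 - cosine similarity) between each VLA visual token and the
    corresponding pixel-level feature from a frozen pre-trained 3D model.
    A plausible form of the alignment term, not the paper's exact loss."""
    return sum(1.0 - cosine(v, g)
               for v, g in zip(visual_tokens, geo_features)) / len(visual_tokens)

def total_loss(action_loss, visual_tokens, geo_features, weight=0.5):
    """Joint objective: action generation loss plus a weighted alignment
    term; `weight` is a hypothetical hyperparameter."""
    return action_loss + weight * alignment_loss(visual_tokens, geo_features)
```

Because only the loss is added, the VLA needs no 3D sensor at deployment: the 3D model supervises training and is dropped afterwards, consistent with the "implicit" alignment described above.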