Kitchen-R: A Mobile Manipulation Robot Benchmark for Joint Evaluation of High-Level Task Planning and Low-Level Control
具身智能之心· 2025-08-25 00:04
Core Viewpoint
- The article introduces the Kitchen-R benchmark, a unified evaluation framework for task planning and low-level control in embodied AI, addressing the fragmentation of existing benchmarks [4][6][8].

Group 1: Importance of Benchmarks
- Benchmarks are crucial in fields such as natural language processing and computer vision for assessing model progress [7].
- In robotics, simulator-based benchmarks such as Behavior-1K are common, providing both model evaluation and training capabilities [7].

Group 2: Issues with Existing Benchmarks
- Current benchmarks for high-level language instruction and low-level robot control are fragmented, leading to incomplete assessments of integrated systems [8][9].
- High-level benchmarks often assume perfect execution of atomic tasks, while low-level benchmarks rely on simple single-step instructions [9].

Group 3: Kitchen-R Benchmark Features
- Kitchen-R fills a critical gap in embodied AI research by providing a comprehensive testing platform that closely simulates real-world scenarios [6][8].
- It includes a digital-twin kitchen environment and over 500 language instructions, and supports the Mobile ALOHA robot [9][10].
- The benchmark supports three evaluation modes: independent evaluation of the planning module, independent evaluation of the control policy, and, critically, full-system integration evaluation [9][10].

Group 4: Evaluation Metrics
- Kitchen-R provides both offline independent-evaluation metrics and online joint-evaluation metrics to measure system performance comprehensively [16][20].
- Key metrics include Exact Match (EM) for task-planning accuracy and Mean Squared Error (MSE) for trajectory-prediction accuracy; a minimal sketch of both appears after this summary [20][21].

Group 5: Baseline Methods
- Kitchen-R provides two baselines: a VLM-driven task-planning baseline and a Diffusion Policy low-level control baseline [43][49].
- The VLM planning baseline improves planning accuracy through in-context examples and constrained generation [47][48].
- The Diffusion Policy baseline fuses visual features and robot states to predict future actions [49][52].

Group 6: Future Directions
- Kitchen-R can be extended to more complex scenarios, such as multi-robot collaboration and dynamic environments, promoting the application of language-guided mobile manipulation robots in real-world settings [54].
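The two offline metrics named above are straightforward to compute. The sketch below is a minimal illustration of one plausible reading of them; the function names, plan representation, and data layout are assumptions for illustration, not the benchmark's actual API.

```python
import numpy as np

def exact_match(pred_plans, gt_plans):
    """Fraction of task plans whose predicted step sequence matches the
    ground-truth sequence exactly (one plausible reading of the EM metric)."""
    assert len(pred_plans) == len(gt_plans)
    hits = sum(1 for p, g in zip(pred_plans, gt_plans) if list(p) == list(g))
    return hits / len(gt_plans)

def trajectory_mse(pred_traj, gt_traj):
    """Mean squared error between predicted and ground-truth trajectories,
    both given as (T, D) arrays of robot states or end-effector poses."""
    pred_traj = np.asarray(pred_traj, dtype=float)
    gt_traj = np.asarray(gt_traj, dtype=float)
    return float(np.mean((pred_traj - gt_traj) ** 2))

# Hypothetical usage with toy plans and trajectories
em = exact_match(
    pred_plans=[["go_to(table)", "pick(cup)"], ["go_to(sink)"]],
    gt_plans=[["go_to(table)", "pick(cup)"], ["go_to(stove)"]],
)
mse = trajectory_mse(np.zeros((50, 7)), np.full((50, 7), 0.01))
print(em, mse)  # 0.5 and 0.0001
```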
One Article to Cover It All: Breakthrough Directions from Multiple 2025 Papers Fusing VLA and RL
具身智能之心· 2025-08-25 00:04
Core Viewpoint
- The article surveys a significant shift in robotic embodied intelligence: integrating Vision-Language-Action (VLA) models with Reinforcement Learning (RL) to address core challenges in real-world robotic decision-making and task execution [2][57].

Group 1: GRAPE
- The GRAPE framework improves the generalization of robot policies through preference alignment, addressing the limitations of VLA models in task adaptability and generalization [4][5].
- GRAPE improves the success rate of in-domain tasks by 51.79% and out-of-domain tasks by 58.20%, while reducing collision rates by 37.44% under safety objectives [7][8].

Group 2: VLA-RL
- The VLA-RL framework uses a trajectory-level RL formulation to model manipulation trajectories and fine-tunes a reward model to tackle sparse rewards, improving task performance and showing signs of scalable reasoning [10][12].
- Across 40 challenging robotic tasks, VLA-RL significantly outperformed existing models, indicating its potential for scalable application [14].

Group 3: ReWiND
- The ReWiND framework adapts robot policies to unseen tasks using a pre-trained language-based reward function, improving generalization and sample efficiency without new demonstrations [17][18].
- ReWiND delivers a 2.4x improvement in reward generalization and a 5x performance gain for pre-trained dual-arm policies in real-world scenarios [20].

Group 4: ConRFT
- The ConRFT method uses a two-phase reinforcement fine-tuning approach to stabilize the supervision of VLA models, raising the success rate on real-world tasks to 96.3%, a 144% improvement over previous methods [23][28].
- The model requires only 45 to 90 minutes of online fine-tuning to achieve these results, demonstrating its efficiency [28].

Group 5: RLDG
- The RLDG method improves generalist robot policies by generating high-quality training data through reinforcement learning, addressing the limitations of human demonstration data [32][33].
- In physical experiments, RLDG achieved a 40% increase in success rates on precise manipulation tasks, showcasing its effectiveness in improving generalization [38].

Group 6: TGRPO
- The TGRPO method applies trajectory-level group relative policy optimization to improve the robustness and efficiency of VLA fine-tuning in new environments; a minimal sketch of the group-relative advantage idea appears after this list [39][43].
- TGRPO consistently outperformed a range of baselines across ten manipulation tasks, validating its effectiveness in improving VLA adaptability [43].

Group 7: iRe-VLA
- The iRe-VLA framework optimizes VLA models by alternating reinforcement learning and supervised learning, addressing the instability and computational burden of applying online RL directly [44][48].
- The approach has been validated in multiple simulated and real-world scenarios, proving its capability to improve performance in interactive settings [50].

Group 8: RIPT-VLA
- The RIPT-VLA method introduces interactive post-training for VLA models, using sparse binary success rewards to improve adaptability in low-data regimes [51][54].
- The framework shows significant gains in compatibility, efficiency, and generalization, achieving a 97% success rate with minimal supervision [56].
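Several of the methods above, TGRPO in particular, rely on group-relative advantages computed over a batch of sampled trajectories rather than a learned value function. The sketch below illustrates that generic GRPO-style computation at trajectory level; it is not the TGRPO authors' code, and the reward values and normalization details are placeholders.

```python
import numpy as np

def group_relative_advantages(trajectory_returns, eps=1e-8):
    """Group-relative advantages for rollouts of the same task prompt:
    subtract the group mean return and divide by the group standard
    deviation, the normalization used by GRPO-style methods."""
    returns = np.asarray(trajectory_returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)

# Hypothetical example: four rollouts of one manipulation task, each scored
# by sparse success (1.0) plus a small shaping term.
adv = group_relative_advantages([1.0, 0.0, 1.2, 0.1])
# Each trajectory's advantage then weights the policy-gradient update of
# every action taken along that trajectory.
print(adv)
```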
Conclusion
- The eight studies collectively represent a pivotal advancement in robotic intelligence, focusing on overcoming industry challenges such as task generalization, adaptability to dynamic environments, and multimodal information integration, with practical applications in home automation, industrial assembly, and robotic manipulation [57].
Three Months to Complete Your Study of Embodied "Brain + Cerebellum" Algorithms
具身智能之心· 2025-08-25 00:04
Core Viewpoint
- The exploration of Artificial General Intelligence (AGI) is increasingly focusing on embodied intelligence, which emphasizes the interaction and adaptation of intelligent agents within physical environments, enabling them to perceive, understand tasks, execute actions, and learn from feedback [1][6].

Industry Analysis
- Over the past two years, numerous high-profile teams have emerged in embodied intelligence, leading to highly valued companies such as Xinghaitu, Galaxy General, and Zhujidongli, which are advancing embodied-intelligence technology [3].
- Major domestic companies such as Huawei, JD.com, Tencent, Ant Group, and Xiaomi are actively investing and collaborating to build a comprehensive embodied-intelligence ecosystem, while international players such as Tesla and various U.S. investment institutions focus on foundation models and humanoid robot prototypes [5].

Technological Evolution
- The development of embodied intelligence has progressed through several stages:
  - The first stage focused on grasp-pose detection, which could not model task context or action sequences [6].
  - The second stage used behavior cloning, allowing robots to learn from expert demonstrations but revealing weak generalization and poor performance in multi-target scenarios [6].
  - The third stage introduced Diffusion Policy methods, improving stability and generalization by modeling action trajectories [7].
  - The fourth stage, starting in 2025, explores combining VLA models with reinforcement learning and tactile sensing, aiming to overcome current limitations and improve robots' planning and decision-making capabilities [8].

Product and Market Development
- The evolution of these technologies has produced a range of products, including humanoid robots, robotic arms, and quadruped robots, serving industries such as manufacturing, home services, dining, and healthcare [9].
- As the industry shifts from research to deployment, demand for engineering and systems capabilities is rising, requiring stronger engineering skills for effective implementation [12].
ODYSSEY: Zhejiang University's Unified VLN + VLA Framework for Embodied Intelligence
具身智能之心· 2025-08-25 00:04
Core Insights
- The article presents the ODYSSEY framework, which integrates hierarchical task planning with terrain-adaptive whole-body control, achieving sim-to-real transfer and demonstrating strong generalization across diverse environments and long-horizon tasks [4][38].

Group 1: Research Background
- The framework addresses the limitations of existing mobile-manipulation research in dynamic, unstructured environments by proposing a unified mobile manipulation framework that lets quadruped robots execute long-horizon tasks [5].
- A hierarchical vision-language planner decomposes long-horizon instructions into executable actions based on egocentric perception, bridging the gap between egocentric perception and language-specified tasks [4][5].

Group 2: Methodology
- The whole-body control policy is a single network that maps a comprehensive observation vector to target actions, incorporating multiple sensory inputs; a minimal sketch of this interface follows this summary [9].
- Training proceeds in two stages: the first trains locomotion under static loads, while the second controls all joints and extends the reward function with an end-effector tracking term [11].

Group 3: Performance Evaluation
- The framework was evaluated on a series of long-horizon mobile manipulation tasks covering diverse indoor and outdoor scenarios, with 246 indoor and 58 outdoor variations in total [18][20].
- Experimental results show significant overall improvements across all datasets and finer manipulation than the PerAct baseline, especially on unseen data [17][29].

Group 4: Real-World Application
- ODYSSEY was tested on real-world tasks such as "navigate to grasp" and "grasp and place" with a variety of objects, showcasing its potential for long-horizon mobile exploration and manipulation [36][37].
- Despite overall success rates above 40% on all tasks, robust perception and high-precision control remain challenges for seamless real-world deployment [37][38].
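The whole-body controller is described only at the interface level: a single network from a stacked observation vector to joint targets, trained in two stages with an end-effector tracking reward added in the second. The sketch below is a minimal rendering of that interface under assumed observation and action dimensions and an assumed Gaussian-shaped tracking term; it is not ODYSSEY's actual network, reward weights, or training loop.

```python
import torch
import torch.nn as nn

class WholeBodyPolicy(nn.Module):
    """Single MLP mapping a stacked observation vector (proprioception,
    commands, previous actions, ...) to joint position targets.
    The dimensions here are placeholders, not ODYSSEY's actual sizes."""

    def __init__(self, obs_dim=75, act_dim=18, hidden=(512, 256, 128)):
        super().__init__()
        layers, in_dim = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ELU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, act_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, obs):
        return self.net(obs)

def end_effector_tracking_reward(ee_pos, target_pos, sigma=0.1):
    """Example shaping term rewarding proximity of the gripper to its target,
    one plausible form of the second-stage tracking reward."""
    err = torch.norm(ee_pos - target_pos, dim=-1)
    return torch.exp(-(err ** 2) / (2 * sigma ** 2))

policy = WholeBodyPolicy()
obs = torch.zeros(1, 75)   # placeholder observation
action = policy(obs)       # joint targets for legs + arm
r = end_effector_tracking_reward(torch.zeros(1, 3), torch.tensor([[0.1, 0.0, 0.0]]))
```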
The 具身智能之心 Humanoid Robot Discussion Group Is Now Live
具身智能之心· 2025-08-24 13:22
Group 1
- The article introduces a new community for humanoid robot enthusiasts, focusing on areas such as humanoid control, VLA models, data collection, and hardware [1].
- The community aims to connect professionals and students working in related fields to foster collaboration and knowledge sharing [1].

Group 2
- Interested individuals are encouraged to add a designated assistant on WeChat with specific instructions for joining the group [2].
- The requirement of a nickname and specific keywords for entry reflects the community's organized approach to membership [2].
Coming Tomorrow: Nvidia's New "Brain" for Embodied Robots to Be Unveiled
具身智能之心· 2025-08-24 12:36
Core Viewpoint
- Nvidia is positioning itself at the forefront of the emerging Physical AI market, which is expected to unlock trillion-dollar opportunities in the robotics industry, as highlighted by the company's recent developments and announcements [6][7].

Group 1: Nvidia's Announcements and Developments
- Nvidia CEO Jensen Huang teased a significant event scheduled for August 25, 2025, hinting at a new development in robotics [2].
- The company recently released a video previewing a new Physical AI application and a robot vision-reasoning model called Cosmos Reason, which allows robots to reason and act in the real world [4][6].
- In one example from Nvidia, a robotic arm correctly infers the next action in a scenario involving a toaster and bread, demonstrating the practical application of the reasoning model [5].

Group 2: The Future of Physical AI
- Huang has stated that the next wave of AI will be Physical AI, which uses motion skills to understand and interact with the real world [6].
- Physical AI is embodied in autonomous machines such as robots and self-driving cars, enabling them to perceive, understand, and execute complex tasks in real-world environments [6].

Group 3: Market Potential and Industry Trends
- At the 2025 World Robot Conference, Nvidia VP Rev Lebaredian said Physical AI could drive a trillion-dollar market, with significant advances occurring across sectors [7].
- Major companies at home and abroad, such as Huawei, ByteDance, BYD, Xiaomi, and Tesla, are intensifying their focus on embodied intelligence, indicating a robust growth trajectory for the robotics industry [7].
- The emergence of companies like DeepSeek is fostering the development of general-purpose robot models, producing a competitive humanoid-robot landscape that is expected to reach commercial viability soon [7].
A Roundup of Real-World Robot Datasets for Embodied Intelligence
具身智能之心· 2025-08-22 16:03
Core Insights
- The article compiles and shares datasets for embodied intelligence and robotics, highlighting the projects and their significance for advancing robot manipulation and learning [3][4][5][10].

Group 1: Datasets and Projects
- BRMData: a dataset aimed at empowering embodied manipulation for household tasks, with a project link provided [4].
- AgiBot World Colosseo: a large-scale manipulation platform designed for scalable and intelligent embodied systems, with a project link [4].
- RoboMIND: a benchmark of multi-embodiment intelligence normative data for robot manipulation, with a project link [4].
- OpenX-Embodiment: robotic learning datasets and RT-X models, with a project link [4].
- DROID: a large-scale in-the-wild robot manipulation dataset, with a project link [5].
- RH20T: a comprehensive robotic dataset for learning diverse skills in one shot, with a project link [5].
- BridgeDataV2: a dataset for robot learning at scale, with a project link [5].
- RT-2: vision-language foundation models as effective robot imitators, with a project link [5].
- RT-1: a robotics transformer for real-world control at scale, with a project link [6].
- Bridge Data: boosting the generalization of robotic skills with cross-domain datasets, with a project link [7].
- BC-Z: zero-shot task generalization with robotic imitation learning, with a project link [7].

Group 2: Community and Collaboration
- The article promotes the "Embodied Intelligence Heart" knowledge community as the first developer community in China focused on embodied intelligence, emphasizing its role as a professional exchange platform [10].
- The community covers datasets, simulation platforms, large models, reinforcement learning, and robot manipulation, with over 30 learning paths and 60 datasets summarized [10].
- It encourages collaboration among nearly 200 companies and institutions, fostering academic and industrial exchange [10][13].
ICCV 2025 | A Cornerstone for General Tool-Using Agents: Peking University Proposes the ToolVQA Dataset
具身智能之心· 2025-08-22 16:03
Core Viewpoint
- The article introduces ToolVQA, a large-scale multimodal dataset designed to strengthen the tool-use capabilities of foundation models on multi-step visual question answering (VQA) tasks, addressing the significant performance gap in real-world applications [2][3][7].

Summary by Sections
Dataset Overview
- ToolVQA contains 23,655 samples featuring real image scenes and implicit multi-step reasoning tasks, closely aligned with actual user interaction needs [3][22].
- The dataset covers 10 types of multimodal tools and 7 task domains, with an average of 2.78 reasoning steps per sample [3][22].

Data Generation Process
- The dataset is generated with a novel construction pipeline called ToolEngine, which uses depth-first search (DFS) and dynamic in-context example matching to simulate human-like tool-use reasoning chains; a minimal DFS sketch appears after this summary [3][15][18].
- ToolEngine can automatically generate high-quality VQA instances from a single image input, significantly reducing data cost and enabling scalability [15][18].

Key Features of ToolVQA
- The dataset features complex visual scenes with real-world context and challenging queries requiring implicit multi-step reasoning [13][15].
- Each question requires the model to plan the order of tool calls autonomously over multiple interactions, rather than being explicitly prompted [15][20].
- ToolVQA spans a rich variety of tools, supporting tasks from text extraction to image understanding and numerical calculation [15][22].

Model Performance
- Fine-tuning on ToolVQA significantly improves model performance, with the 7B model outperforming the closed-source GPT-3.5-turbo on multiple evaluation metrics [3][24].
- The fine-tuned model also generalizes strongly to out-of-distribution datasets, surpassing GPT-3.5-turbo on several benchmarks [24][25].

Error Analysis
- Despite the improvements, analysis of 100 failure cases reveals key bottlenecks in parameter prediction and answer integration, indicating that early errors can cascade into failures on multi-step reasoning tasks [27][28].
- The findings highlight the need for greater robustness when models handle dynamic feedback and integrate intermediate information [28].

Conclusion
- ToolVQA establishes a new benchmark for multi-step tool-reasoning tasks, providing a structured framework for training and evaluating models' reasoning and tool-understanding capabilities [31].
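ToolEngine is described as searching for tool-use chains by depth-first search with dynamic in-context example matching. The sketch below shows only a generic DFS skeleton over a chain of tool calls; the tool dictionary interface, the example tools, and the acceptance criterion (a fixed chain length) are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ToolCall:
    name: str
    output: str

def dfs_tool_chain(state: str,
                   tools: List[dict],
                   max_depth: int,
                   chain: Optional[List[ToolCall]] = None) -> Optional[List[ToolCall]]:
    """Depth-first search for a chain of tool calls up to max_depth steps.
    `tools` holds dicts with hypothetical 'name', 'applicable', and 'execute'
    entries; a chain is accepted once max_depth steps have been taken."""
    chain = chain or []
    if len(chain) == max_depth:
        return chain                      # a complete multi-step chain
    for tool in tools:
        if not tool["applicable"](state):
            continue
        result = tool["execute"](state)   # tool output becomes the new context
        found = dfs_tool_chain(result, tools, max_depth,
                               chain + [ToolCall(tool["name"], result)])
        if found is not None:
            return found
    return None

# Hypothetical two-tool example: OCR followed by a calculator over its output.
tools = [
    {"name": "ocr", "applicable": lambda s: s.startswith("image"),
     "execute": lambda s: "price: 12+30"},
    {"name": "calculator", "applicable": lambda s: "+" in s,
     "execute": lambda s: "42"},
]
print(dfs_tool_chain("image_of_receipt", tools, max_depth=2))
```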
Helped Another Student Land a VLA Algorithm Role......
具身智能之心· 2025-08-22 16:03
Yesterday afternoon a student with a solid background, about to enter the third year of a master's program at a C9 university, came to 峰哥 to vent during autumn recruitment: a labmate had just landed a VLA algorithm role at a particularly well-funded embodied intelligence company, "and I'm afraid it's too late for me to switch......" The two had started out doing traditional robotics together, SLAM-related work. Somewhere along the way the labmate picked up projects that moved much faster and passed interviews at several companies. "My labmate only recommended your community to me in the past couple of days; the curriculum is very complete, I just worry it may be a bit late."

Throughout August, students have kept reaching out to 峰哥, either having just received verbal offers or worried that it is too late to pivot into embodied AI. Although autumn recruitment is near, the same advice applies: "It is never too late." The top priority is to fill in the complete embodied intelligence roadmap as quickly as possible, especially data collection, algorithms, and simulation.

If you do not yet have strong skills in independent learning and tracking down answers, you can join our embodied community, currently the largest and most comprehensive embodied learning platform in China: the 具身智能之心 Knowledge Planet.

The 具身智能之心 Knowledge Planet combines video, articles, learning roadmaps, Q&A, and job-hunting exchange in one comprehensive embodied community of nearly 2,000 members, with the goal of reaching roughly 10,000 within the next two years. It is a hub for exchange and technical sharing that many beginners and advanced learners visit regularly.

The community also regularly answers practical questions for members: how to use the hardware, how to collect data effectively, how to deploy VLA models, and so on. Is the capture background too complex, or is the data rather dirt ...
Small Models Can Surpass GPT-4o Too! Xipeng Qiu's Team Builds "World-Aware" Agents with the WAP Framework
具身智能之心· 2025-08-22 00:04
Core Insights
- The article discusses the potential of large vision-language models (LVLMs) for embodied planning tasks, highlighting the challenges they face in unfamiliar environments and with complex multi-step goals [2][6].
- A new framework called World-aware Planning (WAP) is introduced, which equips LVLMs with four cognitive abilities: visual appearance modeling, spatial reasoning, functional abstraction, and syntax grounding [2][6].
- The enhanced model, Qwen2.5-VL, achieved a 60.7% absolute improvement in task success rate on the EB-ALFRED benchmark, excelling in particular at common-sense reasoning (+60.0%) and long-term planning (+70.0%) [2][6].

Summary by Sections
Introduction
- The article emphasizes the breakthroughs of multimodal models while noting the significant challenges they still face in embodied planning [6].

Framework Innovation
- The WAP framework is presented as a novel approach that integrates the four cognitive abilities above to improve an AI system's understanding of the physical world [7].

Performance Metrics
- The open-source Qwen2.5-VL model significantly outperformed proprietary systems such as GPT-4o and Claude-3.5-Sonnet, marking a substantial leap in performance [2][6][7].

Future Implications
- The advances in embodied planning enabled by WAP open new possibilities for AI applications in real-world scenarios [6][7].