Brain-Cerebellum Collaborative Reasoning

NUS × SJTU Release RoboCerebra: A New Benchmark for Long-Horizon Robotic Manipulation Reasoning
自动驾驶之心· 2025-06-29 11:33
Core Insights
- The article discusses the development of RoboCerebra, a new benchmark designed to evaluate long-horizon robotic manipulation tasks, emphasizing the need for collaboration between high-level planning (VLM) and low-level control (VLA) models [6][8][10].

Group 1: Background and Motivation
- Recent advancements in vision-language models (VLMs) have enabled robots to execute commands based on visual inputs, but challenges arise when tasks become more complex, requiring long-term planning and memory management [6][7].
- Existing benchmarks often fail to assess the collaborative capabilities of the VLM and VLA, leading to performance issues in dynamic environments [8].

Group 2: RoboCerebra Contributions
- RoboCerebra includes a large-scale dataset and a systematic benchmark for evaluating cognitive challenges related to planning, memory, and reflection in robotic tasks [10].
- The dataset construction process integrates automated generation and manual annotation to ensure high quality and scalability [10].

Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances introduced to challenge the models' planning and recovery abilities [11].
- A top-down data generation pipeline uses GPT to create high-level tasks, which are then broken down into sub-goals and validated for logical consistency and physical feasibility [11][13].

Group 4: Evaluation Protocol and Metrics
- RoboCerebra employs a four-dimensional evaluation framework assessing success rate, plan match accuracy, plan efficiency, and action completion accuracy to measure the collaboration between the VLM and VLA [15][21].
- The framework includes anchor points to synchronize evaluations across different models, ensuring consistency in task execution [21].

Group 5: Experimental Results
- The hierarchical planning and execution framework significantly improves task success rates, particularly in memory execution scenarios, demonstrating the necessity of collaboration between the VLM and VLA [27].
- The results indicate that using either the VLA or the VLM alone is insufficient for stable performance in complex tasks, highlighting the importance of their integration [27][28].

Group 6: Memory Task Evaluation
- The evaluation of memory tasks shows that the VLM's reasoning capabilities are crucial for both memory exploration and execution, with GPT-4o outperforming other models in exploration success rates and decision accuracy [31][32].
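The planner/controller split described under Group 5 can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not RoboCerebra's implementation: `Episode`, `run_hierarchical`, `first_remaining`, and `flaky_execute` are hypothetical names, the planner is a plain function standing in for a VLM, the executor is a plain function standing in for a VLA, and sub-goal names are assumed to be unique.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Episode:
    """Toy stand-in for a long-horizon task: an ordered list of sub-goals."""
    subgoals: List[str]
    done: List[str] = field(default_factory=list)


def run_hierarchical(
    episode: Episode,
    plan: Callable[[List[str], List[str]], str],
    execute: Callable[[str], bool],
    max_steps: int = 20,
) -> bool:
    """Alternate between a high-level planner and a low-level executor."""
    for _ in range(max_steps):
        remaining = [g for g in episode.subgoals if g not in episode.done]
        if not remaining:
            return True
        subgoal = plan(remaining, episode.done)  # hypothetical VLM stand-in
        if execute(subgoal):                     # hypothetical VLA stand-in
            episode.done.append(subgoal)
        # On failure nothing is recorded, so the next pass re-plans.
    return all(g in episode.done for g in episode.subgoals)


def first_remaining(remaining: List[str], done: List[str]) -> str:
    """Trivial planner: always pick the next unfinished sub-goal."""
    return remaining[0]


attempts: Dict[str, int] = {}


def flaky_execute(subgoal: str) -> bool:
    """Executor with a simulated disturbance: the first 'open drawer' attempt fails."""
    attempts[subgoal] = attempts.get(subgoal, 0) + 1
    return not (subgoal == "open drawer" and attempts[subgoal] == 1)


episode = Episode(["open drawer", "pick cup", "place cup in drawer"])
success = run_hierarchical(episode, first_remaining, flaky_execute)
print(success, attempts)
```

In this toy run the failed first attempt is retried because the planner is consulted again after every execution step. That re-planning is the recovery behaviour the dynamic disturbances are meant to probe.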
Beihang × NUS × SJTU Release RoboCerebra: A New Benchmark for Long-Horizon Robotic Manipulation Reasoning
具身智能之心· 2025-06-28 07:48
Core Insights
- The article discusses the development of RoboCerebra, a new benchmark designed to evaluate long-horizon robotic manipulation tasks, emphasizing the need for collaboration between high-level planning (VLM) and low-level control (VLA) models [6][8][10].

Group 1: Background and Motivation
- Recent advancements in vision-language models (VLMs) have enabled robots to execute commands based on natural language, but as tasks become more complex, a dual system involving both a "brain" (VLM) for planning and a "controller" (VLA) for execution is necessary [6][7].
- Existing benchmarks often fail to assess the collaborative capabilities of these systems, leading to the creation of RoboCerebra to evaluate long-term planning and memory management [8].

Group 2: RoboCerebra Contributions
- RoboCerebra includes a large-scale dataset and a systematic benchmark for assessing cognitive challenges related to planning, memory, and reflection in robotic tasks [10].
- The dataset construction process integrates automated generation and manual annotation to ensure high quality and scalability [10].

Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances introduced to challenge the models' planning and recovery abilities [13].
- A top-down data generation pipeline uses GPT to create high-level tasks, which are then broken down into sub-goals and verified for feasibility [13][14].

Group 4: Evaluation Protocol and Metrics
- RoboCerebra employs a four-dimensional evaluation system that includes success rate, plan match accuracy, plan efficiency, and action completion accuracy to assess the collaboration between the VLM and VLA [15][21].
- The framework introduces anchor points to synchronize evaluation across different models, ensuring consistency in task execution [21].

Group 5: Experimental Results
- The hierarchical framework demonstrates that the collaboration between the VLM and VLA significantly improves task success rates, particularly in memory execution scenarios, with improvements exceeding 70% [27].
- The results indicate that neither the VLA nor the VLM alone can effectively handle long-horizon tasks, highlighting the necessity of their integration [27][28].

Group 6: Model Evaluation
- GPT-4o outperforms other models in planning accuracy, task success rate, and plan efficiency, underscoring the importance of strong language reasoning capabilities in executing long-term tasks [30].
- In memory-related tasks, GPT-4o shows superior exploration and execution decision-making abilities compared to other models, indicating its robustness in understanding scenes and recalling memories [31].
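The summary names four evaluation dimensions (success rate, plan match accuracy, plan efficiency, action completion accuracy) but does not define them. The sketch below shows one plausible way such scores could be computed from per-episode logs. Every formula here is an illustrative assumption rather than the benchmark's definition, and `EpisodeLog` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeLog:
    """Hypothetical per-episode record; not the benchmark's data format."""
    reference_plan: List[str]  # annotated sub-goal sequence
    predicted_plan: List[str]  # sub-goal sequence proposed by the planner
    completed: List[bool]      # per reference sub-goal: was it finished?
    success: bool              # was the final task goal reached?


def success_rate(logs: List[EpisodeLog]) -> float:
    """Fraction of episodes that reached the final goal."""
    return sum(log.success for log in logs) / len(logs)


def plan_match_accuracy(log: EpisodeLog) -> float:
    """ASSUMED definition: position-wise agreement with the reference plan."""
    matches = sum(p == r for p, r in zip(log.predicted_plan, log.reference_plan))
    return matches / len(log.reference_plan)


def plan_efficiency(log: EpisodeLog) -> float:
    """ASSUMED definition: reference length over predicted length, capped at 1."""
    if not log.predicted_plan:
        return 0.0
    return min(1.0, len(log.reference_plan) / len(log.predicted_plan))


def action_completion_accuracy(log: EpisodeLog) -> float:
    """ASSUMED definition: fraction of reference sub-goals completed."""
    return sum(log.completed) / len(log.completed)


reference = ["open drawer", "pick cup", "place cup"]
log_a = EpisodeLog(reference, list(reference), [True, True, True], True)
log_b = EpisodeLog(
    reference,
    ["open drawer", "pick cup", "pick cup", "place cup"],  # one redundant step
    [True, True, False],
    False,
)

print(success_rate([log_a, log_b]))
print(plan_match_accuracy(log_b), plan_efficiency(log_b))
print(action_completion_accuracy(log_b))
```

Keeping planner-side scores (plan match, plan efficiency) separate from controller-side scores (action completion) reflects the article's point that the VLM and VLA need to be assessed both individually and together.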