NUS × SJTU Release RoboCerebra: A New Benchmark for Long-Horizon Robotic Manipulation Reasoning
自动驾驶之心· 2025-06-29 11:33
Core Insights
- The article introduces RoboCerebra, a new benchmark for evaluating long-horizon robotic manipulation tasks, emphasizing the need for collaboration between a high-level planner (a vision-language model, VLM) and a low-level controller (a vision-language-action model, VLA) [6][8][10].

Group 1: Background and Motivation
- Recent advances in vision-language models (VLMs) allow robots to execute commands from visual input, but performance degrades as tasks grow more complex and begin to require long-term planning and memory management [6][7].
- Existing benchmarks rarely assess how well the VLM and VLA cooperate, which leads to unreliable performance in dynamic environments [8].

Group 2: RoboCerebra Contributions
- RoboCerebra provides a large-scale dataset and a systematic benchmark covering the cognitive challenges of planning, memory, and reflection in robotic tasks [10].
- The dataset construction process combines automated generation with manual annotation to ensure both quality and scalability [10].

Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances injected to stress the models' planning and recovery abilities [11].
- A top-down data-generation pipeline uses GPT to propose high-level tasks, which are then decomposed into sub-goals and validated for logical consistency and physical feasibility [11][13]; a hedged sketch of such a pipeline appears after this summary.

Group 4: Evaluation Protocol and Metrics
- RoboCerebra uses a four-dimensional evaluation framework covering task success rate, plan-match accuracy, plan efficiency, and action-completion accuracy to measure VLM-VLA collaboration [15][21]; see the metric sketch below.
- The protocol defines anchor points that synchronize evaluation across different models, keeping task execution comparable [21].

Group 5: Experimental Results
- The hierarchical planning-and-execution framework markedly improves task success rates, particularly in memory-execution scenarios, demonstrating that VLM-VLA collaboration is necessary [27]; a sketch of such a planner/controller loop appears below.
- Using either the VLA or the VLM alone is insufficient for stable performance on complex tasks, underscoring the importance of their integration [27][28].

Group 6: Memory Task Evaluation
- On memory tasks, the VLM's reasoning ability is crucial for both memory exploration and memory execution, with GPT-4o outperforming the other evaluated models in exploration success rate and decision accuracy [31][32].
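The top-down data-generation pipeline in Group 3 is only described at a high level. The following is a minimal Python sketch of how such a pipeline could be wired together, assuming a generic `llm` callable; the function names (`generate_task`, `decompose`, `validate_subgoal`) and the toy feasibility rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a top-down task-generation pipeline: a language model
# proposes a high-level task, decomposes it into sub-goals, and a simple
# checker filters out sub-goals that reference objects missing from the scene.
# All names here (generate_task, decompose, validate_subgoal) are illustrative.

from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    description: str                          # high-level task, e.g. "tidy the kitchen"
    subgoals: list[str] = field(default_factory=list)


def generate_task(llm) -> TaskSpec:
    """Ask the language model for one high-level, long-horizon task."""
    description = llm("Propose a long-horizon household manipulation task.")
    return TaskSpec(description=description)


def decompose(llm, task: TaskSpec) -> TaskSpec:
    """Break the high-level task into an ordered list of sub-goals."""
    raw = llm(f"List ordered sub-goals for: {task.description}")
    task.subgoals = [line.strip() for line in raw.splitlines() if line.strip()]
    return task


def validate_subgoal(subgoal: str, scene_objects: set[str]) -> bool:
    """Toy feasibility check: every kept sub-goal must mention a scene object."""
    return any(obj in subgoal for obj in scene_objects)


def build_episode(llm, scene_objects: set[str]) -> TaskSpec:
    """Generate, decompose, then keep only sub-goals that pass validation."""
    task = decompose(llm, generate_task(llm))
    task.subgoals = [g for g in task.subgoals if validate_subgoal(g, scene_objects)]
    return task
```

In the paper the validation step reportedly also checks logical consistency between consecutive sub-goals; the sketch above only covers the object-existence part.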
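The four metrics in Group 4 are named but not defined in this digest. Below is a hedged sketch of how they might be aggregated over episodes aligned at anchor points; the field names and exact formulas are assumptions made for illustration and may differ from the paper's definitions.

```python
# Hypothetical scoring of the four dimensions described above: task success
# rate, plan-match accuracy, plan efficiency, and action-completion accuracy,
# aggregated over episodes whose counts are measured at shared anchor points.
# Field names and formulas are assumptions, not the benchmark's exact spec.

from dataclasses import dataclass


@dataclass
class EpisodeResult:
    succeeded: bool             # did the whole task finish
    planned_steps: list[str]    # sub-goals emitted by the VLM planner
    reference_steps: list[str]  # ground-truth sub-goal sequence
    actions_taken: int          # low-level steps the VLA actually executed
    actions_budget: int         # reference step budget at the same anchors
    completed_subgoals: int     # sub-goals the VLA finished at the anchors


def score(results: list[EpisodeResult]) -> dict[str, float]:
    n = len(results)
    success_rate = sum(r.succeeded for r in results) / n
    plan_match = sum(
        sum(p == q for p, q in zip(r.planned_steps, r.reference_steps))
        / max(len(r.reference_steps), 1)
        for r in results
    ) / n
    plan_efficiency = sum(
        min(r.actions_budget / max(r.actions_taken, 1), 1.0) for r in results
    ) / n
    action_completion = sum(
        r.completed_subgoals / max(len(r.reference_steps), 1) for r in results
    ) / n
    return {
        "success_rate": success_rate,
        "plan_match_accuracy": plan_match,
        "plan_efficiency": plan_efficiency,
        "action_completion_accuracy": action_completion,
    }
```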
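Group 5 attributes the performance gains to a hierarchical planning-and-execution framework. The loop below sketches one plausible realization under assumed interfaces (`vlm_planner.plan()`, `vla_policy.act()`, a gym-style `env.step()`); it is not the paper's actual control stack.

```python
# Hypothetical sketch of a hierarchical control loop: a VLM planner
# periodically re-reads the scene and selects the current sub-goal, while a
# VLA policy executes low-level actions toward it. The interfaces and the
# replan_every schedule are illustrative assumptions.


def run_episode(env, vlm_planner, vla_policy, max_steps: int = 3000,
                replan_every: int = 50) -> bool:
    """Alternate high-level planning (VLM) with low-level control (VLA)."""
    obs = env.reset()
    subgoal = vlm_planner.plan(obs)            # initial sub-goal from the VLM
    for step in range(max_steps):
        action = vla_policy.act(obs, subgoal)  # VLA conditions on the sub-goal
        obs, done, info = env.step(action)
        if done:
            return True
        # Re-plan periodically so the VLM can react to dynamic disturbances
        # (moved objects, failed grasps) and update its memory of the scene.
        if (step + 1) % replan_every == 0:
            subgoal = vlm_planner.plan(obs)
    return False
```

In such a design, the re-plan interval is the main knob trading planning cost against responsiveness to the dynamic disturbances mentioned in Group 3; the digest does not state which schedule the authors actually use.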