Beihang × NUS × SJTU Release RoboCerebra: A New Benchmark for Long-Horizon Robotic Manipulation Reasoning
具身智能之心·2025-06-28 07:48

Core Insights
- The article discusses RoboCerebra, a new benchmark designed to evaluate long-horizon robotic manipulation tasks, emphasizing the need for collaboration between high-level planning (VLM) and low-level control (VLA) models [6][8][10].

Group 1: Background and Motivation
- Recent advances in vision-language models (VLMs) have enabled robots to execute commands given in natural language, but as tasks grow more complex, a dual system becomes necessary: a "brain" (VLM) for planning paired with a "controller" (VLA, vision-language-action model) for execution [6][7].
- Existing benchmarks often fail to assess how well these two systems collaborate, motivating the creation of RoboCerebra to evaluate long-horizon planning and memory management [8].

Group 2: RoboCerebra Contributions
- RoboCerebra includes a large-scale dataset and a systematic benchmark for assessing the cognitive challenges of planning, memory, and reflection in robotic tasks [10].
- The dataset construction process combines automated generation with manual annotation to ensure both quality and scalability [10].

Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances introduced to stress the models' planning and recovery abilities [13].
- A top-down data generation pipeline uses GPT to create high-level tasks, which are then decomposed into sub-goals and verified for feasibility [13][14].

Group 4: Evaluation Protocol and Metrics
- RoboCerebra employs a four-dimensional evaluation system covering success rate, plan match accuracy, plan efficiency, and action completion accuracy to assess VLM-VLA collaboration [15][21].
- The framework introduces anchor points to synchronize evaluation across different models, ensuring consistency in task execution [21].
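The four metrics above can be illustrated with a small sketch. This is not RoboCerebra's actual evaluation code; the `EpisodeLog` fields and the exact metric definitions (ordered sub-goal matching for plan match, a length ratio for plan efficiency) are illustrative assumptions about how such per-episode traces might be scored.

```python
from dataclasses import dataclass

@dataclass
class EpisodeLog:
    """Hypothetical per-episode trace; field names are illustrative."""
    predicted_plan: list[str]   # sub-goals proposed by the VLM planner
    reference_plan: list[str]   # annotated ground-truth sub-goals
    completed_steps: int        # sub-goals the VLA controller actually finished
    task_succeeded: bool        # did the full task reach its goal state

def evaluate(episodes: list[EpisodeLog]) -> dict[str, float]:
    n = len(episodes)
    # Success rate: fraction of episodes that reached the goal.
    success_rate = sum(e.task_succeeded for e in episodes) / n

    # Plan match: fraction of reference sub-goals reproduced in order
    # (in-order subsequence match against the predicted plan).
    def match(e: EpisodeLog) -> float:
        it = iter(e.predicted_plan)
        hits = sum(any(ref == p for p in it) for ref in e.reference_plan)
        return hits / len(e.reference_plan)
    plan_match = sum(match(e) for e in episodes) / n

    # Plan efficiency: penalize plans longer than the reference plan.
    plan_eff = sum(min(1.0, len(e.reference_plan) / max(1, len(e.predicted_plan)))
                   for e in episodes) / n

    # Action completion: fraction of reference sub-goals the controller finished.
    completion = sum(e.completed_steps / len(e.reference_plan) for e in episodes) / n

    return {"success_rate": success_rate, "plan_match": plan_match,
            "plan_efficiency": plan_eff, "completion": completion}
```

Under these assumed definitions, a planner that emits extra sub-goals is penalized only on efficiency, while a controller that stalls mid-task lowers completion without necessarily zeroing out the success rate across episodes.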
Group 5: Experimental Results
- The hierarchical framework shows that VLM-VLA collaboration significantly improves task success rates, particularly in memory-execution scenarios, with improvements exceeding 70% [27].
- The results indicate that neither the VLA nor the VLM alone can handle long-horizon tasks effectively, underscoring the necessity of their integration [27][28].

Group 6: Model Evaluation
- GPT-4o outperforms the other evaluated models in planning accuracy, task success rate, and plan efficiency, underscoring the importance of strong language reasoning for executing long-horizon tasks [30].
- In memory-related tasks, GPT-4o shows superior exploration and execution decision-making compared to the other models, indicating robustness in scene understanding and memory recall [31].