Did large models "crash" on embodied reasoning? 4,496 questions comprehensively expose their shortcomings
机器之心·2025-10-28 00:41

Core Insights
- The article evaluates multimodal large language models (MLLMs) on embodied intelligence tasks, providing a detailed failure analysis and proposing an agent algorithm for improvement [25].

Group 1: Embodied Intelligence and MLLMs
- Embodied intelligence refers to an agent closing the loop of perception, understanding, and decision-making in an environment, which relies on a broad range of skills [2].
- Many strong works have deployed MLLMs across embodied-intelligence applications, but existing evaluations have focused mainly on subfields such as pointing and spatial reasoning [2][4].

Group 2: BEAR Benchmark
- The BEAR benchmark, proposed by Northeastern University in collaboration with other institutions, systematically evaluates MLLMs across sub-capabilities, with detailed error analysis and algorithmic enhancements [4].
- BEAR comprises 4,469 image-video-text VQA tasks spanning six major categories (five foundational categories plus a sixth long-range reasoning category), broken down into 14 distinct skills [8][9]; a sketch of what such an evaluation loop might look like appears after this summary.

Group 3: Evaluation Results
- The evaluation covered 20 MLLMs and found that even the best-performing model, GPT-5, achieved only a 52% success rate on BEAR [11].
- Closed-source models generally outperformed open-source ones, though some open-source models, such as the InternVL series, showed strong potential, surpassing models like GPT-4o and Claude [11].

Group 4: Error Analysis
- A fine-grained error analysis of GPT-4o showed that the model's visual capabilities are a major bottleneck across multiple categories, particularly in language grounding and trajectory understanding [19].
- 88% of errors in long-range reasoning were attributed to lower-level perception and spatial-reasoning failures [19].

Group 5: BEAR-Agent Development
- The authors developed BEAR-Agent, a multimodal agent that strengthens visual reasoning by providing tools such as drawing auxiliary lines, significantly improving performance on BEAR (see the drawing-tool sketch below) [17].
- Both the best open-source model (InternVL3-14B) and the closed-source GPT-5 improved significantly when paired with BEAR-Agent [17].

Group 6: Simulation Testing
- Further experiments in a desktop manipulation environment showed that BEAR-Agent improved MOKA's performance by 20.17%, indicating its potential for embodied agents [21].
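To make the benchmark's structure concrete, here is a minimal sketch of what a BEAR-style evaluation loop could look like. All field names (`question`, `choices`, `category`, `skill`) and the `model.query` interface are illustrative assumptions, not BEAR's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BearTask:
    question: str        # natural-language question about the media
    media: list[str]     # one image path, or several frames for video tasks
    choices: list[str]   # multiple-choice options
    answer: str          # ground-truth choice label, e.g. "B"
    category: str        # one of the six top-level categories
    skill: str           # one of the 14 fine-grained skills

def evaluate(model, tasks: list[BearTask]) -> dict[str, float]:
    """Accumulate accuracy per category and per skill over all tasks."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for task in tasks:
        # `model.query` stands in for whatever VQA interface the MLLM exposes.
        prediction = model.query(task.media, task.question, task.choices)
        for key in (task.category, task.skill):
            total[key] += 1
            correct[key] += prediction.strip() == task.answer
    return {key: correct[key] / total[key] for key in total}
```

Reporting accuracy per skill rather than a single aggregate score is what enables the kind of fine-grained failure attribution described in Group 4.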
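The article credits BEAR-Agent's gains partly to tools such as drawing auxiliary lines on the input image. As a hedged illustration of what such a tool could look like, the sketch below uses Pillow; the function name, signature, and agent wiring are assumptions, not BEAR-Agent's actual implementation.

```python
from PIL import Image, ImageDraw

def draw_auxiliary_line(image_path: str, p1: tuple[int, int],
                        p2: tuple[int, int], out_path: str) -> str:
    """Overlay a straight line between two pixel coordinates and save the
    annotated image, so the MLLM can re-inspect it (e.g., to compare
    heights, align objects, or judge a trajectory's direction)."""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).line([p1, p2], fill="red", width=3)
    img.save(out_path)
    return out_path
```

In an agent loop, a helper like this would be exposed as a callable tool: the model names two points of interest, receives the annotated image back, and answers the question against that augmented view.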