o3 racks its brains and still answers only 40% of the questions, while open-source models basically guess at random? MMSI-Bench: a touchstone for multi-image spatial intelligence
QbitAI (量子位) · 2025-06-11 05:13

Core Insights
- The article discusses the limitations of current multimodal large language models (MLLMs) in multi-image spatial reasoning, highlighting the need for a dedicated benchmark, MMSI-Bench, to evaluate and improve these models' spatial intelligence [1][2][4].

Group 1: Importance of Spatial Intelligence
- Spatial intelligence, which includes understanding object positions and movements, is crucial for applications such as autonomous driving and robotic navigation [2].
- Current assessments of MLLM spatial intelligence often focus on single images, failing to capture the complexity of real-world scenarios [3][5].

Group 2: MMSI-Bench Overview
- MMSI-Bench is designed to evaluate MLLMs' multi-image spatial reasoning abilities, emphasizing data quality and human-centered sample construction [7][8].
- The benchmark comprises 1,000 high-quality question-answer pairs drawn from over 120,000 images, with each question constructed to be challenging and to require integrating information across multiple images [8][12].

Group 3: Evaluation Findings
- A comprehensive evaluation of 34 widely used MLLMs revealed that even the best-performing model, OpenAI's o3, achieved only 41% accuracy, far below the human benchmark of 97.2% [15][16].
- The analysis found that most models struggle with multi-step reasoning and with understanding camera motion, indicating a significant gap in their spatial reasoning capabilities [18][19].

Group 4: Error Analysis
- An automated error-analysis pipeline was developed to diagnose MLLM failures, categorizing errors into four main types: grounding errors, overlap-matching errors, situation-transformation reasoning errors, and spatial-logic errors [20][21].
- By combining human insight with automated tooling, MMSI-Bench enables a deeper understanding of model failures, which can guide future improvements in spatial intelligence [22].
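The evaluation-and-diagnosis loop described above (score each MLLM on multiple-choice items, then bucket its wrong answers into the four error types) can be sketched as follows. This is a minimal illustration only: the item format, the function names, and the stub model are assumptions for the sketch, not MMSI-Bench's actual code or data schema.

```python
from collections import Counter

# Hypothetical multiple-choice items in the style the benchmark describes:
# each pairs several images with one question and one correct option letter.
ITEMS = [
    {"images": ["img_001.jpg", "img_002.jpg"], "question": "...", "answer": "B"},
    {"images": ["img_003.jpg", "img_004.jpg"], "question": "...", "answer": "C"},
]

# The four error categories named in the article's error analysis.
ERROR_TYPES = (
    "grounding",                 # misidentifying the referenced object/region
    "overlap-matching",          # failing to match content across images
    "situation-transformation",  # mishandling viewpoint/camera changes
    "spatial-logic",             # faulty multi-step spatial inference
)

def evaluate(model_answer, classify_error, items):
    """Score a model on multiple-choice items and tally error categories.

    model_answer(images, question) -> predicted option letter.
    classify_error(item, prediction) -> one of ERROR_TYPES (wrong answers only).
    """
    correct = 0
    errors = Counter()
    for item in items:
        pred = model_answer(item["images"], item["question"])
        if pred == item["answer"]:
            correct += 1
        else:
            errors[classify_error(item, pred)] += 1
    return correct / len(items), errors

# Example with a stub model that always answers "B": one of the two
# items above is answered correctly, so accuracy is 0.5.
acc, errs = evaluate(
    lambda imgs, q: "B",
    lambda item, pred: "spatial-logic",
    ITEMS,
)
print(acc, dict(errs))  # 0.5 {'spatial-logic': 1}
```

In the article's setup, the `classify_error` step is itself automated (an LLM inspects the model's reasoning trace), which is why the human-annotated ground truth matters for keeping that diagnosis reliable.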
Group 5: Future Directions
- MMSI-Bench aims to serve as a valuable resource for the community, promoting the development of more robust multimodal AI systems that can better understand and interact with the physical world [23].
- The benchmark's focus on real-world scenarios and high-quality human annotations is expected to improve the reliability of automated error analysis and model evaluation [24].