MMSI-Video-Bench: The Ultimate Spatial Intelligence Challenge Arrives, and Top Large Models All Fail
机器之心·2026-01-05 08:54

Core Insights
- The article argues that spatial understanding is a prerequisite for multimodal large language models (MLLMs) to transition into real-world applications as "general intelligent assistants" [2]
- It highlights the limitations of existing spatial intelligence benchmarks, which either rely heavily on template-generated questions or focus on narrow spatial tasks, making it difficult to comprehensively assess models' spatial understanding and reasoning in real-world scenarios [2]

Group 1: Introduction of MMSI-Video-Bench
- The Shanghai Artificial Intelligence Laboratory's InternRobotics team has released MMSI-Video-Bench, a comprehensive and rigorous video benchmark for spatial intelligence, designed to challenge current mainstream multimodal models [2][6]
- The benchmark evaluates models' spatial perception, reasoning, and decision-making in complex, dynamic real-world environments [2][7]

Group 2: Benchmark Characteristics
- MMSI-Video-Bench systematically designs question types that probe models' basic spatial perception grounded in spatiotemporal information [6]
- It includes high-level decision-making evaluations and extends the task categories to complex real-world scenarios, testing cross-video reasoning, memory updating, and multi-view integration [6][8]
- The benchmark comprises five major task types and 13 subcategories, ensuring a comprehensive evaluation of spatial intelligence; a minimal scoring sketch for a benchmark of this shape appears after this digest [10]

Group 3: Challenge and Performance
- The questions are deliberately challenging: every model tested falls short, and even the best performer, Gemini 3 Pro, reaches only 38% accuracy, roughly 60 percentage points below human performance [10][14]
- The evaluation shows that models struggle with spatial construction, motion understanding, planning, prediction, and cross-video reasoning, exposing critical capability bottlenecks [14][15]

Group 4: Error Analysis
- The research team identified five main error types behind model failures: detailed grounding errors, ID mapping errors, latent logical inference errors, prompt alignment errors, and geometric reasoning errors [17][21]; a small tallying sketch for such annotated failures follows the scoring example below
- Geometric reasoning errors were the most prevalent and the most damaging to performance, particularly on spatial construction tasks [19][21]

Group 5: Future Directions
- The article suggests that introducing 3D spatial cues could help models grasp spatial relationships, pointing to a direction for future research; a hedged sketch of injecting such cues closes this digest [22][24]
- It emphasizes the need to design spatial cues that models can genuinely understand and use, since current failures stem from weak underlying reasoning rather than from a lack of explicit reasoning steps [27]
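
Below is a minimal sketch of how per-subcategory accuracy could be computed for a multiple-choice video benchmark like the one described in Group 2. The JSON schema (`videos`, `question`, `options`, `answer`, `subcategory`) and the `query_model` stub are illustrative assumptions, not MMSI-Video-Bench's actual format or evaluation harness.

```python
# Minimal scoring sketch for a multiple-choice spatial video benchmark.
# Field names and the model interface are assumptions for illustration;
# they do not reflect MMSI-Video-Bench's actual schema or harness.
import json
from collections import defaultdict

def query_model(videos: list[str], question: str, options: list[str]) -> str:
    """Placeholder for an MLLM call returning an option letter such as 'A'."""
    raise NotImplementedError  # plug in your model client here

def evaluate(benchmark_file: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(benchmark_file) as f:
        samples = json.load(f)
    for s in samples:
        pred = query_model(s["videos"], s["question"], s["options"])
        cat = s["subcategory"]  # one of the 13 subcategories
        total[cat] += 1
        correct[cat] += int(pred == s["answer"])
    # Per-subcategory accuracy; overall accuracy is the sample-weighted mean.
    return {cat: correct[cat] / total[cat] for cat in total}
```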
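
The five error categories in Group 4 lend themselves to a simple tally over annotated failure cases. Only the five category names below come from the article; the annotation file format is hypothetical.

```python
# Hypothetical tally of annotated failure cases by error type. The five
# category names come from the article; the JSON annotation format is assumed.
import json
from collections import Counter

ERROR_TYPES = {
    "detailed_grounding",
    "id_mapping",
    "latent_logical_inference",
    "prompt_alignment",
    "geometric_reasoning",
}

def tally_errors(annotation_file: str) -> Counter:
    """Count failure cases per error type from a hypothetical annotation dump."""
    with open(annotation_file) as f:
        cases = json.load(f)
    return Counter(c["error_type"] for c in cases if c["error_type"] in ERROR_TYPES)
```

On the article's findings, `tally_errors(...).most_common(1)` would surface the geometric reasoning category.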
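
Finally, one plausible reading of the "introduce 3D spatial cues" direction in Group 5 is to serialize externally estimated object coordinates into the text prompt. Everything here (the cue format, the example objects and coordinates) is an illustrative assumption, not the team's actual design.

```python
# Hypothetical sketch: serialize per-object 3D coordinates (e.g., from a
# detector plus a depth estimator) into the text prompt as spatial cues.
def format_spatial_cues(objects: list[dict]) -> str:
    """Render object positions in camera-frame meters as a prompt prefix."""
    lines = [
        f"- {o['label']}: x={o['xyz'][0]:.1f}m, y={o['xyz'][1]:.1f}m, z={o['xyz'][2]:.1f}m"
        for o in objects
    ]
    return "3D positions of visible objects (camera frame):\n" + "\n".join(lines)

prompt = (
    format_spatial_cues([
        {"label": "chair", "xyz": (0.4, -0.2, 2.1)},
        {"label": "door", "xyz": (-1.3, 0.0, 4.8)},
    ])
    + "\n\nQuestion: Is the chair closer to the camera than the door?"
)
```

The design question the article raises is whether a model can actually exploit such cues; naive serialization may not help if geometric reasoning itself is the bottleneck.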