空间智能终极挑战MMSI-Video-Bench来了

Core Insights - The article discusses the launch of the MMSI-Video-Bench, a comprehensive benchmark for evaluating spatial intelligence in multimodal large language models (MLLMs), emphasizing the need for models to understand and interact with complex real-world environments [1][5][25]. Group 1: Benchmark Features - MMSI-Video-Bench is designed with a systematic approach to assess models' spatial perception capabilities, focusing on spatial construction and motion understanding [5][6]. - The benchmark evaluates high-level decision-making abilities based on spatiotemporal information, including memory update and multi-view integration [6][7]. - It consists of five main task types and 13 subcategories, covering planning and prediction capabilities [9]. Group 2: Model Performance - The benchmark revealed that even the best-performing model, Gemini 3 Pro, achieved only 38% accuracy, indicating a significant performance gap of nearly 60% compared to human levels [10][14]. - The evaluation highlighted deficiencies in models' spatial construction, motion understanding, planning, and prediction capabilities [14][16]. - Detailed error analysis identified five main types of errors affecting model performance, including detailed grounding errors and geometric reasoning errors [16][20]. Group 3: Data Sources and Evaluation - The video data for MMSI-Video-Bench is sourced from 25 public datasets and one self-built dataset, encompassing various real-world scenarios [11]. - The benchmark allows for targeted assessments of specific capabilities in indoor scene perception, robotics, and grounding [11]. Group 4: Future Directions - The article suggests that introducing 3D spatial cues could enhance model understanding and reasoning capabilities [21][26]. - It emphasizes the ongoing challenge of designing models that can effectively utilize spatial cues and highlights that current failures are rooted in fundamental reasoning limitations rather than a lack of explicit reasoning steps [26].