AI Can Read Images but Cannot Judge Distance: SJTU's Spatial-Temporal Intelligence Benchmark Stumps 9 Top Multimodal Models

Core Insights
- Multimodal Large Language Models (MLLMs) are increasingly applied in embodied intelligence and autonomous driving, which raises the question of whether they are ready to understand complex physical environments [1][2]
- The Spatial-Temporal Intelligence Benchmark (STI-Bench) was introduced to test how precisely current MLLMs understand space and time [1][4]

Group 1: MLLM Capabilities
- MLLMs have achieved strong results in visual language understanding, but they must go beyond traditional semantic understanding to possess accurate spatial-temporal intelligence [2]
- Core tasks in AI applications such as autonomous driving and robotic operations require quantitative spatial-temporal understanding, which is currently a weak point for existing models [3][19]

Group 2: STI-Bench Overview
- STI-Bench evaluates models on real-world video inputs, focusing on precise, quantitative spatial-temporal understanding [4]
- The benchmark includes over 300 real-world videos covering three typical scenarios: desktop operations (millimeter-level), indoor environments (centimeter-level), and outdoor scenes (decimeter-level) [6]

Group 3: Evaluation Metrics
- The evaluation consists of eight tasks in two dimensions: static spatial understanding (scale measurement, spatial relationships, and 3D video localization) and dynamic temporal understanding (displacement, speed and acceleration, ego orientation, trajectory description, and pose estimation) [6]
- The dataset also includes over 2,000 high-quality question-answer pairs, checked for accuracy and relevance to the corresponding scenes [8]

Group 4: Experimental Results
- Leading MLLMs, including proprietary models such as GPT-4o and Gemini-2.5-Pro, performed poorly overall: the best models achieved less than 42% accuracy, only slightly above random guessing [12][20]
- Qwen2.5-VL-72B stood out, outperforming all proprietary models and giving a boost to the open-source community [13]

Group 5: Error Analysis
- The research identified three core bottlenecks in MLLMs: inaccurate estimation of quantitative spatial attributes, deficient understanding of temporal dynamics, and weak cross-modal integration [15][16][17]
- These bottlenecks expose significant gaps in MLLMs' capacity for precise spatial-temporal understanding and indicate directions for future research [19][20]

Group 6: Conclusion
- The STI-Bench results show that current MLLMs have serious shortcomings in precise spatial-temporal understanding, which is essential for their application in embodied intelligence and autonomous driving [20][21]
- The release of STI-Bench gives researchers a new benchmark for assessing and improving MLLMs' spatial-temporal understanding and points them towards potential solutions [21]
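
The comparison between model accuracy and random guessing reported in the experimental results can be made concrete with a small scoring sketch. It assumes the question-answer pairs are multiple-choice, which the summary implies but does not state; the record fields (`task`, `answer`, `prediction`, `num_options`) and the function name `score_by_task` are hypothetical, and the sample records are toy data, not benchmark results.

```python
from collections import defaultdict


def score_by_task(records):
    """Return per-task accuracy and the expected accuracy of random guessing.

    Each record is a dict with hypothetical keys:
      task        - task name, e.g. "speed_and_acceleration"
      answer      - ground-truth option label
      prediction  - option label chosen by the model
      num_options - number of answer options for that question (assumed multiple-choice)
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "chance": 0.0})
    for r in records:
        t = totals[r["task"]]
        t["n"] += 1
        t["correct"] += int(r["prediction"] == r["answer"])
        # A uniform random guess is right with probability 1 / num_options.
        t["chance"] += 1.0 / r["num_options"]
    return {
        task: {"accuracy": t["correct"] / t["n"], "chance": t["chance"] / t["n"]}
        for task, t in totals.items()
    }


# Toy records for illustration only; these are not STI-Bench data.
records = [
    {"task": "speed_and_acceleration", "answer": "A", "prediction": "A", "num_options": 4},
    {"task": "speed_and_acceleration", "answer": "B", "prediction": "C", "num_options": 4},
    {"task": "pose_estimation", "answer": "D", "prediction": "D", "num_options": 4},
]
result = score_by_task(records)
for task, stats in result.items():
    print(f"{task}: accuracy={stats['accuracy']:.2f}, chance={stats['chance']:.2f}")
```

Reporting the chance level next to each accuracy shows how far a score sits above guessing, which is the comparison the article draws for the best-performing models.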