Spatio-temporal reasoning shortcut
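The scene stays still while the viewer moves: how do MLLMs cope with a real world where every step brings a new view? OST-Bench reveals the shortcomings of multimodal large language models in online spatio-temporal understanding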
36Kr · 2025-10-14 08:54
Core Insights
- The introduction of OST-Bench presents a new challenge for multimodal large language models (MLLMs) by focusing on dynamic online scene understanding, contrasting with traditional offline benchmarks [1][3][12]
- OST-Bench emphasizes the necessity for models to perform real-time perception, memory maintenance, and spatiotemporal reasoning based on continuous local observations [3][4][12]

Benchmark Characteristics
- OST-Bench is designed to reflect real-world challenges more accurately than previous benchmarks, featuring two main characteristics: online settings requiring real-time processing and cross-temporal understanding that integrates current and historical information [3][4][12]
- The benchmark categorizes dynamic scene understanding into three information types: agent spatial state, visible information, and agent-object spatial relationships, leading to the creation of 15 sub-tasks [7][12]

Experimental Results
- The performance of various models on OST-Bench reveals significant gaps between current MLLMs and human-level performance, particularly in complex spatiotemporal reasoning tasks [12][21]
- Models like Claude-3.5-Sonnet and GPT-4.1 show varying degrees of success across different tasks, with human-level performance significantly higher than that of the models [9][10][12]

Model Limitations
- Current MLLMs exhibit a tendency to take shortcuts in reasoning, often relying on limited information rather than comprehensive spatiotemporal integration, which is termed "spatio-temporal reasoning shortcut" [15][18]
- The study identifies that the models struggle with long-sequence online settings, indicating a need for improved mechanisms for complex spatial reasoning and long-term memory retrieval [12][21]

Future Directions
- The findings from OST-Bench suggest that enhancing complex spatial reasoning capabilities and long-term memory mechanisms will be crucial for the next generation of multimodal models to achieve real-world intelligence [22]