Online Spatiotemporal Understanding

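The Scene Stays Still While the Observer Moves: How Do MLLMs Cope with a Real World Where the View Changes with Every Step? OST-Bench Reveals the Shortcomings of Multimodal Large Models in Online Spatiotemporal Understanding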
机器之心· 2025-10-14 06:33
Core Insights
- The article introduces OST-Bench, a new benchmark for evaluating multimodal large language models (MLLMs) in dynamic online environments, and emphasizes the challenges of real-world embodied perception and reasoning [2][24].

Group 1: Benchmark Characteristics
- OST-Bench reflects the core challenges of embodied perception in real-world settings, in contrast to traditional offline benchmarks, which do not account for dynamic scene exploration [2][7].
- The benchmark assesses a model's ability to perform real-time perception, memory maintenance, and spatiotemporal reasoning from continuous local observations [7][10].
- It includes 15 sub-tasks categorized into judgment, estimation, counting, and temporal localization, with a dataset of 10,000 test samples and 50,000 training samples [8][10].

Group 2: Model Performance and Challenges
- Current mainstream MLLMs show significant performance gaps relative to humans, particularly in reasoning over information that spans multiple time steps [17].
- Models struggle with complex spatiotemporal reasoning tasks and often resort to "spatio-temporal reasoning shortcuts," producing superficial answers without adequate reasoning [18][21].
- Fine-tuning experiments indicate that models can improve their scores by over 10% with additional training data, yet they still fail to exceed 50% accuracy on complex reasoning tasks, highlighting the need for better model design and training strategies [23][24].