Don't Be Fooled by High Indoor Benchmark Scores: Are Large Models Reasoning About Space, or Just "Memorizing the Answers"?

Core Insights
- The article highlights the emergence of "Spatial Intelligence" as a new frontier in AI, particularly in large models, driven by advancements from scholars like Fei-Fei Li [2]
- It raises concerns about the validity of recent performance improvements in models, questioning whether they genuinely understand spatial reasoning or are merely overfitting to similar indoor data distributions [2][16]

Group 1: Limitations of Indoor Scene Data
- Research in spatial intelligence has predominantly focused on indoor scenes due to a lack of diverse outdoor datasets, which are often based on autonomous driving perspectives, differing fundamentally from first-person pedestrian views [5]
- The over-reliance on indoor data leads to high homogeneity between training and testing datasets, making it difficult to fairly assess models' spatial perception and reasoning capabilities [6]

Group 2: OSI-Bench Introduction
- The OSI-Bench, developed by the University of Chinese Academy of Sciences in collaboration with Microsoft Research Asia and ETH Zurich, aims to provide a more accurate assessment of spatial intelligence by utilizing original video data with precise 3D annotations from open-world environments [2][11]
- This benchmark allows for the evaluation of models' true spatial capabilities by decoupling semantic priors from visual spatial intelligence, particularly in complex outdoor settings [9]

Group 3: Evaluation Results
- Evaluation results from OSI-Bench indicate that current state-of-the-art (SOTA) multimodal large language models generally fail to perform well on spatial reasoning tasks [13]
- Despite some models showing significant improvements in indoor benchmarks, such as VSI-Bench, they consistently underperform in OSI-Bench, suggesting overfitting to specific scene distributions rather than genuine spatial intelligence acquisition [16]

Group 4: Language Priors and Model Performance
- When faced with spatial tasks, models tend to rely on language priors rather than engaging in visual geometric reasoning, leading to minimal performance differences with or without visual input [19][22]
- Experiments reveal that models struggle significantly in atypical scenarios where language priors fail, indicating a lack of robust spatial reasoning capabilities [23]

Group 5: Future Directions
- The article calls for a new paradigm in spatial intelligence that empowers models to perceive and think in spatial contexts, moving beyond mere data-driven distribution fitting [27]
- OSI-Bench's benchmark and evaluation code are open-sourced, with plans to continue releasing high-precision 3D information datasets to advance spatial intelligence from indoor to complex open-world scenarios [28]
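
The with-versus-without-visual-input comparison mentioned under Group 4 can be illustrated with a minimal Python sketch. It is not OSI-Bench's evaluation code. The `Item` record, the `vision_gain` helper, and the multiple-choice answer format are assumptions made for illustration, and the toy answers are invented purely to exercise the function. The idea is to score the same questions twice, once with the video and once with text only. A gap near zero suggests the model is answering from language priors and not from what it sees.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question answered under two conditions (hypothetical schema)."""
    gold: str        # correct option
    with_video: str  # model answer given video + question
    text_only: str   # model answer given the question alone


def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    if not golds:
        raise ValueError("no items to score")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def vision_gain(items):
    """Return (accuracy with video, accuracy text-only, difference)."""
    golds = [it.gold for it in items]
    acc_video = accuracy([it.with_video for it in items], golds)
    acc_blind = accuracy([it.text_only for it in items], golds)
    return acc_video, acc_blind, acc_video - acc_blind


# Toy, made-up answers used only to run the function; not results from any benchmark.
toy = [
    Item("A", "A", "A"),
    Item("B", "B", "B"),
    Item("C", "A", "A"),
    Item("D", "D", "B"),
]
acc_video, acc_blind, gain = vision_gain(toy)
print(f"with video: {acc_video:.2f}, text only: {acc_blind:.2f}, gain: {gain:.2f}")
```

In practice the gain would be reported per task category, since a small average gap can hide categories where visual input matters a great deal.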