Language Priors
Don't Be Fooled by High Scores on Indoor Benchmarks: Are Large Models Reasoning About Space, or Just "Memorizing Answers"?
机器之心· 2026-01-06 09:38
Core Insights
- The article highlights the emergence of "Spatial Intelligence" as a new frontier in AI, particularly in large models, driven by advancements from scholars like Fei-Fei Li [2]
- It raises concerns about the validity of recent performance improvements in models, questioning whether they genuinely understand spatial reasoning or are merely overfitting to similar indoor data distributions [2][16]

Group 1: Limitations of Indoor Scene Data
- Research in spatial intelligence has predominantly focused on indoor scenes due to a lack of diverse outdoor datasets, which are often based on autonomous driving perspectives, differing fundamentally from first-person pedestrian views [5]
- The over-reliance on indoor data leads to high homogeneity between training and testing datasets, making it difficult to fairly assess models' spatial perception and reasoning capabilities [6]

Group 2: OSI-Bench Introduction
- OSI-Bench, developed by the University of Chinese Academy of Sciences in collaboration with Microsoft Research Asia and ETH Zurich, aims to provide a more accurate assessment of spatial intelligence by utilizing original video data with precise 3D annotations from open-world environments [2][11]
- This benchmark allows for the evaluation of models' true spatial capabilities by decoupling semantic priors from visual spatial intelligence, particularly in complex outdoor settings [9]

Group 3: Evaluation Results
- Evaluation results from OSI-Bench indicate that current state-of-the-art (SOTA) multimodal large language models generally fail to perform well on spatial reasoning tasks [13]
- Despite some models showing significant improvements on indoor benchmarks such as VSI-Bench, they consistently underperform on OSI-Bench, suggesting overfitting to specific scene distributions rather than genuine spatial intelligence acquisition [16]

Group 4: Language Priors and Model Performance
- When faced with spatial tasks, models tend to rely on language priors rather than engaging in visual geometric reasoning, leading to minimal performance differences with or without visual input [19][22]
- Experiments reveal that models struggle significantly in atypical scenarios where language priors fail, indicating a lack of robust spatial reasoning capabilities [23]

Group 5: Future Directions
- The article calls for a new paradigm in spatial intelligence that empowers models to perceive and think in spatial contexts, moving beyond mere data-driven distribution fitting [27]
- OSI-Bench's benchmark and evaluation code are open-sourced, with plans to continue releasing high-precision 3D information datasets to advance spatial intelligence from indoor to complex open-world scenarios [28]
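
The with/without-vision comparison described under Group 4 can be illustrated with a small sketch. This is not OSI-Bench's evaluation code: the `Result` record, the `vision_gain` helper, and the sample data are all hypothetical, and it assumes each question has already been graded once with the video and once with the question text alone. A gain close to zero would be consistent with a model answering from language priors rather than from what it sees.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One benchmark question answered twice (hypothetical record layout)."""
    correct_with_vision: bool  # model saw the video and the question
    correct_blind: bool        # model saw only the question text


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def vision_gain(results):
    """Accuracy with visual input minus accuracy without it."""
    with_vision = accuracy(r.correct_with_vision for r in results)
    blind = accuracy(r.correct_blind for r in results)
    return with_vision - blind


# Made-up illustrative data, not results from OSI-Bench.
demo = [
    Result(True, True),
    Result(True, True),
    Result(False, False),
    Result(True, False),
]
print(f"vision gain: {vision_gain(demo):.2f}")
```

On real data the same comparison would normally be broken down by task type, since a small overall gain can hide tasks where visual input matters a great deal.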
Language Priors Have "Too Strong a Foundation": What Can Be Done About Visual Decay in MLLMs?
机器之心· 2025-11-01 02:30
Core Viewpoint
- The article discusses the limitations of Multimodal Large Language Models (MLLMs) in effectively integrating visual information, highlighting a systemic bias towards text and diminishing attention to visual tokens as reasoning chains grow longer [1].

Group 1: Visual Information Neglect in MLLMs
- MLLMs, built on the Transformer architecture, have made progress in tasks like visual question answering and image description by combining language model reasoning with visual encoding capabilities [5].
- There is a systemic bias in MLLMs' attention distribution, leading to an over-reliance on language and a neglect of visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains lengthen, the model's attention to image content decreases significantly while its attention to language tokens increases, so it comes to rely on language cues over visual content [5][6].

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance within MLLMs stems from the disproportionate share of text data in training, often at the trillion scale, which gives the underlying LLMs strong language priors [8].
- Visual features, despite being represented in high dimensions, are often overshadowed by language features, leading to their neglect during the initial fusion process [8][9].
- The training objectives of MLLMs favor language data, which is more abstract and compact, causing the model to adopt shortcut learning strategies that prioritize text over complex visual information [9].
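
The attention drift described under Group 1 can be made concrete with a rough sketch. It assumes the attention weights for each generated token have already been extracted and averaged over heads and layers, and that the visual tokens occupy one contiguous span of the context; the function name `visual_share_per_step` and the toy numbers are hypothetical and do not come from the article.

```python
def visual_share_per_step(attn_rows, num_visual, visual_start=0):
    """For each decoding step, return the share of attention mass on visual tokens.

    attn_rows:    one list of attention weights per generated token; rows grow
                  by one entry per step as the context gets longer.
    num_visual:   number of visual tokens in the context (assumed contiguous).
    visual_start: index of the first visual token.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        visual = sum(row[visual_start:visual_start + num_visual])
        shares.append(visual / total if total else 0.0)
    return shares


# Toy weights: two visual tokens followed by a growing run of text tokens.
rows = [
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.3, 0.3],
    [0.1, 0.1, 0.3, 0.3, 0.2],
]
for step, share in enumerate(visual_share_per_step(rows, num_visual=2), start=1):
    print(f"step {step}: {share:.2f} of attention on visual tokens")
```

A share that falls steadily across steps, as in this toy input, is the pattern the article describes; whether a given model shows it would have to be measured on that model's own attention maps.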