Don't Be Fooled by High Scores on Indoor Benchmarks: Are Large Models Reasoning About Space, or Just "Memorizing Answers"?
机器之心· 2026-01-06 09:38
In 2025, as Fei-Fei Li and other researchers pushed "Spatial Intelligence" into the spotlight, the field quickly became a new arena for large-model competition. General-purpose large models and various specialist models have been setting new SOTA results on numerous indoor spatial-reasoning benchmarks, as if AI had learned during training to genuinely understand three-dimensional space.

Behind this, however, lies a hidden concern: because data with accurate 3D annotations is scarce, the data used to train these models (e.g., ScanNet++, ARKitScenes) often comes from largely the same sources as the test benchmarks. This "inbreeding" of data forces us to ask: does the recent surge in model scores reflect genuinely acquired spatial-geometric reasoning, or have models simply "seen enough" of similar indoor data distributions to learn to "memorize the answers"?

To answer this question, the Machine Learning and Perception Lab at the University of Chinese Academy of Sciences, together with Microsoft Research Asia and ETH Zurich, released a new spatial-intelligence benchmark, OSI-Bench. Starting from the data source itself, it is built on self-collected open-world video data with accurate 3D annotations, enabling a genuine diagnosis of spatial intelligence. From this starting point, the work re-examines whether current large models' spatial abilities have actually improved. The true spatial-intelligence gap may not be bridgeable by simple fine-tuning under the existing data paradigm.

The Limits of Indoor Scenes

In recent years, research on spatial intelligence has mostly focused on indoor sc ...
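The "inbreeding" concern above is essentially a train/test source-overlap question. A minimal sketch of how one might quantify it, using entirely made-up scene identifiers (none of these IDs come from the article or the actual datasets):

```python
# Hypothetical check for train/benchmark "inbreeding": how many scene IDs
# (or capture sources) appear in both a model's training corpus and a
# spatial-reasoning benchmark. All IDs below are invented for illustration.

train_scenes = {"scannetpp_0012", "scannetpp_0047", "arkit_1103", "arkit_2210"}
bench_scenes = {"scannetpp_0047", "arkit_1103", "osi_outdoor_003"}

# Set intersection gives the scenes the model may have "seen" in training.
shared = train_scenes & bench_scenes
overlap_ratio = len(shared) / len(bench_scenes)

print(sorted(shared))          # → ['arkit_1103', 'scannetpp_0047']
print(f"{overlap_ratio:.0%}")  # → 67%
```

A high overlap ratio would suggest benchmark scores measure memorization of familiar sources rather than transferable spatial reasoning, which is the gap OSI-Bench's self-collected open-world data is meant to close.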
When Language Priors Are "Too Strong": What Can Be Done About Visual Decay in MLLMs?
机器之心· 2025-11-01 02:30
Core Viewpoint
- The article examines the limitations of Multimodal Large Language Models (MLLMs) in effectively integrating visual information, highlighting a systemic bias toward text and diminishing attention to visual tokens over extended reasoning chains [1].

Group 1: Visual Information Neglect in MLLMs
- MLLMs, built on the Transformer architecture, have made progress on tasks such as visual question answering and image description by combining a language model's reasoning with visual encoding capabilities [5].
- Their attention distribution shows a systemic bias: over-reliance on language and neglect of visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains lengthen, the model's focus on image content drops significantly while attention to language tokens grows, leading it to rely on language cues rather than visual content [5][6].

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance in MLLMs stems from training dominated by text data, often trillions of tokens, which gives the underlying LLM strong language priors [8].
- Although visual features are represented in high dimensions, they are often overshadowed by language features and neglected during the initial fusion stage [8][9].
- MLLM training objectives favor language data, which is more abstract and compact, so models adopt shortcut learning strategies that prioritize text over complex visual information [9].
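The attention decay described in Group 1 can be quantified with a simple metric: the fraction of each decoding step's attention mass that lands on image tokens. A minimal NumPy sketch, assuming we already have attention weights averaged over heads and layers (the function name and toy numbers are illustrative, not from the article):

```python
import numpy as np

def visual_attention_ratio(attn, is_visual):
    """For each query position (decoding step), return the fraction of
    attention mass placed on visual tokens.

    attn      : (num_queries, num_keys) attention weights; rows sum to 1
    is_visual : (num_keys,) boolean mask marking image-token positions
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(is_visual, dtype=bool)
    # Boolean indexing selects the image-token columns; summing each row
    # gives that step's total attention on visual content.
    return attn[:, mask].sum(axis=1)

# Toy example: 4 keys, the first 2 are image tokens; 3 decoding steps
# whose attention drifts from visual toward text tokens.
attn = np.array([
    [0.40, 0.30, 0.20, 0.10],   # step 1: most attention on image tokens
    [0.20, 0.20, 0.30, 0.30],   # step 2: balanced
    [0.05, 0.05, 0.50, 0.40],   # step 3: almost all on text tokens
])
is_visual = np.array([True, True, False, False])

ratios = visual_attention_ratio(attn, is_visual)
print(ratios.round(2))  # → [0.7 0.4 0.1]
```

A monotonically decreasing curve of this ratio over decoding steps is exactly the "visual decay" pattern the article describes: later reasoning steps are driven almost entirely by language priors.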