Modality imbalance
Language priors as an "overly strong foundation": how can visual decay in MLLMs be addressed?
机器之心 · 2025-11-01 02:30
Core Viewpoint
- The article discusses the limitations of Multimodal Large Language Models (MLLMs) in effectively integrating visual information, highlighting a systemic bias towards text and diminishing attention to visual tokens over extended reasoning chains [1].

Group 1: Visual Information Neglect in MLLMs
- MLLMs, which build on the Transformer architecture, have made progress on tasks such as visual question answering and image description by combining the reasoning ability of language models with visual encoding [5].
- Their attention distribution shows a systemic bias: the models over-rely on language and neglect visual information, especially in complex reasoning scenarios [5][6].
- As reasoning chains lengthen, the model's attention to image content drops significantly while its attention to language tokens grows, so the model leans on language cues rather than visual evidence; the diagnostic sketch below illustrates how this decay can be measured [5][6].

Group 2: Amplification of Visual Errors in Deep Reasoning
- The modality imbalance in MLLMs stems from the dominance of text during training: language corpora often run to trillions of tokens, giving the underlying LLMs very strong language priors [8].
- Although visual features are represented in high-dimensional embeddings, they are frequently overshadowed by language features and neglected already during the initial fusion step [8][9].
- MLLM training objectives also favor language data, which is more abstract and compact, so the model adopts shortcut-learning strategies that prioritize text over harder-to-process visual information [9].
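The attention decay described in Group 1 can be quantified by tracking what fraction of a model's attention mass lands on image tokens at each decoding step. The article does not provide code; the following is a minimal illustrative sketch in which the function name `visual_attention_share`, the toy attention matrix, and the simulated drift toward text keys are all assumptions made for demonstration, not the paper's actual measurement pipeline.

```python
import numpy as np

def visual_attention_share(attn, visual_mask):
    """Fraction of attention mass placed on visual tokens at each decoding step.

    attn:        (num_steps, num_keys) attention weights from each generated
                 token (query) over all key positions, e.g. averaged over
                 heads and layers.
    visual_mask: (num_keys,) boolean array, True where the key position is an
                 image token (patch embeddings projected into the LLM).
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    visual_mass = attn[:, visual_mask].sum(axis=1)
    total_mass = attn.sum(axis=1)
    return visual_mass / np.clip(total_mass, 1e-12, None)

# Toy illustration: 6 decoding steps, 10 key positions (first 4 are image patches).
rng = np.random.default_rng(0)
num_steps, num_keys, num_visual = 6, 10, 4
logits = rng.normal(size=(num_steps, num_keys))
# Simulate the reported drift: image keys lose logit mass as the chain lengthens.
logits[:, :num_visual] -= np.linspace(0.0, 2.0, num_steps)[:, None]
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mask = np.zeros(num_keys, dtype=bool)
mask[:num_visual] = True
print(np.round(visual_attention_share(attn, mask), 3))
```

Run on real model traces, a monotonically shrinking share over decoding steps would correspond to the visual-attention decay the article describes; the synthetic drift here only stands in for that behaviour.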