破解多模态大模型“选择困难症”！内部决策机制首次揭秘：在冲突信息间疯狂"振荡"

Core Argument - The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37]. Group 1: Research Contributions - A new toy dataset was constructed to systematically and independently vary the reasoning difficulty of visual and textual inputs, enabling different difficulty combinations for multi-modal inputs [4]. - The study decomposes the explicit behavior of modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5]. - An empirical finding indicates that the probability of a model following a certain modality decreases monotonically as the relative reasoning uncertainty of that modality increases [5]. Group 2: Framework Design - A controlled dataset was created to validate hypotheses, allowing independent control of visual and textual reasoning complexity [9][10]. - Uncertainty was measured using output entropy, which reflects the model's perceived uncertainty, with lower entropy indicating confident predictions and higher entropy indicating consideration of alternative options [11]. - Relative uncertainty was quantified to measure the confidence gap between text and visual modalities, providing a core metric for subsequent analysis [12]. Group 3: Limitations of Traditional Metrics - Traditional macro metrics like Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14]. - The study identifies a common trend where models perceive text as easier on average, yet exhibit opposite macro preferences, raising questions about the underlying reasons for these discrepancies [15][16]. Group 4: Experimental Paradigm - A new experimental paradigm was designed to decouple model capability from preference, allowing for a clearer understanding of the model's decision-making process [18]. - The researchers grouped data points based on relative uncertainty to create a complete preference curve, reflecting how model preferences change dynamically with relative difficulty [18]. Group 5: Key Experimental Findings - All tested models exhibited a consistent trend where the probability of following text decreases smoothly as text becomes relatively more difficult [19][21]. - The "balance point" was defined as the point where the curve crosses the 50% probability line, serving as a quantifiable measure of inherent modality preference [22]. - The framework successfully explained previous puzzles regarding model behavior by revealing differences in inherent preferences that were not visible in macro metrics [23][24]. Group 6: Internal Mechanisms - The study explored the internal decision-making mechanisms of models, particularly their oscillation behavior when faced with conflicting information near the balance point [29][30]. - The findings indicate that models exhibit higher oscillation counts in ambiguous regions, providing a mechanistic explanation for observed indecision in external behavior [34][36]. Conclusion - The research presents a new framework for understanding modality following in MLLMs, emphasizing the importance of separating model capability from inherent preference, and revealing a robust rule that the likelihood of following a modality decreases with increasing relative uncertainty [37].