破解多模态大模型“选择困难症”！内部决策机制首次揭秘：在冲突信息间疯狂"振荡"

Core Argument - The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37]. Group 1: Contributions and Findings - A new controlled toy dataset was constructed to systematically manipulate the reasoning difficulty of visual and textual inputs [4]. - The study decomposes modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5]. - A fundamental finding indicates that the probability of a model following a certain modality decreases monotonically as the relative reasoning uncertainty of that modality increases [5]. - The framework provides a more reasonable method for quantifying inherent preference, defining it as the balance point where the model treats both modalities equally [5][22]. - The research explores the internal decision-making mechanisms of models, revealing oscillations in predictions when uncertainty is near the balance point [5][29]. Group 2: Experimental Design - The researchers established a controlled experimental environment using a novel toy dataset that independently controls visual and textual reasoning complexity [9][10]. - A model-centered uncertainty metric, output entropy, was employed to reflect the model's perceived uncertainty [11]. - Relative single-modal uncertainty was introduced to quantify the confidence gap in each conflicting case, serving as a core metric for subsequent analysis [12]. Group 3: Limitations of Traditional Metrics - Traditional macro metrics like Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14]. - The study identifies two puzzles regarding the models' preferences and difficulty perceptions, suggesting that traditional metrics obscure the true motivations behind model decisions [16][23]. Group 4: New Experimental Paradigm - A new experimental paradigm was designed to decouple model capability from preference, allowing for a clearer understanding of the models' decision-making processes [18]. - The researchers grouped data points based on relative uncertainty to create a complete preference curve reflecting how model preferences change with relative difficulty [18]. Group 5: Key Experimental Discoveries - All tested models exhibited a consistent trend: as text becomes relatively more difficult, the probability of following text decreases smoothly [19][21]. - The balance point quantifies inherent preference, indicating whether a model has a visual or textual bias based on its position on the relative uncertainty axis [22]. - The framework successfully explains the previously mentioned puzzles by revealing differences in inherent preferences among models [23][24]. Group 6: Internal Mechanisms - The study investigates why models exhibit oscillations in decision-making when approaching their balance point, providing a mechanism for observed indecision [29][33]. - The distinction between clear and ambiguous regions in input uncertainty is made, with oscillation frequency being significantly higher in ambiguous regions [30][34].