The Attention Mechanism in Multimodal Large Models Hides a "Trap" That One Formula Can Fix
36Kr · 2026-01-27 08:15
**Core Insights**
- Vision-Language Models (VLMs) have made significant progress in multimodal understanding tasks, particularly visual question answering, image understanding, and video understanding, by using language-to-vision attention to assess the relevance of visual tokens to the text [1]
- A critical issue: attention may not reliably indicate semantic importance, because structural biases shape its behavior and can mislead visual-token pruning strategies [1][12]

**Attention Bias Sources**
- **Position Bias (Recency Bias)**: attention tends to favor later tokens in a sequence, producing a systematic preference for visual tokens near the bottom of an image that does not correlate with their semantic relevance [2]
- **Padding Attention Sink**: padding regions of an image, which carry no useful information, often receive disproportionately high attention due to extreme activation values in the hidden states, further misleading pruning strategies [4]

**Debiasing Attention**
- The research team proposed correcting the biases in attention itself rather than introducing new pruning methods or additional training processes.
- They modeled the stable trends of the biases in attention and applied debiasing to recover the semantic relevance of the attention scores [5]
- During pruning, contributions from padding regions are explicitly suppressed, so the attention sink cannot distort the token ranking [5]

**Experimental Results**
- The debiasing strategy was integrated as a plug-and-play module into several mainstream attention-based visual-token pruning methods, tested across multiple VLMs (7B/13B), and evaluated on 10 image-understanding tasks and 3 video-understanding tasks [8]
- Pruned models with attention debiasing consistently improved performance, especially under more aggressive token-compression settings [8]

**Conclusion**
- Attention is not inherently equivalent to semantic importance in VLMs. Ignoring its structural biases can mislead pruning strategies and degrade overall model performance. The proposed debiasing method improves the reliability and generalization of visual-token pruning without additional training cost, offering a new perspective on efficient deployment of multimodal models [12]
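The article does not reproduce the paper's actual formula, but the pipeline it describes (model the stable positional bias trend, subtract it, suppress padding tokens, then rank and prune) can be sketched as follows. The function name, the linear-trend bias model, and every parameter below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def debiased_token_pruning(attn, positions, padding_mask, keep_ratio=0.5):
    """Rank visual tokens by attention after removing two structural biases.

    attn:         (N,) mean language-to-vision attention per visual token
    positions:    (N,) normalized token position in [0, 1] (0 = top of image)
    padding_mask: (N,) True where a token comes from a padded image region

    (Hypothetical sketch: the linear bias model stands in for whatever
    trend model the paper actually fits.)
    """
    valid = ~padding_mask

    # 1. Position debiasing: fit a linear trend of attention vs. position
    #    over the real (non-padding) tokens, then subtract it. The residual
    #    is the position-corrected "semantic" attention signal.
    slope, intercept = np.polyfit(positions[valid], attn[valid], deg=1)
    debiased = attn - (slope * positions + intercept)

    # 2. Padding suppression: padding tokens can act as attention sinks,
    #    so exclude them from the ranking outright.
    debiased = debiased.copy()
    debiased[padding_mask] = -np.inf

    # 3. Keep the top-k real tokens by debiased score.
    k = max(1, int(keep_ratio * valid.sum()))
    return np.sort(np.argsort(debiased)[::-1][:k])
```

Because the correction only reorders existing attention scores, a sketch like this can sit in front of any attention-based pruning method without retraining, which matches the plug-and-play claim in the summary.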