Core Insights
- The article discusses the reliability of attention mechanisms in Vision-Language Models (VLMs), arguing that attention may not be a trustworthy indicator of semantic importance because it is distorted by structural biases [2][12]

Group 1: Attention Mechanism Issues
- Attention is influenced by structural biases, such as position bias, which favors later tokens in a sequence and can lead to misinterpretation during visual token pruning [3][5]
- The authors identify a "padding attention sink" phenomenon, in which padding regions receive disproportionately high attention, misleading pruning strategies [5][6]

Group 2: Proposed Solutions
- The research team from Shanghai University and Nankai University proposes a debiasing approach that corrects attention biases without introducing new pruning methods or additional training [6][12]
- By modeling the overall trend of the attention bias, the team removes irrelevant positional factors, making the remaining attention more semantically meaningful [6][12]

Group 3: Experimental Results
- The debiasing strategy was integrated as a plug-and-play module into several mainstream attention-based visual token pruning methods, yielding consistent performance improvements across multiple tasks [7][10]
- Experimental results indicate that pruning models equipped with the debiasing correction achieved stable performance gains, particularly under aggressive token compression [10][12]

Group 4: Conclusion
- The findings emphasize that attention is not inherently equivalent to semantic importance, and that ignoring structural biases can mislead pruning strategies and degrade overall model performance [12]
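The debiasing idea described above can be illustrated with a minimal sketch. This is not the paper's exact formula: it assumes the positional bias trend can be modeled with a simple linear fit over token positions (the paper may use a different functional form), subtracts that trend from the raw attention scores, and then prunes by the debiased residuals. The function name `debias_and_prune` and the linear-trend choice are hypothetical illustrations.

```python
import numpy as np

def debias_and_prune(attn, keep_ratio=0.25):
    """Illustrative sketch (not the paper's method): subtract a global
    positional trend from per-token attention before pruning.

    attn: 1-D array of attention mass each visual token receives,
          ordered by position in the sequence.
    Returns the sorted indices of tokens kept after debiasing.
    """
    n = len(attn)
    positions = np.arange(n)
    # Assumption: model the overall positional bias as a linear trend.
    slope, intercept = np.polyfit(positions, attn, deg=1)
    trend = slope * positions + intercept
    # Debiased score: residual attention after removing the trend.
    debiased = attn - trend
    # Keep the top-k tokens by debiased (semantic) score.
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(debiased)[-k:]
    return np.sort(keep)

# Toy example: attention grows with position (position bias), but one
# early token (index 2) carries a genuine semantic signal.
attn = np.linspace(0.0, 1.0, 16)
attn[2] += 0.5
kept = debias_and_prune(attn, keep_ratio=0.25)
```

In this toy case, ranking by raw attention would discard the early semantic token in favor of late-position tokens, while the debiased ranking retains it, which is the failure mode the article attributes to biased attention-based pruning.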
The Attention mechanism in multimodal large models hides a "trap" that one formula can correct | Shanghai University × Nankai University
量子位·2026-01-27 02:33