Attention Mechanism
The Attention Mechanism in Multimodal Large Models Hides a "Trick" That a Single Formula Can Correct
36Kr · 2026-01-27 08:15
**Core Insights**
- Vision-Language Models (VLMs) have made significant progress in multimodal understanding tasks, particularly visual question answering, image understanding, and video understanding, by using language-to-vision attention to assess the relevance between visual tokens and text [1]
- A critical issue is that attention may not reliably indicate semantic importance: structural biases shape its behavior and can mislead visual token pruning strategies [1][12]

**Attention Bias Sources**
- **Position bias (recency bias)**: attention tends to favor later tokens in a sequence, producing a systematic preference for visual tokens near the bottom of an image that does not correlate with their semantic relevance [2]
- **Padding attention sink**: padding areas of an image, which carry no useful information, often receive disproportionately high attention owing to extreme activation values in the hidden states, misleading pruning strategies [4]

**Debiasing Attention**
- Rather than introducing a new pruning method or an additional training process, the research team corrects the biases in attention itself: they model the stable trends of the biases and apply debiasing to restore the semantic relevance of attention [5]
- During the pruning phase, contributions from padding areas are explicitly suppressed, so the attention sink cannot distort token ranking [5]

**Experimental Results**
- The debiasing strategy was integrated as a plug-and-play module into various mainstream attention-based visual token pruning methods, tested across multiple VLMs (7B/13B) and evaluated on 10 image understanding tasks and 3 video understanding tasks [8]
- Pruning models with attention debiasing consistently achieved performance improvements, particularly under more aggressive token compression [8]

**Conclusion**
- Attention is not inherently equivalent to semantic importance in VLMs; ignoring its inherent structural biases can mislead pruning strategies and degrade overall model performance. The proposed debiasing method improves the reliability and generalization of visual token pruning without incurring additional training costs, offering a new perspective on the efficient deployment of multimodal models [12]
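The debiasing idea above can be sketched in a few lines: fit the stable positional trend of the attention scores and divide it out, leaving the semantically driven residual. The linear-in-log-space trend model and the function name below are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def debias_attention(attn, positions):
    """Remove a fitted positional (recency) trend from language-to-vision
    attention scores.

    attn:      1-D array of attention weights over visual tokens
    positions: their sequence positions

    Hypothetical sketch: model the bias as a linear trend of
    log-attention vs. position, subtract it, and renormalize.
    """
    log_attn = np.log(attn + 1e-12)
    # Least-squares fit of the positional trend (assumed linear here).
    slope, intercept = np.polyfit(positions, log_attn, deg=1)
    trend = slope * positions + intercept
    debiased = np.exp(log_attn - trend)   # divide out the trend
    return debiased / debiased.sum()      # renormalize to a distribution
```

On a purely position-driven attention pattern (e.g. exponential recency decay with no semantic signal), this correction recovers a near-uniform distribution, which is the desired behavior: nothing semantically special should survive the debiasing.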
The Attention Mechanism in Multimodal Large Models Hides a "Trick" That a Single Formula Can Correct | Shanghai University × Nankai University
量子位 (QbitAI) · 2026-01-27 02:33
**Core Insights**
- The article examines the reliability of attention mechanisms in Vision-Language Models (VLMs), arguing that attention may not be a trustworthy indicator of semantic importance because of structural biases [2][12]

**Group 1: Attention Mechanism Issues**
- Attention is shaped by structural biases such as position bias, which favors later tokens in a sequence and can lead to misinterpretation during visual token pruning [3][5]
- A "padding attention sink" phenomenon is identified, in which padding areas receive disproportionately high attention and mislead pruning strategies [5][6]

**Group 2: Proposed Solutions**
- The research team from Shanghai University proposes a debiasing approach that corrects attention biases without introducing a new pruning method or an additional training process [6][12]
- By modeling the overall trends of the attention biases, the team removes irrelevant positional factors and restores the semantic relevance of attention [6][12]

**Group 3: Experimental Results**
- Integrated as a plug-and-play module into various mainstream attention-based visual token pruning methods, the debiasing strategy yields consistent performance improvements across multiple tasks [7][10]
- Pruning models with the debiasing correction show stable performance gains, particularly under aggressive token compression [10][12]

**Group 4: Conclusion**
- Attention is not inherently equivalent to semantic importance; ignoring its inherent structural biases can mislead pruning strategies and degrade overall model performance [12]
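The padding-sink suppression described above is a simple intervention at ranking time: mask out padding tokens before selecting the top-k visual tokens to keep. The function name, the keep-ratio parameter, and the choice to force padding scores to negative infinity are illustrative assumptions.

```python
import numpy as np

def rank_visual_tokens(attn, is_padding, keep_ratio=0.5):
    """Select the top-k visual tokens by attention score while
    explicitly suppressing padding-area tokens, so the 'padding
    attention sink' can never win a slot.

    attn:       1-D array of (ideally debiased) attention scores
    is_padding: boolean mask marking padding-area tokens
    keep_ratio: fraction of tokens to keep after pruning
    """
    scores = attn.astype(float).copy()
    scores[is_padding] = -np.inf              # padding is never kept
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]       # top-k by score
    return np.sort(keep)                      # indices in sequence order
```

Even if a padding token carries the single highest raw attention score, it is excluded, and the freed slot goes to the next-best content token.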
Microsoft Research's Yang Yuqing: The Attention System of Agents | Attention
36Kr · 2025-09-05 03:42
**Core Insights**
- The article discusses TriangleMix, a structural optimization method for attention mechanisms in large models that addresses the computational bottleneck of the prefill stage while maintaining performance and accuracy [2][5][10]
- TriangleMix enables a hierarchical sparse attention architecture that significantly reduces latency and memory consumption, making it well suited to long-context tasks [8][10][36]

**Technical Overview**
- TriangleMix employs a layered attention strategy: standard dense attention in the first 16 layers, then a triangle-shaped mask in subsequent layers, reducing attention's computational complexity from O(N²) to O(N) [5][6]
- Tested on models such as Llama-3.1-8B-Instruct, the method cuts kernel latency from 750 ms to 49 ms, a 15.3x speedup, and reduces time to first token (TTFT) by 12%-32% [9][10]

**Performance Metrics**
- TriangleMix retains 99.7% of the original performance while applying triangle attention in the majority of the deep layers [8][10]
- Across various benchmark tasks, it demonstrates significant reductions in latency and memory usage with almost no loss in accuracy [9][10]

**Broader Implications**
- The research emphasizes viewing attention mechanisms within the larger context of agent systems, training mechanisms, and task structures, rather than as isolated components [12][26]
- Ongoing work at Microsoft Research focuses on optimizing agent-native systems, aiming to improve the efficiency and effectiveness of AI applications, particularly for users with specific needs [15][67]
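To see why a shaped sparse mask drops the cost from O(N²) to O(N), consider a mask in which each query attends only to a fixed number of early "sink" tokens plus a fixed-width local window: the number of kept query-key pairs then grows linearly in sequence length. The exact mask shape TriangleMix uses is not specified in the summary above, so the pattern, parameter names, and defaults below are assumptions for illustration only.

```python
import numpy as np

def sparse_prefill_mask(n, n_sink=64, n_local=256):
    """Boolean attention mask sketch for a shaped sparse pattern in
    deep prefill layers: each query attends (causally) to the first
    `n_sink` tokens and to its most recent `n_local` tokens.

    Kept entries per row are at most n_sink + n_local, so total
    attention work is O(n * (n_sink + n_local)) = O(n) for fixed
    window sizes, versus O(n^2) for dense causal attention.
    """
    q = np.arange(n)[:, None]       # query positions (rows)
    k = np.arange(n)[None, :]       # key positions (columns)
    causal = k <= q                 # no attending to the future
    sink = k < n_sink               # early "attention sink" tokens
    local = (q - k) < n_local       # recent local window
    return causal & (sink | local)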