Contextual Knowledge Understanding

ICML 2025 | Massive Values in the Attention Mechanism: The Key to Unlocking Contextual Knowledge Understanding in Large Language Models
机器之心 · 2025-05-06 04:11
Core Insights

- The article discusses a significant phenomenon in large language models (LLMs): the concentration of massive values in the self-attention mechanism, particularly in the query (Q) and key (K) representations, which proves crucial for contextual knowledge understanding [1][3][4].

Research Highlights

- Massive values are highly concentrated in Q and K, contrary to the expectation that each attention head operates independently; this consistency across multiple layers and heads is demonstrated visually [3][4].
- The phenomenon appears specifically in models using Rotary Position Embedding (RoPE), such as LLaMA, Qwen, and Gemma, while models without RoPE, such as GPT-2 and OPT, do not exhibit it [4].
- The research establishes a direct link between the presence of massive values in Q and K and the model's ability to understand contextual knowledge [4].

Key Findings

1. **Concentration of Massive Values**: Massive values are highly concentrated in specific regions of each attention head, indicating a surprising level of consistency [3][4].
2. **Impact on Contextual Knowledge Understanding**: The presence of massive values is critical for contextual knowledge understanding, as demonstrated through destructive experiments that reset these values to their average [5][6].
3. **Quantization Techniques**: Quantization methods that explicitly handle massive values, such as AWQ and SmoothQuant, preserve contextual knowledge understanding better than methods that do not [7].
4. **Origin of the Concentration Phenomenon**: The concentration of massive values is attributed to RoPE, which affects the low-frequency regions of Q and K, causing the phenomenon to appear from the earliest layers of the model [8].
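The destructive experiment described above can be illustrated with a minimal NumPy sketch: flag entries whose magnitude is far above the overall spread as "massive," then reset them to the mean of the remaining entries. This is an illustrative stand-in, not the paper's actual procedure; the toy data, the `tau` threshold, and the function name are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy query matrix: mostly small values, plus "massive" entries
# concentrated in a few dimensions (mimicking the paper's observation
# that massive values cluster in specific regions of Q and K).
Q = rng.normal(0.0, 1.0, size=(8, 64))  # (tokens, head_dim)
Q[:, 5] += 25.0                          # massive values in dim 5
Q[:, 17] -= 25.0                         # and in dim 17

def reset_massive_values(x, tau=5.0):
    """Replace entries whose magnitude exceeds tau * std(x) with the
    mean of the remaining (non-massive) entries -- a simple stand-in
    for the paper's destructive intervention."""
    x = x.copy()
    mask = np.abs(x) > tau * x.std()
    x[mask] = x[~mask].mean()
    return x, mask

Q_reset, mask = reset_massive_values(Q)
print("massive entries found:", int(mask.sum()))
print("dims containing them:", sorted(set(np.nonzero(mask)[1].tolist())))
```

Feeding `Q_reset` (instead of `Q`) through attention corresponds to the "disrupt massive values" condition; disrupting only entries where `mask` is false corresponds to the control condition mentioned in the experimental results.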
Experimental Results

The experiments reveal a stark contrast in how massive values affect different knowledge tasks:

- **Resilience in Parametric Knowledge Retrieval**: Tasks relying on parametric knowledge decline by only 15-20% in accuracy when massive values are disrupted, retaining 76%-88% accuracy [10].
- **Catastrophic Decline in Contextual Knowledge Tasks**: Tasks requiring contextual understanding suffer a drastic drop in performance; accuracy on key-retrieval tasks plummets from 100% to near 0% when massive values are disrupted [11].
- **Control Experiments**: When only non-massive values are disrupted, task performance remains stable, confirming the unique importance of massive values for contextual understanding [12].

Future Directions

The research opens several avenues for further exploration: enhancing or adjusting the distribution of massive values to improve contextual understanding, examining the universality of this phenomenon across different architectures, and designing targeted quantization methods that protect the massive values tied to contextual understanding [16].
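The attribution of the phenomenon to RoPE rests on its frequency structure: each dimension pair is rotated at a frequency that decays geometrically with the pair index, so later ("low-frequency") pairs barely rotate even at large positions. The following NumPy sketch shows one common interleaved formulation of RoPE; it is illustrative only, and real implementations (e.g. half-split layouts) differ in detail.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Per-pair rotation frequencies used by RoPE: theta_i = base^(-2i/d).
    Early pairs rotate fast (high frequency); later pairs rotate slowly
    (low frequency) -- the region the paper links to massive values."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def apply_rope(x, pos):
    """Rotate consecutive (even, odd) dimension pairs of x by pos * theta_i."""
    d = x.shape[-1]
    angles = pos * rope_frequencies(d)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

d = 64
freqs = rope_frequencies(d)
print("fastest pair rotates %.0fx faster than the slowest" % (freqs[0] / freqs[-1]))

# At a large position, high-frequency pairs are thoroughly scrambled,
# while low-frequency pairs barely move from their original direction.
x = np.ones(d)
y = apply_rope(x, pos=1000)
```

Because the low-frequency pairs change little with position, information placed there survives the rotation largely intact across positions, which is consistent with the paper's claim that RoPE's low-frequency regions are where the concentration emerges from the earliest layers.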