ICML 2025 | Global Pooling + Local Retention: CCA-Attention Brings a Breakthrough to Long-Context Modeling for LLMs
机器之心·2025-06-08 08:21

Core Insights

- The article introduces the Core Context Aware Attention mechanism (CCA-Attention), developed by Pazhou Laboratory and South China University of Technology, which significantly improves the efficiency of long-context modeling [1][3]
- CCA-Attention reaches an inference speed 7.9 times faster than standard self-attention while cutting key-value cache memory usage by 93%, setting a new benchmark for long-text processing [3][26]

Summary by Sections

Introduction

- CCA-Attention has been accepted at ICML 2025; the work was submitted to arXiv on December 17, 2024, ahead of related methods such as DeepSeek NSA and Kimi MoBA [3][8]

Research Findings

- Recent studies show that attention weights in large language models (LLMs) concentrate on a small number of tokens, i.e., attention is highly sparse, and this sparsity can be exploited to reduce computational complexity [4][5] (a short measurement sketch appears after this summary)

Existing Methods

- Current sparse attention methods often rely on predefined sparsity patterns, which can prevent the model from reaching critical information scattered across different positions in the context [6]

Proposed Solution

- CCA-Attention models long texts efficiently by combining global pooling attention with local retention attention, substantially lowering computational cost while preserving long-range dependency modeling [7][11]

Mechanism Details

- The mechanism consists of two complementary modules (a minimal code sketch follows after this summary):
  - Global Pooling Module: extracts core tokens according to the importance of the input tokens and uses them for subsequent attention computation [29]
  - Local Retention Module: attends to nearby tokens to capture fine-grained local context, complementing the global pooling module [30]

Performance Evaluation

- CCA-Attention was applied to LLaMA2-7B models and compared against efficient attention methods such as StreamingLLM, LM-Infinite, and MInference, showing superior performance on long-text tasks [20][21]
- On the LongBench-E benchmark, CCA-LLM achieved the highest average score among the compared methods for both LLaMA2-7B-32K and LLaMA2-7B-80K [21][22]

Efficiency Metrics

- CCA-Attention shows clear advantages in inference speed and memory usage, reaching a 5.7x speedup at a 64K context length and a 7.9x speedup at a 128K context length compared with standard self-attention [25][26]
- Key-value cache memory usage is reduced by up to 93%, underscoring its efficiency for long-sequence modeling [26][31]
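The sparsity observation in the Research Findings above (most attention mass landing on a few tokens) can be illustrated with a short measurement. The snippet below computes the share of attention probability captured by the top-k keys for a query; it uses synthetic scores and is a generic illustration of the measurement, not code from the paper.

```python
# Illustration of attention sparsity: measure the share of attention mass
# captured by the top-k keys of each query. Purely synthetic example.
import torch


def topk_attention_mass(attn, k=64):
    """attn: (num_queries, num_keys), rows summing to 1. Returns mean top-k mass."""
    topk = attn.topk(k, dim=-1).values        # (num_queries, k)
    return topk.sum(dim=-1).mean().item()


if __name__ == "__main__":
    scores = torch.randn(1, 4096)             # one query over a 4K-token context
    diffuse = scores.softmax(dim=-1)
    print(topk_attention_mass(diffuse))       # moderate concentration
    peaked = (scores * 8).softmax(dim=-1)     # sharpened scores
    print(topk_attention_mass(peaked))        # nearly all mass on a few keys
```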
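To make the two-module design in Mechanism Details more concrete, the following is a minimal PyTorch sketch of the general idea, not the paper's implementation: a global pooling branch compresses each group of tokens into a core token via importance-weighted pooling and lets every query attend over the much shorter core sequence, while a local retention branch runs ordinary causal attention inside a sliding window, and the two outputs are merged. The function name cca_style_attention, the group size, the window size, the importance-scoring rule, and the simple averaging of the two branches are illustrative assumptions.

```python
# A minimal sketch of a global-pooling + local-retention attention layer,
# assuming single-head attention and a simple fusion rule. Not the paper's code.
import math
import torch
import torch.nn.functional as F


def cca_style_attention(q, k, v, group_size=16, window=64):
    """q, k, v: (batch, seq_len, dim). Returns (batch, seq_len, dim)."""
    b, n, d = q.shape
    scale = 1.0 / math.sqrt(d)

    # Global pooling branch: compress each group of tokens into one "core" token.
    pad = (-n) % group_size
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))
    q_pad = F.pad(q, (0, 0, 0, pad))
    g = k_pad.shape[1] // group_size
    k_groups = k_pad.view(b, g, group_size, d)
    v_groups = v_pad.view(b, g, group_size, d)
    # Importance of each token within its group (assumption: scored against the
    # mean query of its group; zero-padded tail positions are not masked for brevity).
    q_mean = q_pad.view(b, g, group_size, d).mean(dim=2, keepdim=True)
    scores = (q_mean * k_groups).sum(-1) * scale              # (b, g, group_size)
    weights = scores.softmax(dim=-1).unsqueeze(-1)            # (b, g, group_size, 1)
    core_k = (weights * k_groups).sum(dim=2)                  # (b, g, d)
    core_v = (weights * v_groups).sum(dim=2)                  # (b, g, d)
    # Every query attends to the g core tokens instead of all n tokens
    # (causality on this branch is omitted to keep the sketch short).
    global_out = ((q @ core_k.transpose(1, 2)) * scale).softmax(dim=-1) @ core_v

    # Local retention branch: causal attention restricted to a sliding window.
    idx = torch.arange(n, device=q.device)
    keep = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    local_scores = ((q @ k.transpose(1, 2)) * scale).masked_fill(~keep, float("-inf"))
    local_out = local_scores.softmax(dim=-1) @ v

    # Merge the two branches (assumption: simple average; the paper fuses them differently).
    return 0.5 * (global_out + local_out)


if __name__ == "__main__":
    x = torch.randn(2, 128, 64)
    print(cca_style_attention(x, x, x).shape)  # torch.Size([2, 128, 64])
```

In a structure of this kind, the global branch only needs to keep roughly n / group_size core key-value pairs plus the local window in cache rather than all n tokens, which is the sort of saving behind the reported key-value cache reduction and inference speedups.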