Long Text Modeling

ICML 2025 | 1000x Length Generalization! Ant Group's New Attention Mechanism GCA Achieves Precise Understanding of 16M-Token Long Contexts
机器之心· 2025-06-13 15:45
Core Viewpoint
- The article discusses the challenges of long-text modeling in large language models (LLMs) and introduces a new attention mechanism, Grouped Cross Attention (GCA), that enables efficient processing of long contexts and could pave the way for advances toward artificial general intelligence (AGI) [1][2].

Long Text Processing Challenges and Existing Solutions
- Long-text modeling remains challenging because of the quadratic complexity of the Transformer architecture and the limited length extrapolation of full-attention mechanisms [1][6].
- Existing solutions such as sliding-window attention trade away long-range information retrieval for continuous generation, while other methods generalize poorly beyond their training length [7][8].

GCA Mechanism
- GCA is a novel attention mechanism that learns to retrieve and select relevant past segments of text, significantly reducing memory overhead during long-text processing [2][9].
- The mechanism operates in two stages: it first attends to each retrieved chunk separately, then fuses the per-chunk information to predict the next token (a rough sketch of this two-stage flow follows this summary) [14][15].

Experimental Results
- Models equipped with GCA achieved superior performance on long-text datasets, generalizing to contexts over 1000 times longer than the training length and reaching 100% accuracy on 16M-token context retrieval tasks [5][17].
- GCA's training cost scales linearly with sequence length, and its inference memory overhead approaches a constant, so processing remains efficient [20][21].

Conclusion
- The introduction of GCA marks a significant advance in long-context language modeling and could facilitate the development of intelligent agents with permanent memory capabilities [23].
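The two-stage retrieval-and-fusion idea described above can be illustrated with a short, self-contained sketch. It is written from the summary only: the chunk size, the mean-pooled chunk summaries, the top-k selection, and the softmax-weighted fusion are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a GCA-style two-stage step: retrieve relevant past chunks,
# attend inside each retrieved chunk, then fuse the per-chunk outputs.
# Chunking, scoring, and fusion choices here are assumptions for illustration.
import torch
import torch.nn.functional as F

def grouped_cross_attention(query, past_keys, past_values, chunk_size=64, top_k=4):
    """query: (d,) current-group query; past_keys/values: (T, d) cached context.
    Assumes T >= chunk_size for simplicity."""
    T, d = past_keys.shape
    n_chunks = T // chunk_size
    keys = past_keys[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    values = past_values[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    # Retrieval: score each past chunk by a summary key (mean-pooled here)
    chunk_summaries = keys.mean(dim=1)                    # (n_chunks, d)
    relevance = chunk_summaries @ query / d ** 0.5        # (n_chunks,)
    top_scores, top_idx = relevance.topk(min(top_k, n_chunks))

    # Stage 1: attend inside each retrieved chunk separately
    sel_k, sel_v = keys[top_idx], values[top_idx]         # (k, chunk_size, d)
    attn = F.softmax(sel_k @ query / d ** 0.5, dim=-1)    # (k, chunk_size)
    per_chunk = (attn.unsqueeze(-1) * sel_v).sum(dim=1)   # (k, d)

    # Stage 2: fuse chunk outputs, weighted by the (differentiable) relevance scores
    fuse_w = F.softmax(top_scores, dim=-1)                # (k,)
    return (fuse_w.unsqueeze(-1) * per_chunk).sum(dim=0)  # (d,)
```

In this sketch the fusion weights reuse the retrieval scores, so the chunk selector would receive gradients from the next-token loss; this mirrors the summary's point that GCA learns what to retrieve rather than relying on a fixed heuristic.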
ICML 2025 | Global Pooling + Local Retention: CCA-Attention Delivers a Breakthrough in LLM Long-Text Modeling
机器之心· 2025-06-08 08:21
Core Insights
- The article discusses the Core Context Aware Attention mechanism (CCA-Attention), developed by Pazhou Laboratory and South China University of Technology, which significantly improves the efficiency of long-context modeling [1][3]
- CCA-Attention achieves inference 7.9 times faster than standard self-attention while cutting key-value cache memory usage by 93%, setting a new benchmark for long-text processing [3][26]

Summary by Sections

Introduction
- CCA-Attention has been accepted at ICML 2025; it was posted to arXiv on December 17, 2024, ahead of related approaches such as DeepSeek NSA and Kimi MoBA [3][8]

Research Findings
- Recent studies show that attention weights in large language models (LLMs) concentrate on a small number of tokens, i.e., attention is highly sparse, and this sparsity can be exploited to reduce computational complexity [4][5]

Existing Methods
- Current sparse-attention methods often rely on predefined patterns, which can prevent the model from reaching critical information scattered across different positions in the context [6]

Proposed Solution
- CCA-Attention models long texts efficiently by combining global pooling attention with local retention attention, sharply lowering computational cost while preserving long-range dependency modeling [7][11]

Mechanism Details
- The mechanism consists of two complementary modules (a rough sketch of this split appears after this summary):
  - Global Pooling Module: extracts core tokens based on the importance of input tokens and uses them for subsequent attention computation [29]
  - Local Retention Module: focuses on nearby tokens to capture fine-grained contextual information, complementing the global pooling module [30]

Performance Evaluation
- CCA-Attention was applied to LLaMA2-7B models and compared against efficient attention methods such as StreamingLLM, LM-Infinite, and MInference, showing superior performance on long-text tasks [20][21]
- On the LongBench-E benchmark, CCA-LLM achieved the highest average score, outperforming the other methods on both LLaMA2-7B-32K and LLaMA2-7B-80K [21][22]

Efficiency Metrics
- CCA-Attention shows clear advantages in inference speed and memory usage, reaching a 5.7x speedup at 64K context length and a 7.9x speedup at 128K compared to standard self-attention [26][25]
- Key-value cache memory usage is reduced by up to 93%, underscoring its efficiency for long-sequence modeling [26][31]
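To make the two-module structure concrete, here is a minimal sketch of one attention step combining a global-pooling branch with a local-retention branch. The group size, the query-conditioned importance weighting used to form core tokens, the fixed local window, and the simple averaged fusion are all assumptions made for illustration; the actual CCA-Attention design may differ.

```python
# Rough sketch of a global-pooling + local-retention attention step.
# Core-token construction, window size, and the fusion rule are assumptions.
import torch
import torch.nn.functional as F

def cca_attention_step(q, keys, values, group_size=16, local_window=64):
    """q: (d,) query for the current token; keys/values: (T, d) preceding context.
    Assumes T >= group_size for simplicity."""
    T, d = keys.shape

    # Global pooling branch: compress each group of tokens into one core token,
    # weighting tokens inside the group by their importance w.r.t. the query
    n_groups = T // group_size
    g_k = keys[: n_groups * group_size].view(n_groups, group_size, d)
    g_v = values[: n_groups * group_size].view(n_groups, group_size, d)
    importance = F.softmax(g_k @ q / d ** 0.5, dim=-1)       # (n_groups, group_size)
    core_k = (importance.unsqueeze(-1) * g_k).sum(dim=1)     # (n_groups, d)
    core_v = (importance.unsqueeze(-1) * g_v).sum(dim=1)
    g_attn = F.softmax(core_k @ q / d ** 0.5, dim=-1)        # (n_groups,)
    global_out = g_attn @ core_v                              # (d,)

    # Local retention branch: ordinary attention over the most recent tokens only
    l_k, l_v = keys[-local_window:], values[-local_window:]
    l_attn = F.softmax(l_k @ q / d ** 0.5, dim=-1)
    local_out = l_attn @ l_v                                   # (d,)

    # Combine the coarse global view with the fine-grained local view
    return 0.5 * (global_out + local_out)
```

The intended trade-off is visible in the shapes: the global branch attends over only n_groups core tokens instead of all T past tokens, while the local branch keeps exact attention for a small recent window, which is the kind of reduction behind the reported compute and key-value cache savings.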