Length Generalization

A Transformer Blind Spot: With Only 500 Post-Training Steps, Recurrent Models Break Through the 256k Length-Generalization Limit
机器之心· 2025-07-08 04:09
Core Insights
- The article discusses the advantages of linear recurrent models, such as Mamba, and of linear attention mechanisms in handling long sequences, which is crucial for long-context reasoning tasks [1][2]
- It highlights how recurrent models have improved over time and can now compete with Transformers on a range of tasks, despite earlier limitations [3]
- A key finding is that recurrent models struggle to generalize beyond their training length, with performance dropping sharply on longer sequences [4][6]

Group 1
- The article presents a solution to this generalization issue: simple training interventions allow recurrent models to generalize to sequences of up to 256k tokens with only 500 additional training steps [7]
- The research argues that recurrent models have untapped potential rather than inherent flaws [7][8]
- The authors propose the "Unexplored States Hypothesis" to explain why recurrent models fail to generalize in length: during training they only learn from a limited subset of the states they can reach at inference time [13][14]

Group 2
- The article outlines four training interventions that improve length generalization by altering the initial state of the model [19]
- The interventions are Random Noise, Fitted Noise, State Passing, and truncated backpropagation through time (TBTT), each designed to expose the model to a broader distribution of states; a sketch of the State Passing variant follows this summary [19][20]
- The findings show that State Passing and TBTT effectively enable length generalization while using only 0.02% of the original pre-training budget [23][24]

Group 3
- The article evaluates these interventions on a range of long-context tasks and shows that they consistently enhance length generalization [31]
- The tasks include the BABILong benchmark, passkey retrieval, and synthetic copying, on all of which the interventions significantly improved model performance [32][35][39]
- The results indicate that models trained with these interventions can exploit relationships between tokens that lie beyond the training context length [36][39]

Group 4
- The article introduces "Effective Remembrance" to measure how strongly a model's current prediction still depends on distant past tokens, the goal being models that focus on recent context rather than on far-away tokens; a probe-style sketch follows this summary [44][50]
- It shows that State Passing improves the model's effective remembrance behavior, allowing it to prioritize recent tokens in its predictions [51][52]
- This adjustment is crucial for text modeling, since it keeps very early tokens from disproportionately influencing the model's output [52]
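To make the State Passing intervention concrete, here is a minimal PyTorch-style sketch of the idea described above: the final recurrent state of one training sequence is detached and reused as the initial state of the next, so the model is trained on a broader distribution of initial states. The `model(inputs, initial_state=...)` interface, the `batches` iterator, and `d_state` are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def train_with_state_passing(model, optimizer, batches, d_state):
    """Sketch of the State Passing intervention (interface assumed):
    carry the recurrent state across training sequences so the model
    visits states it would otherwise only reach on long inputs."""
    state = torch.zeros(1, d_state)  # the usual zero initial state
    for inputs, targets in batches:
        # detach() keeps gradients from flowing across sequence boundaries;
        # only the distribution of initial states changes, not the loss graph
        logits, state = model(inputs, initial_state=state.detach())
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

The Random Noise and Fitted Noise interventions would instead replace the zero initial state with noise (in the fitted case, presumably noise matched to the statistics of states observed during training), while TBTT truncates backpropagation across segment boundaries; all four share the goal of widening the set of initial states the model sees.

Similarly, "Effective Remembrance" can be thought of as a probe that compares the model's prediction with and without distant context. The sketch below is an assumed formalization (the paper's exact metric may differ), with `model` assumed to map a token tensor of shape (batch, length) to logits of shape (batch, length, vocab): it measures the gap between the next-token distribution given the full prefix and the one given only the most recent `window` tokens.

```python
import torch
import torch.nn.functional as F

def effective_remembrance(model, tokens, t, window):
    """Assumed probe: how much do tokens before the last `window`
    positions still influence the prediction at position t?"""
    with torch.no_grad():
        full = F.softmax(model(tokens[:, :t])[:, -1], dim=-1)              # full prefix
        recent = F.softmax(model(tokens[:, t - window:t])[:, -1], dim=-1)  # recent tokens only
    # Total-variation distance between the two predictive distributions:
    # near 0 means distant tokens barely matter, near 1 means they dominate.
    return 0.5 * (full - recent).abs().sum(dim=-1)
```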
ICML 2025 | 1000x Length Generalization! Ant Group's New Attention Mechanism GCA Achieves Precise Understanding of 16M-Token Long Contexts
机器之心· 2025-06-13 15:45
Core Viewpoint
- The article discusses the challenges of long-text modeling in large language models (LLMs) and introduces a new attention mechanism called Grouped Cross Attention (GCA) that enhances the ability to process long contexts efficiently, potentially paving the way for advances toward artificial general intelligence (AGI) [1][2]

Long Text Processing Challenges and Existing Solutions
- Long-text modeling remains challenging due to the quadratic complexity of the Transformer architecture and the limited extrapolation capabilities of full-attention mechanisms [1][6]
- Existing solutions such as sliding-window attention sacrifice long-range information retrieval for continuous generation, while other methods have limited generalization capabilities [7][8]

GCA Mechanism
- GCA is a novel attention mechanism that learns to retrieve and select relevant past segments of text, significantly reducing memory overhead during long-text processing [2][9]
- The mechanism operates in two stages: it first performs attention on each retrieved chunk separately, then fuses the information from these chunks to predict the next token; a sketch of this two-stage idea follows below [14][15]

Experimental Results
- Models incorporating GCA demonstrated superior performance on long-text datasets, achieving over 1000x length generalization and 100% accuracy on 16M-token long-context retrieval tasks [5][17]
- The GCA model's training cost scales linearly with sequence length, and its inference memory overhead approaches a constant, maintaining efficient processing speeds [20][21]

Conclusion
- The introduction of GCA represents a significant advance in long-context language modeling, with the potential to enable intelligent agents with permanent memory [23]
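To illustrate the retrieval-then-fusion idea behind GCA described above, here is a minimal sketch. The function name, tensor shapes, and chunk-scoring choice (mean-pooled chunk keys with top-k selection) are illustrative assumptions, not the paper's implementation; the key point is that per-chunk attention readouts are fused with weights derived from the retrieval scores, so chunk retrieval can be trained end-to-end from the language-modeling loss.

```python
import torch
import torch.nn.functional as F

def grouped_cross_attention_sketch(query, past_chunks, top_k=4):
    """Two-stage sketch (shapes assumed):
    query:       (d,)                     hidden state at the current position
    past_chunks: (n_chunks, chunk_len, d) cached representations of past segments
    """
    # Retrieval: score past chunks (here via mean-pooled keys) and keep only
    # the top-k relevant ones, so memory does not grow with context length.
    chunk_keys = past_chunks.mean(dim=1)                      # (n_chunks, d)
    scores = chunk_keys @ query                               # (n_chunks,)
    top_scores, top_idx = scores.topk(min(top_k, scores.numel()))
    selected = past_chunks[top_idx]                           # (k, chunk_len, d)

    # Stage 1: attend inside each selected chunk separately.
    attn = F.softmax(selected @ query, dim=-1)                # (k, chunk_len)
    per_chunk = (attn.unsqueeze(-1) * selected).sum(dim=1)    # (k, d)

    # Stage 2: fuse the per-chunk readouts, weighted by the differentiable
    # retrieval scores, before the fused vector feeds next-token prediction.
    weights = F.softmax(top_scores, dim=-1)                   # (k,)
    return (weights.unsqueeze(-1) * per_chunk).sum(dim=0)     # (d,)
```

Because only the top-k chunks are materialized at each step, the memory touched per token stays roughly constant even as the total context grows, which is consistent with the near-constant inference memory overhead reported above.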