Large Model Memory Compression
No Extra Cache Needed: Nvidia Open-Sources a Memory-Compression Scheme for Large Models, with a 2.7x Speedup on 128K Contexts
36Ke · 2026-01-14 08:22
Core Viewpoint
- Nvidia, a leader in open-source large-model technology, has introduced the TTT-E2E method in collaboration with several academic institutions to enhance memory capabilities in large models, achieving significant speed improvements on long texts [1][3].

Group 1: TTT-E2E Method Overview
- TTT-E2E processes 128K-token texts 2.7 times faster than full-attention models and achieves a 35-fold speedup on 2M contexts, without compromising performance [1].
- Unlike the recently popular DeepSeek memory module, TTT-E2E learns dynamically by compressing context, keeping the model in a learning state during testing [3][6].
- The method is built on a standard Transformer architecture with sliding-window attention, making it easy to deploy [6].

Group 2: Continuous Learning Approach
- TTT-E2E reframes long-text modeling as a "continuous learning" task: the model predicts the next word from the current context and updates its parameters through gradient descent [6].
- The training phase uses meta-learning to prepare the model for a "test-time learning" mode, ensuring quick adaptation at test time [6].

Group 3: Key Optimizations
- TTT-E2E combines three key optimizations: mini-batch processing paired with sliding-window attention, a precise update strategy that touches only the MLP layer, and a dual-MLP design that preserves pre-trained knowledge while absorbing new context [8][9].
- Performance is impressive: on a 3B-parameter model at 128K context length, test loss is comparable or superior to full-attention Transformers, while models such as Mamba 2 and Gated DeltaNet show significant performance drops on long texts [9].

Group 4: Performance and Limitations
- TTT-E2E keeps inference latency constant regardless of context length, giving a uniformly fast response for both 8K and 128K texts [13].
- However, it struggles with tasks that require precise detail recall: its memory compression may filter out seemingly irrelevant details, whereas full-attention models can recall information with minimal loss [13].
- The meta-learning phase of training is currently slower than standard pre-training because of the extra gradient computations [13].

Group 5: Research and Development
- The project is led by Yu Sun, a postdoctoral researcher at Stanford who has been developing the "test-time training" concept since 2019; TTT-E2E realizes an idea he proposed early in that line of work [15].
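The test-time learning loop summarized above — predict the next token, take a gradient step, and thereby compress the context into fixed-size weights instead of a growing KV cache — can be sketched as a toy example. This is hypothetical illustrative code, not Nvidia's implementation: the linear memory, learning rate, and random embeddings are all assumptions made for the sketch.

```python
import numpy as np

def ttt_compress(context, dim, lr=0.05):
    """Compress a sequence of embeddings into a fixed-size weight matrix W
    by running one SGD step per token on a next-embedding prediction loss."""
    W = np.zeros((dim, dim))                # fixed-size "memory", O(dim^2)
    for t in range(len(context) - 1):
        x, y = context[t], context[t + 1]   # current and next embedding
        pred = W @ x
        grad = np.outer(pred - y, x)        # d/dW of 0.5 * ||W x - y||^2
        W -= lr * grad                      # gradient step at "test time"
    return W

# Memory cost stays O(dim^2) no matter how long the context is.
dim = 8
context = np.random.default_rng(1).normal(size=(1000, dim))
W = ttt_compress(context, dim)
assert W.shape == (dim, dim)  # same size whether context is 1K or 2M tokens
```

The key point the sketch captures is that the "state" carried forward is a parameter matrix of fixed size, which is why inference latency does not grow with context length.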
No Extra Cache Needed! Nvidia Open-Sources a Memory-Compression Scheme for Large Models, with a 2.7x Speedup on 128K Contexts
量子位 · 2026-01-14 04:42
Core Viewpoint
- Nvidia has introduced the TTT-E2E method in collaboration with several research institutions to enhance memory capabilities in large models, significantly improving processing speed and efficiency on long texts [1][2].

Group 1: TTT-E2E Method Overview
- TTT-E2E processes 128K-token texts 2.7 times faster than full-attention models and achieves a 35-fold speedup on 2M contexts, without compromising performance [3].
- Unlike the recently popular DeepSeek memory module, TTT-E2E learns dynamically through context compression rather than following static learning paths [5][6].
- The method learns in real time, compressing key content into the model weights so the model stays in a learning state during testing [7][8].

Group 2: Technical Implementation
- TTT-E2E is based on a standard Transformer with sliding-window attention, making it easy to deploy without relying on complex architectures [11].
- The core idea shifts long-text modeling from an architectural-design problem to a "continuous learning" task [12].
- During testing, the model predicts the next word from the current context and updates its parameters through gradient descent, dynamically compressing information into its weights [13].

Group 3: Training and Optimization
- The training phase uses meta-learning to prepare the model for "test-time learning," treating each training sequence as if it were a test sequence [14].
- TTT-E2E combines three key optimizations: mini-batch processing with sliding windows, precise update strategies that touch only specific layers, and a dual-MLP design that balances absorbing new context against retaining pre-trained knowledge [16][17].

Group 4: Performance and Limitations
- Experimental data show that TTT-E2E matches or beats full-attention Transformers on test loss, while keeping inference latency constant regardless of context length [19][23].
- On tasks requiring precise detail recall, TTT-E2E falls short of full-attention models because its memory compression filters out seemingly irrelevant details [25][26].
- The meta-learning process in the training phase is currently slower than standard pre-training methods [27].

Group 5: Research and Development
- The project is led by Yu Sun, a postdoctoral researcher at Stanford, whose goal is AI systems that learn continuously like humans [29][30].
- The code and related papers for TTT-E2E have been fully open-sourced [28].
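The constant-latency behavior noted above rests on sliding-window attention: each token attends only to a fixed-size window of recent tokens, so per-token attention cost is bounded by the window size rather than the total context length. A minimal sketch of such a mask (illustrative code; the window size and sequence length are assumptions, not values from the paper):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: token i may attend to tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row has at most `window` True entries, so per-token attention cost
# is O(window), independent of how long the full context is.
assert mask[5].sum() == 3  # last token sees only the last 3 positions
assert mask[0].sum() == 1  # first token sees only itself
```

In TTT-E2E this window handles local detail, while the test-time-trained weights carry the compressed long-range context that the window drops.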