No extra cache needed: Nvidia open-sources a memory-compression scheme for large models, with a 2.7x speedup on 128K contexts
Nvidia (US:NVDA) · 36Ke · 2026-01-14 08:22

Core Viewpoint
- Nvidia, in collaboration with several academic institutions, has open-sourced TTT-E2E, a method that strengthens memory capabilities in large models and delivers significant speedups on long texts [1][3].

Group 1: TTT-E2E Method Overview
- TTT-E2E processes 128K-token texts 2.7 times faster than full-attention models and achieves a 35-fold speedup on 2M contexts, without compromising performance [1].
- Unlike the recently popular DeepSeek memory module, TTT-E2E compresses context through dynamic learning, keeping the model in a learning state even during testing [3][6].
- The method builds on a standard Transformer architecture with sliding-window attention, making it easy to deploy [6].

Group 2: Continuous Learning Approach
- TTT-E2E recasts long-text modeling as a "continuous learning" task: the model predicts the next word from the current context and updates its parameters through gradient descent [6].
- The training phase uses meta-learning to prepare the model for this "test-time learning" mode, ensuring quick adaptation during testing [6].

Group 3: Key Optimizations
- TTT-E2E incorporates three key optimizations: mini-batch updates combined with sliding-window attention, a targeted update strategy that touches only the MLP layer, and a dual-MLP design that preserves pre-trained knowledge while absorbing new context [8][9].
- Performance is strong: on a 3B-parameter model at 128K context lengths, test loss is comparable or superior to a full-attention Transformer, while models such as Mamba 2 and Gated DeltaNet show significant performance drops on long texts [9].

Group 4: Performance and Limitations
- TTT-E2E maintains consistent inference latency regardless of context length, providing a uniformly fast response for both 8K and 128K texts [13].
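The "continuous learning" idea above can be made concrete with a toy sketch: instead of caching past tokens, the model absorbs context into its own weights by taking gradient steps on a next-token prediction loss as it reads. This is a minimal illustration of test-time training in general, not Nvidia's TTT-E2E implementation; the 1-D linear predictor, window size, and learning rate are all invented stand-ins (the linear model plays the role of the fast "context MLP", and each `window`-sized chunk plays the role of a mini-batch update).

```python
# Toy test-time-training loop: compress a stream into weights via SGD.
# Hypothetical stand-in for the real method: a 1-D linear predictor.

def predict(w, b, x):
    return w * x + b

def sgd_step(w, b, x, target, lr):
    """One gradient step on squared next-step prediction error."""
    err = predict(w, b, x) - target
    # d(err^2)/dw = 2*err*x ; d(err^2)/db = 2*err
    return w - lr * 2 * err * x, b - lr * 2 * err

def ttt_read(stream, window=4, lr=0.01):
    """Scan a stream; after every `window` tokens (a mini-batch),
    record the chunk's mean loss. Weights are updated online, so
    later chunks should be predicted better than earlier ones."""
    w, b = 0.0, 0.0
    losses, chunk_loss, count = [], 0.0, 0
    for i in range(len(stream) - 1):
        x, target = stream[i], stream[i + 1]
        chunk_loss += (predict(w, b, x) - target) ** 2
        count += 1
        w, b = sgd_step(w, b, x, target, lr)
        if count == window:
            losses.append(chunk_loss / count)
            chunk_loss, count = 0.0, 0
    return losses, (w, b)

# A stream generated by a fixed rule (next = 0.5*x + 1): the predictor
# "learns the context" on the fly, so per-chunk loss drops over time.
stream = [1.0]
for _ in range(40):
    stream.append(0.5 * stream[-1] + 1.0)

losses, _ = ttt_read(stream)
print(losses[0] > losses[-1])  # prints True
```

Note the key property the article highlights: the state carried across the stream is just the (fixed-size) weights, not a growing cache, which is why inference cost stays flat as context grows.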
- However, it struggles with tasks requiring precise recall of details: its memory compression may filter out seemingly irrelevant information, whereas full-attention models can recall it with minimal loss [13].
- The meta-learning process during training is currently slower than standard pre-training due to the extra gradient calculations [13].

Group 5: Research and Development
- The project is led by Yu Sun, a postdoctoral researcher at Stanford who has been developing the "test-time training" concept since 2019; TTT-E2E grew out of an idea he proposed early on [15].
