Test-Time Training
TTCS, the first test-time co-evolutionary synthesis framework: breaking through reasoning bottlenecks via "self-play"
机器之心· 2026-02-10 08:52
Core Insights
- The article discusses the Test-Time Curriculum Synthesis (TTCS) framework, which addresses a key challenge in Test-Time Training (TTT) by generating curriculum data aligned with the model's capability frontier, thus enhancing performance on difficult test problems [2][10][30]

Group 1: Motivation and Background
- A core motivation is the field's shift from merely expanding the parameter counts of large language models (LLMs) toward leveraging Test-Time Scaling for effective training [5]
- Existing TTT methods struggle on high-difficulty test questions because noisy pseudo-labels lead to ineffective learning [2][7]

Group 2: Methodology
- TTCS operates through a co-evolutionary framework with two agents: a Synthesizer, which generates questions at the model's capability frontier, and a Solver, which attempts to solve them [11][14]
- A capability-adaptive reward mechanism ensures that generated questions are neither too easy nor too difficult, sustaining a dynamic learning environment [16]

Group 3: Experimental Results
- TTCS delivered significant improvements in mathematical reasoning: Qwen2.5-Math-1.5B rose from an average score of 17.30 to 41.49, a gain of +24.19 [3][20]
- On challenging AIME competition problems, TTCS outperformed strong baselines such as TTRL, demonstrating its effectiveness on high-difficulty questions [22][23]

Group 4: Broader Implications
- The framework not only improves performance in mathematics but also generalizes across other reasoning tasks, indicating that the model learns broadly applicable reasoning logic rather than overfitting [22]
- The findings suggest that adaptive teaching (a dynamically updated Synthesizer) is more effective than a static high-capability teacher model, underscoring the importance of tailored learning experiences [25][26]

Group 5: Conclusion and Future Outlook
- TTCS reconstructs the Test-Time Computing paradigm, positioning models as active curriculum designers rather than passive problem solvers [30]
- The framework addresses critical issues of data scarcity and difficulty gaps in test-time training, paving the way for future self-evolving agents capable of continuous evolution in unknown environments [30]
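The capability-adaptive reward mentioned above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's actual formula: the function names (`synthesizer_reward`, `estimate_pass_rate`), the linear reward shape, and the 50% target pass rate are all hypothetical choices; the idea is only that the Synthesizer is rewarded most for questions the Solver succeeds on about half the time, i.e. questions at the capability frontier.

```python
# Illustrative capability-adaptive reward for a Synthesizer/Solver loop.
# Assumption: the actual TTCS reward may differ; here the reward peaks
# when the Solver's empirical pass rate sits near a target frontier
# (~50% solve rate) and falls off linearly for too-easy or too-hard items.

def synthesizer_reward(pass_rate: float, target: float = 0.5) -> float:
    """Return 1.0 when pass_rate == target, decaying linearly to 0.0
    at pass_rate = 0 (too hard) or pass_rate = 1 (too easy)."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return 1.0 - abs(pass_rate - target) / max(target, 1.0 - target)

def estimate_pass_rate(solver, question, n_samples: int = 8) -> float:
    """Monte-Carlo estimate: fraction of sampled Solver attempts that
    succeed (solver is any callable returning True/False)."""
    return sum(bool(solver(question)) for _ in range(n_samples)) / n_samples
```

Under this shape, a question every Solver sample fails (pass rate 0) earns the Synthesizer nothing, which is the mechanism's way of discouraging unanswerable questions whose pseudo-labels would be pure noise.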
No extra cache needed: Nvidia open-sources a memory compression scheme for large models, 2.7x faster at 128K context
36Kr· 2026-01-14 08:22
Core Viewpoint
- Nvidia, in collaboration with several academic institutions, has introduced the TTT-E2E method to enhance the memory capabilities of large models, achieving significant speed improvements on long texts [1][3]

Group 1: TTT-E2E Method Overview
- TTT-E2E processes 128K-token texts 2.7 times faster than full-attention models and achieves a 35-fold speedup at 2M context, without compromising performance [1]
- Unlike the recently popular DeepSeek memory module, TTT-E2E learns dynamically by compressing context, allowing the model to remain in a learning state during testing [3][6]
- The method is built on a standard Transformer architecture with sliding-window attention, making it easy to deploy [6]

Group 2: Continuous Learning Approach
- TTT-E2E recasts long-text modeling as a "continuous learning" task: the model predicts the next word from the current context and updates its parameters through gradient descent [6]
- The training phase uses meta-learning to prepare the model for a "test-time learning" mode, ensuring quick adaptation at test time [6]

Group 3: Key Optimizations
- TTT-E2E combines three key optimizations: mini-batch processing with sliding-window attention, a precise update strategy focused on the MLP layers, and a dual-MLP design that stores pre-trained knowledge while absorbing new context [8][9]
- On a 3B-parameter model at 128K context length, test loss is comparable or superior to full-attention Transformers, while models such as Mamba 2 and Gated DeltaNet show significant performance drops in long-text scenarios [9]

Group 4: Performance and Limitations
- TTT-E2E's inference latency is constant regardless of context length, providing a uniformly fast response for both 8K and 128K texts [13]
- However, it struggles on tasks requiring precise detail recall, as its memory compression may filter out seemingly irrelevant details, unlike full-attention models, which recall information with minimal loss [13]
- The meta-learning process during training is currently slower than standard pre-training due to the extra gradient computations [13]

Group 5: Research and Development
- The project is led by Yu Sun, a postdoctoral researcher at Stanford, who has been developing the "test-time training" concept since 2019; TTT-E2E builds on an idea he proposed early on [15]
No extra cache needed! Nvidia open-sources a memory compression scheme for large models, 2.7x faster at 128K context
量子位· 2026-01-14 04:42
Core Viewpoint
- Nvidia has introduced the TTT-E2E method in collaboration with several research institutions to enhance the memory capabilities of large models, significantly improving processing speed and efficiency on long texts [1][2]

Group 1: TTT-E2E Method Overview
- TTT-E2E processes 128K-token texts 2.7 times faster than full-attention models and achieves a 35-fold speedup at 2M context, without compromising performance [3]
- Unlike the recently popular DeepSeek memory module, TTT-E2E learns dynamically through context compression rather than along static learning paths [5][6]
- The method learns in real time, compressing key content into the model's weights so the model remains in a learning state during testing [7][8]

Group 2: Technical Implementation
- TTT-E2E is based on a standard Transformer with sliding-window attention, making it easy to deploy without relying on complex architectures [11]
- The core idea shifts long-text modeling from an architecture-design problem to a "continuous learning" task [12]
- During testing, the model predicts the next word from the current context and updates its parameters through gradient descent, dynamically compressing information into its weights [13]

Group 3: Training and Optimization
- The training phase uses meta-learning to prepare the model for "test-time learning," treating each training sequence as if it were a test sequence [14]
- TTT-E2E combines three key optimizations: mini-batch processing with sliding windows, precise update strategies limited to specific layers, and a dual-MLP design that balances absorbing new context against preserving pre-trained knowledge [16][17]

Group 4: Performance and Limitations
- Experimental data show TTT-E2E matches or beats full-attention Transformers on test loss, while keeping inference latency constant regardless of context length [19][23]
- On tasks requiring precise detail recall, TTT-E2E underperforms full-attention models because its memory compression filters out seemingly irrelevant details [25][26]
- The meta-learning process in the training phase is currently slower than standard pre-training methods [27]

Group 5: Research and Development
- The project is led by Yu Sun, a postdoctoral researcher at Stanford, whose goal is to enable AI systems to learn continuously like humans [29][30]
- The code and related papers for TTT-E2E have been fully open-sourced [28]
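The dual-MLP idea mentioned among the optimizations, one branch preserving pre-trained knowledge while another absorbs new context, can be illustrated with a minimal sketch. Everything here is an assumed simplification: scalar weights stand in for full MLP branches, and the class and method names (`DualBranch`, `test_time_step`) are hypothetical, not TTT-E2E's actual interfaces.

```python
# Hedged sketch of a dual-branch design: a frozen branch holds
# pre-trained knowledge and is never touched at test time, while a
# plastic branch is updated by gradient descent to absorb the current
# context; the model's output is the sum of both branches.

class DualBranch:
    def __init__(self, frozen_w: float):
        self.frozen_w = frozen_w   # pre-trained knowledge, never updated
        self.plastic_w = 0.0       # absorbs test-time context

    def forward(self, x: float) -> float:
        return self.frozen_w * x + self.plastic_w * x

    def test_time_step(self, x: float, target: float, lr: float = 0.1) -> float:
        """One SGD step on squared error, updating ONLY the plastic
        branch; returns the pre-update loss."""
        err = self.forward(x) - target
        self.plastic_w -= lr * 2.0 * err * x   # frozen_w is left untouched
        return err * err
```

Keeping one branch frozen is what prevents test-time updates from overwriting pre-trained knowledge (catastrophic forgetting), while the plastic branch gives the gradient steps somewhere to store the compressed context.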