Test-Time Scaling
TTCS, the First Test-Time Co-Evolutionary Synthesis Framework: Breaking the Reasoning Bottleneck via "Self-Play"
机器之心· 2026-02-10 08:52
Core Insights
- The article introduces the Test-Time Curriculum Synthesis (TTCS) framework, which addresses challenges in Test-Time Training (TTT) by generating curriculum data aligned with the model's capability frontier, improving performance on difficult test problems [2][10][30]

Group 1: Motivation and Background
- The core motivation is the field's shift from merely expanding the parameters of large language models (LLMs) toward leveraging Test-Time Scaling for effective training [5]
- Existing TTT methods struggle on high-difficulty test questions because noisy pseudo-labels lead to ineffective learning [2][7]

Group 2: Methodology
- TTCS operates as a co-evolutionary framework with two agents: a Synthesizer, which generates questions at the model's capability frontier, and a Solver, which attempts to solve them [11][14]
- A capability-adaptive reward mechanism keeps the generated questions neither too easy nor too difficult, sustaining a dynamic learning environment [16]

Group 3: Experimental Results
- TTCS delivered significant gains in mathematical reasoning: Qwen2.5-Math-1.5B improved from an average score of 17.30 to 41.49, an increase of +24.19 [3][20]
- On challenging AIME competition problems, TTCS outperformed strong baselines such as TTRL, demonstrating its effectiveness on high-difficulty questions [22][23]

Group 4: Broader Implications
- Beyond mathematics, the framework generalizes across other reasoning tasks, indicating that the model learns universal reasoning logic rather than overfitting [22]
- The findings suggest that adaptive teaching (a dynamic Synthesizer) is more effective than a static, higher-capability teacher model, underscoring the importance of tailored learning experiences [25][26]

Group 5: Conclusion and Future Outlook
- TTCS reconstructs the Test-Time Computing paradigm, positioning models as active curriculum designers rather than passive problem solvers [30]
- The framework addresses the critical issues of data scarcity and difficulty gaps in test-time training, paving the way for future self-evolving agents capable of continuous evolution in unknown environments [30]
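The capability-adaptive reward described above can be sketched with a simple shaping function. The article does not give the exact formula, so the following is an illustrative assumption: score each synthesized question by the Solver's empirical success rate over several attempts, and reward the Synthesizer most when that rate sits near 50%, i.e. at the capability frontier.

```python
# Illustrative sketch of a capability-adaptive reward (an assumption,
# not the paper's exact formula). Questions the Solver answers correctly
# about half the time earn the highest reward; trivially easy (p ~ 1.0)
# or hopelessly hard (p ~ 0.0) questions earn little or none.

def solver_success_rate(correct: int, attempts: int) -> float:
    """Empirical probability that the Solver answers the question correctly."""
    if attempts <= 0:
        raise ValueError("need at least one attempt")
    return correct / attempts

def capability_adaptive_reward(p: float) -> float:
    """Peaks at p = 0.5 and falls linearly to 0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

# Example: 5 correct out of 10 attempts sits on the frontier.
print(capability_adaptive_reward(solver_success_rate(5, 10)))   # 1.0
print(capability_adaptive_reward(solver_success_rate(10, 10)))  # 0.0 (too easy)
print(capability_adaptive_reward(solver_success_rate(0, 10)))   # 0.0 (too hard)
```

Any reward peaking at an intermediate success rate would serve the same purpose; the linear tent shape here is chosen only for clarity.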
On "Humanity's Last Exam," a Chinese Model Beat GPT-5
21st Century Business Herald (21世纪经济报道)· 2025-11-15 08:01
Core Insights
- Moonshot AI (月之暗面) introduced the Kimi K2 Thinking model, which outperformed GPT-5 on several benchmark tests, generating significant interest in the global AI community [1][2]

Model Performance
- Kimi K2 Thinking is described as the strongest open-source thinking model to date, achieving state-of-the-art (SOTA) results across multiple tests, including 44.9% on Humanity's Last Exam (HLE) versus GPT-5's 41.7% [2]
- The model scored 60.2% on the BrowseComp benchmark and 56.3% on the SEAL-0 test, both surpassing GPT-5 [2]
- Kimi K2 Thinking can autonomously execute up to 300 consecutive tool invocations, showcasing its advanced agentic reasoning capabilities [2][3]

Technical Innovations
- The model follows an interleaved "thinking-tool-thinking-tool" execution pattern, which is relatively novel among large language models [4]
- The team used end-to-end reinforcement learning to keep performance stable across long tool-invocation sequences [4]
- Kimi K2 Thinking incorporates native INT4 quantization, roughly doubling generation speed [7]

Cost and Resource Management
- The team operates on limited computing resources, using H800 GPU clusters, and has optimized performance to extract maximum capability from each GPU [5][6]
- The actual training cost is difficult to quantify; the previously circulated figure of $4.6 million is not an official number [6]

Market Position and Strategy
- Moonshot AI's open-source strategy has raised international recognition of Chinese AI models, particularly after Chinese IPs were banned from accessing certain overseas models [7][8]
- Kimi K2's API pricing is significantly lower than competitors', strengthening its competitive position in the market [7]

Future Developments
- The company plans a next-generation K3 model featuring significant architectural changes, including the experimental KDA (Kimi Delta Attention) module [10]
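The INT4 quantization mentioned above works by storing each weight as a signed 4-bit integer plus a shared floating-point scale, shrinking weight memory and thus speeding up memory-bound decoding; the ~2x speedup is the article's claim. The following is a minimal sketch of generic symmetric per-tensor INT4 quantization, an illustrative assumption rather than Kimi's actual scheme:

```python
# Minimal sketch of symmetric per-tensor INT4 weight quantization.
# Each weight maps to a signed 4-bit code in [-8, 7]; one float scale
# per tensor recovers approximate values at inference time.

def quantize_int4(weights):
    """Map float weights to 4-bit integer codes plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]

w = [0.9, -0.35, 0.05, -0.7]
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Every code fits in 4 bits; per-weight error is bounded by scale / 2.
assert all(-8 <= c <= 7 for c in q)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

Production INT4 schemes typically use per-group or per-channel scales and quantization-aware calibration or training ("native" INT4) to limit accuracy loss; this sketch shows only the storage-format idea.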