2x Faster Than Human Experts: Stanford and Nvidia Release TTT-Discover, Using "Test-Time Reinforcement Learning" to Tackle Hard Scientific Problems
Nvidia (US:NVDA) · 机器之心 (Synced) · 2026-01-28 04:59

Core Viewpoint
- The article introduces "Test-Time Training to Discover" (TTT-Discover), a method that enhances large language models (LLMs) by letting them keep learning during the testing phase rather than merely searching for solutions [4][8].

Summary by Sections

Introduction to AI and Problem Solving
- The industry is exploring how to use AI to discover optimal solutions to scientific problems; a common approach is "test-time search" with a frozen LLM [1].
- Although such prompting can improve an LLM's previous solutions, it does not produce genuine learning or internalization of new concepts [2].

Learning vs. Searching
- The article argues that true progress in LLMs comes from learning rather than searching, especially on complex problems such as Go and protein folding, where learning has historically outperformed search [3].

TTT-Discover Methodology
- TTT-Discover applies reinforcement learning (RL) during the testing phase, so the LLM continues training while solving a specific problem [4].
- The method aims to produce a single high-quality solution rather than many average ones, a departure from the standard RL objective of maximizing expected reward [6][13].

Results and Achievements
- TTT-Discover posted strong results across a range of tasks, outperforming DeepMind's AlphaEvolve and achieving breakthroughs in mathematical problems and GPU kernel development [7][22].
- On GPU kernel optimization tasks, it was 50% faster than the best human submissions [22].

Performance Evaluation
- TTT-Discover was evaluated in four distinct fields (mathematics, GPU kernel engineering, algorithm design, and biology), with its performance compared against human experts [19].
- On the Erdős minimum overlap problem, TTT-Discover achieved a score of 0.380876, improving on the previous best AI score of 0.380924 (lower is better for this problem) [20].
Technical Innovations
- TTT-Discover incorporates an entropy objective function and a state-reuse strategy inspired by PUCT to prioritize the discovery of the highest-reward solutions [14][15].
- The combination of these components lets TTT-Discover focus on the most promising solution paths while maintaining diversity in its search [17].

Future Directions
- Despite its significant results, the team acknowledges that the current method is limited to problems with continuous rewards; future work aims to address sparse or binary reward scenarios [26].
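A PUCT-inspired state-reuse rule could look roughly like the following sketch. The `(reward, visit_count)` state representation and this particular score formula are my simplifications for illustration, not the paper's exact formulation, and the entropy objective mentioned above is not modeled here; the point is only the PUCT-style trade-off between reusing high-reward states and revisiting under-explored ones.

```python
import math

def puct_select(states, c=1.0):
    """Pick which previously visited state to branch from next.

    states: list of (reward, visit_count) pairs for past solution states
            (a hypothetical simplification of the real state pool).
    c:      exploration constant; higher values favor rarely visited states.
    Returns the index of the selected state.
    """
    total_visits = sum(v for _, v in states) or 1
    def score(state):
        reward, visits = state
        # PUCT-style score: exploit high reward, but add an exploration
        # bonus that shrinks as a state is visited more often.
        return reward + c * math.sqrt(total_visits) / (1 + visits)
    return max(range(len(states)), key=lambda i: score(states[i]))
```

With a large exploration constant the rule revisits a barely explored state even if its reward is modest; with `c=0` it degenerates to pure exploitation of the highest-reward state, which is where search diversity would be lost.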
