Test-Time Reinforcement Learning
2x Faster Than Human Experts: Stanford and NVIDIA Release TTT-Discover, Using "Test-Time Reinforcement Learning" to Crack Hard Scientific Problems
36Kr · 2026-01-28 08:01
Core Insights
- The article discusses the innovative approach of "Test-Time Training to Discover" (TTT-Discover), which allows large language models (LLMs) to improve their performance on specific tasks through reinforcement learning during the testing phase, rather than relying solely on pre-trained knowledge [2][3][4].

Group 1: Methodology
- TTT-Discover defines a single test problem as an environment in which standard reinforcement learning techniques can be applied, focusing on producing a single optimal solution rather than averaging over multiple solutions [3][9].
- The method incorporates two key components: an entropy objective function that biases sampling toward high-reward attempts, and a state reuse strategy inspired by PUCT that prioritizes the most promising paths during the search process [9][10].

Group 2: Results and Achievements
- TTT-Discover has shown significant success across various tasks, outperforming existing systems such as DeepMind's AlphaEvolve and setting new records on mathematical problems and in GPU kernel optimization [3][11][13].
- On the Erdős minimum overlap problem (lower is better), TTT-Discover achieved a score of 0.380876, surpassing the previous human best of 0.380927 and the AI best of 0.380924 [11][12].
- The TriMul kernel developed with TTT-Discover was 50% faster than the best human submission on A100 GPUs, with an overall performance improvement of over 15% across all GPU types compared to the human best results [13][14].

Group 3: Future Directions
- The team acknowledges that TTT-Discover is currently limited to problems with continuous rewards and aims to extend its applicability to areas with sparse or binary rewards, such as mathematical proofs and scientific hypotheses [17].
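The "entropy objective" described above can be made concrete with a standard risk-seeking (log-sum-exp) construction. The summary does not give the paper's exact formula, so the following is an illustrative sketch, with the temperature β an assumed parameter:

```latex
% Standard RL maximizes the average reward under the policy \pi_\theta:
J_{\text{mean}}(\theta) = \mathbb{E}_{a \sim \pi_\theta}\big[\, r(a) \,\big]
% A risk-seeking "entropic" objective instead takes a log-sum-exp with
% temperature \beta > 0, which upweights high-reward samples:
\qquad
J_{\beta}(\theta) = \frac{1}{\beta} \log \mathbb{E}_{a \sim \pi_\theta}\!\left[ e^{\beta\, r(a)} \right]
```

As β → ∞, J_β approaches max_a r(a): the objective then cares only about producing one excellent solution rather than many average ones, matching the single-best-solution goal the summary describes.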
2x Faster Than Human Experts: Stanford and NVIDIA Release TTT-Discover, Using "Test-Time Reinforcement Learning" to Crack Hard Scientific Problems
Jiqizhixin (Machine Heart) · 2026-01-28 04:59
Core Viewpoint
- The article discusses a new method called "Test-Time Training to Discover" (TTT-Discover) that enhances the capabilities of large language models (LLMs) by allowing them to keep learning during the testing phase, rather than merely searching for solutions [4][8].

Summary by Sections

Introduction to AI and Problem Solving
- The industry is exploring how to leverage AI to discover optimal solutions to scientific problems; a common approach is "test-time search" with frozen LLMs [1].
- While such prompting can improve an LLM's previous solutions, it does not lead to genuine learning or internalization of new concepts [2].

Learning vs. Searching
- The article emphasizes that true progress comes from learning rather than searching, especially for complex problems like Go and protein folding, where learning has historically outperformed search [3].

TTT-Discover Methodology
- TTT-Discover applies reinforcement learning (RL) during the testing phase, allowing the LLM to train continuously while solving a specific problem [4].
- The method focuses on producing a single high-quality solution rather than many average ones, a departure from the standard RL objective [6][13].

Results and Achievements
- TTT-Discover has shown impressive results across various tasks, outperforming DeepMind's AlphaEvolve and achieving breakthroughs on mathematical problems and in GPU kernel development [7][22].
- The method demonstrated a 50% speed improvement over the best human submissions in GPU kernel optimization tasks [22].

Performance Evaluation
- TTT-Discover was evaluated in four distinct fields: mathematics, GPU kernel engineering, algorithm design, and biology, with its performance compared against human experts [19].
- On the Erdős minimum overlap problem (lower is better), TTT-Discover achieved a score of 0.380876, surpassing the previous best AI score of 0.380924 [20].
Technical Innovations
- TTT-Discover incorporates an entropy objective function and a PUCT-inspired state reuse strategy to prioritize the discovery of the highest-reward solutions [14][15].
- The combination of these components lets TTT-Discover focus on the most promising solution paths while maintaining diversity in its search [17].

Future Directions
- While TTT-Discover has achieved significant results, the team acknowledges that its current form is limited to problems with continuous rewards; future work aims to address sparse or binary reward scenarios [26].
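The summaries say the state reuse strategy is "inspired by PUCT" but do not give the exact selection rule, so the sketch below shows a generic PUCT-style score over a buffer of past attempts. All function and field names (`puct_score`, `select_state`, the `(q_value, prior, visits)` tuples) are illustrative, not the paper's API:

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Classic PUCT score: exploitation (q_value) plus an exploration
    bonus that shrinks as a state accumulates visits."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_state(buffer, c_puct=1.5):
    """Pick the buffered attempt with the highest PUCT score.

    Each buffer entry is (q_value, prior, visits): its reward estimate,
    a prior preference, and how often it has been expanded already.
    """
    total_visits = sum(visits for _, _, visits in buffer)
    return max(range(len(buffer)),
               key=lambda i: puct_score(buffer[i][0], buffer[i][1],
                                        total_visits, buffer[i][2], c_puct))
```

With equal priors, a slightly worse but rarely expanded attempt can outscore a well-explored one, which is how this kind of rule keeps diversity in the search while still favoring high-reward states.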
Stanford and NVIDIA Unveil Test-Time Reinforcement Learning: Fine-Tuned Open-Source Model Beats Top Closed-Source Models for Just a Few Hundred Dollars
36Kr · 2026-01-27 09:17
Core Insights
- The article discusses a new approach called Test-Time Training to Discover (TTT-Discover), developed by researchers from Stanford and NVIDIA, aimed at solving open scientific problems through real-time learning during the testing phase [1][2].

Group 1: Methodology
- TTT-Discover is based on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) results across multiple fields, outperforming human experts and closed-source models [2].
- Unlike traditional methods that rely on test-time scaling and prompt scheduling, TTT-Discover employs reinforcement learning (RL) to update model weights during testing, allowing the model to learn from failures in real time [2][5].
- The approach introduces an entropy objective function focused on generating a single optimal solution rather than multiple mediocre ones, in contrast with traditional RL, which maximizes average reward [3][7].

Group 2: Implementation
- TTT-Discover maintains a buffer of historical attempts, prioritizing expansion of the most promising states while balancing exploration and exploitation [4].
- The model generates actions based on feedback from the environment, creating a "private dataset" for the specific problem and thereby addressing the out-of-distribution (OOD) challenge [5].
- The algorithm selects the most promising existing solutions as starting points, generates new attempts, evaluates the results, and updates the model weights accordingly [8].

Group 3: Performance and Applications
- In experiments, TTT-Discover was approximately twice as fast as the best human implementations in kernel engineering tasks [10].
- The method is particularly effective in continuous (verifiable) reward scenarios; future work aims to extend it to sparse rewards, binary rewards, and unverifiable domains [10].
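The select → generate → evaluate → update cycle described above can be sketched as a minimal loop. The paper's actual training step operates on LLM weights; here `generate`, `evaluate`, and `update` are hypothetical stand-ins for the sampler, the problem's reward function, and the RL weight update, so this is a structural sketch, not the authors' implementation:

```python
import random

def ttt_discover_loop(generate, evaluate, update, init_solution, steps=10, seed=0):
    """Minimal sketch of the TTT-style cycle: keep a buffer of attempts,
    restart from the best one, score each new attempt, and feed the
    (attempt, reward) pair into a learning update."""
    rng = random.Random(seed)
    buffer = [(evaluate(init_solution), init_solution)]
    for _ in range(steps):
        # 1. Select the most promising attempt so far as the starting point.
        _, state = max(buffer, key=lambda e: e[0])
        # 2. Generate a new attempt conditioned on that state.
        attempt = generate(state, rng)
        # 3. Evaluate it against the environment (the test problem itself).
        reward = evaluate(attempt)
        buffer.append((reward, attempt))
        # 4. Update model weights from the observed (attempt, reward) pair.
        update(attempt, reward)
    return max(buffer, key=lambda e: e[0])  # best (reward, solution) found
```

On a toy 1-D reward such as r(x) = -(x - 3)², with `generate` adding small random perturbations, this loop behaves like hill climbing: the buffer's best solution steadily approaches x = 3 while every attempt also becomes training signal.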
Group 4: Authors and Research Background
- The primary authors of the paper are Mert Yuksekgonul and Daniel Koceja; Yuksekgonul is pursuing a PhD at Stanford University [11][13].
- Yu Sun, the corresponding author, is a postdoctoral researcher at Stanford and a researcher at NVIDIA, and has focused on continual learning and test-time training since 2019 [14][16].
Stanford and NVIDIA Unveil Test-Time Reinforcement Learning: Fine-Tuned Open-Source Model Beats Top Closed-Source Models for Just a Few Hundred Dollars
QbitAI (量子位) · 2026-01-27 02:33
Core Insights
- The article discusses a new approach called Test-Time Training to Discover (TTT-Discover), which aims to solve open scientific problems by incorporating reinforcement learning during the testing phase of model evaluation [1][2].

Group 1: Methodology
- TTT-Discover is based on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) performance across multiple domains, outperforming human experts and closed-source models [3].
- Unlike traditional methods that rely on test-time scaling through prompt scheduling, TTT-Discover updates model weights during the testing phase to learn from the specific problem [4][5].
- This "test-time training" allows the model to gain real-time experience from failed attempts, leading to a directed evolution of its capabilities [6].

Group 2: Learning Objectives
- TTT-Discover employs an entropic objective that maximizes the reward of the best actions rather than the average reward across all attempts, aiming for a single optimal solution instead of multiple mediocre ones [9][10][11].
- The method introduces a PUCT-inspired reuse mechanism, maintaining historical attempts in a buffer and prioritizing the most promising states while balancing exploration [12].

Group 3: Implementation and Results
- The model builds a "private dataset" through continuous action generation and feedback, addressing the out-of-distribution (OOD) problem by creating data specific to the problem at hand [13][14].
- This contrasts with traditional test-time search, which does not update model weights and thus cannot enhance the model's capabilities [15][16].
- The algorithm cycles through selecting promising existing solutions, generating new attempts, and evaluating results, updating the model's weights after each iteration to improve performance [17][18][27].
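One simple way to realize an entropic objective of this kind in practice is to weight sampled attempts by the exponential of their reward, so that gradient signal concentrates on the best attempt as the temperature grows. The summary does not specify the paper's exact weighting, so the helper below (`entropic_weights`, with temperature `beta`) is an illustrative numeric sketch:

```python
import math

def entropic_weights(rewards, beta):
    """Softmax-style weights exp(beta * r) / Z over sampled rewards.

    At beta = 0 every attempt is weighted equally (the average-reward
    regime); as beta grows, nearly all weight mass concentrates on the
    single highest-reward attempt, approximating a best-solution-only
    objective.
    """
    m = max(rewards)  # subtract the max before exponentiating, for stability
    exps = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]
```

For rewards [0.1, 0.5, 0.9], beta = 0 gives uniform weights of 1/3 each, while beta = 20 puts over 99% of the weight on the 0.9 attempt, which is the "one excellent solution beats several mediocre ones" behavior described above.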
Group 4: Performance Metrics
- In experiments, TTT-Discover was approximately 2x faster than the best human implementations in kernel engineering tasks [27].
- The testing cost for a single problem is estimated at a few hundred dollars, showcasing the efficiency of the approach [27].

Group 5: Future Directions
- TTT-Discover is primarily applicable to continuous-reward scenarios; future work is needed to extend it to sparse, binary, and unverifiable reward problems [29].