Core Insights
- The article discusses a new approach called Test-Time Training to Discover (TTT-Discover), which aims to solve open scientific problems by applying reinforcement learning at test time [1][2].

Group 1: Methodology
- TTT-Discover is built on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) performance across multiple domains, outperforming human experts and closed-source models [3].
- Unlike traditional methods that rely on test-time scaling through prompt scheduling, TTT-Discover updates the model's weights during testing so it can learn from the specific problem at hand [4][5].
- This "test-time training" lets the model gain real-time experience from failed attempts, driving a directed evolution of its capabilities [6].

Group 2: Learning Objectives
- TTT-Discover employs an entropic objective that maximizes the reward of the best attempts rather than the average reward across all attempts, targeting a single optimal solution instead of many mediocre ones [9][10][11].
- The method introduces a reuse mechanism inspired by PUCT, maintaining historical attempts in a buffer so that the most promising states are prioritized while exploration remains balanced [12].

Group 3: Implementation and Results
- The model builds a "private dataset" by continuously generating actions and receiving feedback, addressing the out-of-distribution (OOD) problem by creating data specific to the problem at hand [13][14].
- This contrasts with traditional test-time search methods, which do not update model weights and therefore do not enhance the model's underlying capabilities [15][16].
- The algorithm cycles through selecting promising candidate solutions, generating new attempts, and evaluating the results, updating the model's weights after each iteration to improve performance [17][18][27].
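The entropic objective described above can be illustrated with a log-sum-exp aggregation of attempt rewards, which smoothly interpolates between the mean reward and the best reward. This is a hypothetical sketch of the idea, not the paper's exact formulation; the function name and the temperature parameter `beta` are assumptions for illustration.

```python
import math

def entropic_objective(rewards, beta):
    """Log-sum-exp ("entropic") aggregation of attempt rewards.

    As beta -> 0 this approaches the mean reward; as beta grows it
    approaches the max, so optimization pressure concentrates on the
    best attempts rather than the average one. (Illustrative only;
    the paper's exact objective may differ.)
    """
    m = max(rewards)  # subtract the max for numerical stability
    return m + (1.0 / beta) * math.log(
        sum(math.exp(beta * (r - m)) for r in rewards) / len(rewards)
    )

rewards = [0.1, 0.2, 0.9]
print(entropic_objective(rewards, beta=0.01))   # close to the mean (0.4)
print(entropic_objective(rewards, beta=100.0))  # close to the max (0.9)
```

With a small `beta` every attempt contributes equally, as in standard average-reward RL; with a large `beta` only the top-scoring attempts matter, matching the article's "one optimal solution beats many mediocre ones" framing.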
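The select/generate/evaluate/update cycle and the PUCT-inspired buffer can be sketched as follows. This is a minimal mock-up under stated assumptions: the class and function names, the exact PUCT score, and the callable interfaces (`generate`, `evaluate`, `update_weights`) are hypothetical stand-ins, not the authors' implementation.

```python
import math

class AttemptBuffer:
    """Buffer of historical attempts, selected with a PUCT-style score:
    a value term plus an exploration bonus that shrinks as an entry is
    revisited. (Illustrative sketch of the reuse mechanism.)"""

    def __init__(self, c_explore=1.0):
        self.entries = []  # each entry: {"state", "reward", "visits"}
        self.c_explore = c_explore

    def add(self, state, reward):
        self.entries.append({"state": state, "reward": reward, "visits": 1})

    def select(self):
        total = sum(e["visits"] for e in self.entries)
        def score(e):
            # exploit high reward, but keep revisiting rare states
            return e["reward"] + self.c_explore * math.sqrt(total) / (1 + e["visits"])
        best = max(self.entries, key=score)
        best["visits"] += 1
        return best["state"]

def ttt_discover_step(buffer, generate, evaluate, update_weights):
    """One iteration of the loop described above: select a promising
    past state, generate a new attempt from it, score it, store it,
    then update the model weights on the outcome."""
    base = buffer.select()
    attempt = generate(base)      # e.g. sample a revised solution
    reward = evaluate(attempt)    # continuous task reward
    buffer.add(attempt, reward)   # grow the "private dataset"
    update_weights(attempt, reward)
    return reward
```

In a real system `generate` would sample from the policy model, `evaluate` would run the task's scorer (e.g. a kernel benchmark), and `update_weights` would take an RL gradient step, which is what distinguishes this loop from weight-frozen test-time search.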
Group 4: Performance Metrics
- In experiments, TTT-Discover achieved roughly a 2× speedup over the best human implementations on kernel engineering tasks [27].
- The testing cost for a single problem is estimated at a few hundred dollars, highlighting the cost-effectiveness of the approach [27].

Group 5: Future Directions
- TTT-Discover currently targets continuous-reward scenarios; extending it to sparse, binary, and unverifiable reward problems is left to future work [29].
Headline: Stanford and NVIDIA introduce test-time reinforcement learning: a fine-tuned open-source model beats top closed-source models for only a few hundred dollars