Stanford and NVIDIA Introduce Test-Time Reinforcement Learning: A Fine-Tuned Open-Source Model Beats Top Closed-Source Models for Just a Few Hundred Dollars
Nvidia (US:NVDA) · 36Kr · 2026-01-27 09:17

Core Insights
- The article discusses a new approach called Test-Time Training to Discover (TTT-Discover), developed by researchers from Stanford and NVIDIA, aimed at solving open scientific problems through real-time learning during the test phase [1][2].

Group 1: Methodology
- TTT-Discover is built on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) results across multiple fields, outperforming human experts and closed-source models [2].
- Unlike traditional methods that rely on test-time scaling and prompt scheduling, TTT-Discover uses reinforcement learning (RL) to update the model's weights during testing, allowing the model to learn from its failures in real time [2][5].
- The approach introduces an entropy objective that focuses on producing a single optimal solution rather than many mediocre ones, in contrast to traditional RL, which maximizes the average reward [3][7].

Group 2: Implementation
- TTT-Discover maintains a buffer of historical attempts, prioritizing expansion of the most promising states while balancing exploration and exploitation [4].
- The model generates actions based on feedback from the environment, building a "private dataset" for each specific problem and thereby addressing the out-of-distribution (OOD) challenge [5].
- The algorithm selects the most promising existing solutions as starting points, generates new attempts, evaluates the results, and updates the model's weights accordingly [8].

Group 3: Performance and Applications
- In experiments on kernel engineering tasks, TTT-Discover showed a clear speed advantage, producing kernels approximately twice as fast as the best human implementations [10].
- The method is most effective in continuous (verifiable) reward settings; future work aims to extend it to sparse rewards, binary rewards, and unverifiable domains [10].
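The loop described above (buffer of past attempts, branching from the most promising state, evaluating against the environment, then updating weights toward the single best outcome) can be sketched on a toy problem. This is a minimal, hypothetical illustration, not the authors' implementation: the real system fine-tunes gpt-oss-120b with RL, whereas here a single float stands in for a "solution", a Gaussian perturbation stands in for generation, and a step size stands in for learnable weights.

```python
import random

def evaluate(x):
    """Toy environment reward: higher is better, peak at x = 0."""
    return -x * x

def ttt_discover(n_steps=200, buffer_size=8, seed=0):
    """Sketch of a test-time RL loop in the spirit of TTT-Discover."""
    rng = random.Random(seed)
    buffer = [(evaluate(5.0), 5.0)]  # buffer of (reward, solution) attempts
    step_size = 1.0                  # stand-in for the model's weights
    for _ in range(n_steps):
        # Exploit the most promising buffered state most of the time,
        # but occasionally explore from a random earlier attempt.
        if rng.random() < 0.8:
            _, start = max(buffer, key=lambda p: p[0])
        else:
            _, start = rng.choice(buffer)
        attempt = start + rng.gauss(0.0, step_size)  # "generate" a solution
        reward = evaluate(attempt)                   # environment feedback
        buffer.append((reward, attempt))
        buffer.sort(key=lambda p: p[0], reverse=True)
        del buffer[buffer_size:]                     # keep top attempts only
        # "Update weights" with the single best solution in mind, rather
        # than the average reward: refine when the new attempt leads,
        # widen the search slightly when it does not.
        if reward >= buffer[0][0]:
            step_size *= 0.95
        else:
            step_size *= 1.01
    return buffer[0]  # (best_reward, best_solution)

best_reward, best_x = ttt_discover()
```

Starting from x = 5.0 (reward -25), the loop steadily branches from its best attempt and converges toward the optimum at x = 0; the exploit/explore split and the update rule are illustrative placeholders for the paper's actual selection strategy and RL objective.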
Group 4: Authors and Research Background
- The primary authors of the paper are Mert Yuksekgonul and Daniel Koceja; Yuksekgonul is pursuing a PhD at Stanford University [11][13].
- Yu Sun, the corresponding author, is a postdoctoral researcher at Stanford and a researcher at NVIDIA who has focused on continual learning and test-time training since 2019 [14][16].
