Core Insights
- The article discusses "Test-Time Training to Discover" (TTT-Discover), an approach that lets large language models (LLMs) improve on a specific task through reinforcement learning during the testing phase, rather than relying solely on pre-trained knowledge [2][3][4].

Group 1: Methodology
- TTT-Discover treats a single test problem as an environment in which standard reinforcement learning techniques can be applied, aiming to produce one best solution rather than averaging over multiple solutions [3][9].
- The method combines two key components: an entropy objective that biases sampling toward high-reward solutions, and a PUCT-inspired state-reuse strategy that prioritizes the most promising paths during the search [9][10].

Group 2: Results and Achievements
- TTT-Discover has succeeded across a range of tasks, outperforming existing systems such as DeepMind's AlphaEvolve and setting new records on mathematical problems and GPU kernel optimization [3][11][13].
- On the Erdős minimum overlap problem, TTT-Discover reached a value of 0.380876, improving on (i.e., lower than, since smaller is better here) the previous human best of 0.380927 and the AI best of 0.380924 [11][12].
- A TriMul kernel developed with TTT-Discover ran 50% faster than the best human submission on A100 GPUs, and overall performance improved by more than 15% over the human best across all GPU types [13][14].

Group 3: Future Directions
- The team acknowledges that TTT-Discover is currently limited to problems with continuous rewards, and aims to extend it to settings with sparse or binary rewards, such as mathematical proofs and scientific hypotheses [17].
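The search loop sketched in the methodology bullets — treating a single problem as an environment, up-weighting high-reward samples, and reusing promising states via a PUCT-style priority — can be illustrated with a minimal toy example. Everything below is an illustrative assumption, not the authors' implementation: the one-dimensional problem, the `reward` function, the `puct_score` constant, and the Gaussian perturbation stand in for an LLM policy trained with an entropy objective.

```python
import math
import random

def reward(x):
    # Hypothetical continuous reward for the toy problem: peaks at x = 0.7.
    return -abs(x - 0.7)

def puct_score(state, total_visits, c=1.4):
    # PUCT-style priority: exploit states with high observed reward, but add an
    # exploration bonus that shrinks as a state accumulates visits.
    return state["best_reward"] + c * math.sqrt(math.log(total_visits + 1) / (state["visits"] + 1))

def ttt_search(n_iters=200, seed=0):
    rng = random.Random(seed)
    # The pool of reusable states; each stores a candidate solution and its reward.
    first = {"x": rng.random(), "best_reward": 0.0, "visits": 0}
    first["best_reward"] = reward(first["x"])
    states, total = [first], 1
    for _ in range(n_iters):
        # State reuse: expand the most promising stored state by PUCT score.
        s = max(states, key=lambda st: puct_score(st, total))
        s["visits"] += 1
        total += 1
        # "Policy sample": perturb the chosen state to propose a new candidate.
        x_new = s["x"] + rng.gauss(0, 0.1)
        r_new = reward(x_new)
        # Keep only improving samples as new reusable states — a crude stand-in
        # for biasing the sampling distribution toward high-reward solutions.
        if r_new > s["best_reward"]:
            states.append({"x": x_new, "best_reward": r_new, "visits": 0})
    best = max(states, key=lambda st: st["best_reward"])
    return best["x"], best["best_reward"]

x, r = ttt_search()
```

The key design point mirrored from the summary is that the goal is a single best solution: the loop returns the one highest-reward candidate found, rather than an average over samples.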
2x faster than human experts: Stanford and NVIDIA release TTT-Discover, using "test-time reinforcement learning" to crack hard scientific problems