Random Policy Valuation
HKUST proposes a new algorithm that reshapes the LLM reasoning paradigm: random policy valuation turns out to be a masterstroke for LLM mathematical reasoning
36Kr · 2025-10-31 08:28
Core Insights
- The article introduces ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach that enhances the reasoning capabilities of large language models (LLMs) through a simplified reinforcement learning framework [1][2][5].

Group 1: ROVER's Methodology
- ROVER simplifies the traditional reinforcement learning pipeline by eliminating policy iteration entirely, relying instead on the value estimates of a completely random policy to identify optimal reasoning paths [1][5][7].
- The algorithm operates in three main steps: estimating Q-values, constructing a policy via softmax sampling over those Q-values to maintain diversity, and training with an objective that folds the rewards into the LLM's own parameters, so no additional value network is required [11][12][13]. (A toy sketch of the first two steps appears after this summary.)

Group 2: Performance Metrics
- ROVER significantly outperforms existing methods on a range of mathematical reasoning benchmarks, with notable gains in pass rates such as +8.2 on pass@1 and +16.8 on pass@256 [5][15]. (The standard pass@k estimator is also sketched below.)
- The diversity of strategies generated by ROVER improves by 17.6% over baseline methods, enabling a broader exploration of problem-solving paths [17][20].

Group 3: Experimental Results
- On specific tasks such as AIME24 and HMMT25, ROVER's pass@1 scores reached 30.6 and 14.6 respectively, substantial increases over the best baseline scores [15][16].
- ROVER's ability to discover new solution strategies is illustrated by its generation of multiple distinct reasoning paths for complex problems, demonstrating its effectiveness in diverse reasoning scenarios [20][22].

Group 4: Implications and Future Directions
- The introduction of ROVER represents a paradigm shift in the approach to structured tasks, showing that simplicity can lead to enhanced performance in AI applications [23].
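To make the methodology concrete, here is a minimal, illustrative Python sketch of the first two steps as the article describes them: evaluating the Q-values of a uniformly random policy (no policy iteration), then sampling actions via softmax over those Q-values to keep generations diverse. The toy tabular tree MDP and all names such as random_policy_q and softmax_policy are hypothetical stand-ins; the actual ROVER method operates on LLM token sequences and integrates these quantities into the model's parameters rather than a table.

```python
import numpy as np

# Toy deterministic tree MDP standing in for a reasoning task:
# states are nodes, actions move to children, leaves carry a 0/1 reward.
# All names here are illustrative, not from the ROVER implementation.

N_STATES, N_ACTIONS = 7, 2
# transitions[s, a] -> next state (terminal states loop to themselves)
transitions = np.array([[1, 2], [3, 4], [5, 6],
                        [3, 3], [4, 4], [5, 5], [6, 6]])
# reward on entering a state (only leaf 6 "solves" the problem)
rewards = np.array([0., 0., 0., 0., 0., 0., 1.])
terminal = np.array([False, False, False, True, True, True, True])

def random_policy_q(gamma=1.0, sweeps=50):
    """Q-values of the *uniformly random* policy -- no policy iteration."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(sweeps):
        for s in range(N_STATES):
            if terminal[s]:
                continue
            for a in range(N_ACTIONS):
                s2 = transitions[s, a]
                # value of s2 under the random policy = mean over its actions
                v2 = 0.0 if terminal[s2] else q[s2].mean()
                q[s, a] = rewards[s2] + gamma * v2
    return q

def softmax_policy(q_row, temperature=1.0):
    """Sample an action via softmax over Q-values, preserving diversity."""
    logits = q_row / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return np.random.choice(len(p), p=p)

q = random_policy_q()
s = 0
while not terminal[s]:
    a = softmax_policy(q[s])
    s = transitions[s, a]
print("reached state", s, "with reward", rewards[s])
```

In this toy tree, the random policy's Q-values already rank the rewarding branch highest at every node, mirroring the article's claim that valuing a completely random policy suffices to identify good reasoning paths, while the softmax keeps the sampled paths diverse rather than collapsing onto a single greedy solution.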
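For readers unfamiliar with the pass@k figures cited above: pass@k is the probability that at least one of k sampled solutions is correct. The article does not spell out how it is computed; the snippet below is a sketch of the standard unbiased estimator from Chen et al. (2021), which is the conventional definition assumed here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k draws from n generations (c correct) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 256 samples, 40 correct: chance that one of 8 random draws succeeds
print(round(pass_at_k(n=256, c=40, k=8), 3))
```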