HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random Policy Valuation Turns Out to Be a "Masterstroke" for LLM Mathematical Reasoning
36Kr · 2025-10-31 08:28
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach to enhancing reasoning capabilities in large language models (LLMs) through a simplified reinforcement learning framework [1][2][5].

Group 1: ROVER's Methodology
- ROVER simplifies the traditional reinforcement learning process by eliminating the need for policy iteration, relying instead on the value assessment of a completely random policy to identify optimal reasoning paths [1][5][7].
- The algorithm operates in three main steps: estimating Q-values, constructing policies via softmax sampling over those Q-values to maintain diversity, and training with an objective that integrates rewards into the LLM parameters without requiring an additional value network [11][12][13]; a hedged sketch of these steps follows this summary.

Group 2: Performance Metrics
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, with gains of +8.2 points in pass@1 and +16.8 points in pass@256 [5][15].
- The diversity of strategies generated by ROVER is 17.6% higher than that of baseline methods, allowing a broader exploration of problem-solving paths [17][20].

Group 3: Experimental Results
- On tasks such as AIME24 and HMMT25, ROVER's pass@1 scores reached 30.6 and 14.6 respectively, substantial increases over the best baseline scores [15][16].
- ROVER's ability to discover new solution strategies is illustrated by its generation of multiple reasoning paths for complex problems, demonstrating its effectiveness in diverse reasoning scenarios [20][22].

Group 4: Implications and Future Directions
- The introduction of ROVER represents a paradigm shift in the approach to structured tasks, emphasizing that simplicity can lead to enhanced performance in AI applications [23].
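To make the three-step recipe concrete, here is a minimal, self-contained Python sketch of the idea as the summary describes it: compute the Q-values of a uniformly random policy in a toy deterministic tree (a stand-in for token-by-token reasoning, where reward arrives only at a finished answer), then sample actions with a softmax over those Q-values. The toy MDP, function names, and temperature are illustrative assumptions, not the paper's implementation.

```python
import math

# Hedged, minimal sketch: in a deterministic tree with terminal-only rewards,
# compute the Q-values of the UNIFORM random policy bottom-up, then sample
# actions via a softmax over those Q-values. All names here are illustrative.

def q_uniform(state, actions, transition, reward, is_terminal):
    """Exact Q-values of the uniform random policy, computed by recursion."""
    q = {}
    for a in actions(state):
        nxt = transition(state, a)
        if is_terminal(nxt):
            q[a] = reward(nxt)
        else:
            child_q = q_uniform(nxt, actions, transition, reward, is_terminal)
            # Uniform policy: a child state's value is the plain average
            # of its action values -- no argmax, hence no policy iteration.
            q[a] = sum(child_q.values()) / len(child_q)
    return q

def softmax_policy(q, tau=1.0):
    """Softmax sampling distribution over Q-values; tau > 0 keeps diversity."""
    m = max(q.values())  # subtract the max for numerical stability
    exps = {a: math.exp((v - m) / tau) for a, v in q.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Toy binary tree of depth 2; only the leaf "RL" earns reward 1.
actions = lambda s: ["L", "R"]
transition = lambda s, a: s + a
reward = lambda s: 1.0 if s == "RL" else 0.0
is_terminal = lambda s: len(s) == 2

q = q_uniform("", actions, transition, reward, is_terminal)
print(softmax_policy(q, tau=0.5))  # most mass on "R", under which the reward lies
```

Even though the policy being evaluated is purely random, its Q-values rank the root actions correctly (Q("R") = 0.5 > Q("L") = 0.0 in this toy), so softmax sampling over them both finds the rewarded path and retains nonzero probability on alternatives, which is the diversity property the summary highlights.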
HKUST Proposes a New Algorithm That Reshapes the LLM Reasoning Paradigm: Random Policy Valuation Turns Out to Be a "Masterstroke" for LLM Mathematical Reasoning
机器之心· 2025-10-31 04:11
Core Insights
- The article discusses the introduction of ROVER (Random Policy Valuation for Diverse Reasoning), a novel approach that simplifies reasoning in large language models (LLMs) by evaluating a completely random policy to find optimal reasoning paths, thereby bypassing traditional reinforcement learning (RL) iterations [3][4][11].

Group 1: ROVER's Methodology and Advantages
- ROVER significantly outperforms existing methods on various mathematical reasoning benchmarks, achieving higher quality and diversity in reasoning generation through a minimalist approach [4][9].
- The algorithm eliminates the need to maintain a value network or a reference model, making it more lightweight than traditional RL methods [9][16].
- ROVER's process consists of three simple steps: estimating Q-values, constructing policies via softmax sampling to maintain diversity, and training with an objective that reduces computational load and improves stability [19][21][24]; the two core quantities are written out after this summary.

Group 2: Performance Metrics
- On high-difficulty tasks such as AIME24, AIME25, and HMMT25, ROVER improved pass@1 by +8.2 points and pass@256 by +16.8 points, demonstrating superior performance [9][26].
- ROVER achieved a pass@1 score of 30.6 on AIME24, surpassing the best baseline (DAPO) by 19.1 points, and a pass@1 score of 14.6 on HMMT25, a 106% increase over the strongest baseline [26][27].
- The diversity of strategies generated by ROVER is 17.6% higher than that of the baselines, allowing it to cover more problem-solving paths [29][31].

Group 3: Implications and Future Directions
- The introduction of ROVER reflects a methodological shift, emphasizing that simplification rather than added complexity can drive performance improvements on structured tasks [38].
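In standard RL notation, the two quantities the summary names can be written as follows; this is an assumed textbook rendering consistent with the description above (uniform-policy Bellman recursion in a deterministic environment, plus softmax sampling), not necessarily the paper's exact notation:

$$Q^{\mathrm{unif}}(s,a) \;=\; r(s,a) \;+\; \frac{1}{|\mathcal{A}(s')|}\sum_{a' \in \mathcal{A}(s')} Q^{\mathrm{unif}}(s',a'), \qquad s' = f(s,a),$$

$$\pi(a \mid s) \;=\; \frac{\exp\!\big(Q^{\mathrm{unif}}(s,a)/\tau\big)}{\sum_{a' \in \mathcal{A}(s)} \exp\!\big(Q^{\mathrm{unif}}(s,a')/\tau\big)},$$

where $f$ is the deterministic transition, $\mathcal{A}(s)$ is the action set at state $s$, and $\tau > 0$ is a temperature controlling the diversity of sampled reasoning paths. Because the recursion averages rather than maximizes over next actions, it can be evaluated without policy iteration, which is the source of the lightweight training the summary describes.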