Peking University professor Yijie Peng's group proposes RiskPO: reshaping large-model post-training with risk measure optimization
机器之心· 2025-10-15 02:54
Core Insights
- The article discusses the limitations of traditional reinforcement learning (RL) methods in enhancing the reasoning capabilities of large models, highlighting the "mean optimization trap" that suppresses exploration and yields ineffective learning on challenging tasks [4][24].
- A new approach called RiskPO is introduced, which integrates risk-averse principles into the optimization objective, focusing on the left tail of the reward distribution to guide models past their reasoning weaknesses [7][24].

Research Background and Challenges
- The article outlines the challenges large models face in post-training, particularly the "mean optimization trap" that erodes exploration ability and leads to ineffective learning on difficult tasks [4][24].
- It emphasizes that existing methods such as GRPO improve short-term metrics but do not expand the reasoning boundary needed for complex tasks [4][24].

Technical Solution Overview
- The RiskPO approach combines risk measures with a bundling strategy to address the shortcomings of traditional mean optimization [6][7].
- Its core is the Mixed Value-at-Risk (MVaR) objective, which replaces the pursuit of the overall mean reward with an emphasis on low-reward, difficult tasks [9][10] (see the illustrative formula and code sketch at the end of this summary).

Experimental Results
- The Peking University team demonstrated the effectiveness of RiskPO across a range of tasks, achieving significant gains in reasoning capability, particularly on challenging problems [15][18].
- On AIME24, RiskPO outperformed GRPO by nearly 7 percentage points in Pass@32, and it reached a Pass@1 score of 81.8% on the MATH500 dataset, surpassing GRPO by 2.6 percentage points [15][16].

Theoretical Support and Validation
- The performance gains of RiskPO are backed by theoretical analysis and rigorous ablation studies showing that risk-averse updates effectively mitigate entropy collapse [20][21].
- The article notes that while mean-based metrics look similar early in training, risk-sensitive metrics reveal a clear advantage for RiskPO as training progresses [23][24].

Comparison with Alternative Strategies
- A comparison with risk-seeking strategies showed that focusing on easier tasks leads to rapid entropy collapse and performance stagnation, whereas the risk-averse strategy drives continuous improvement [26][27].
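The article describes the MVaR objective only qualitatively. As a hedged illustration (not necessarily the paper's exact definition), a left-tail-weighted objective of this kind can be written in terms of the reward quantile function F_R^{-1}, a tail level α, and an extra tail weight ω; all symbols here are assumptions introduced for illustration:

```latex
\mathrm{MVaR}_{\alpha,\omega}(R)
  = \frac{1}{1+\omega\alpha}
    \left( \int_0^{1} F_R^{-1}(u)\,du + \omega \int_0^{\alpha} F_R^{-1}(u)\,du \right)
```

With ω = 0 this reduces to the ordinary mean reward; as ω grows, the worst α-fraction of outcomes receives proportionally more weight, matching the article's description of emphasizing low-reward, difficult problems.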
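Below is a minimal numerical sketch of the same idea, assuming an implementation that averages rewards over small bundles of questions and upweights rollouts whose reward falls in the empirical left tail. Function and parameter names (bundled_rewards, mvar_weights, alpha, omega, bundle_size) are hypothetical and not taken from the RiskPO code.

```python
import numpy as np

def bundled_rewards(per_question_rewards, bundle_size=4, seed=0):
    """Illustrative bundling: average rewards over random bundles of questions
    so the reward signal forms a richer, less binary distribution."""
    rng = np.random.default_rng(seed)
    r = np.asarray(per_question_rewards, dtype=float)
    idx = rng.permutation(len(r))
    bundles = [idx[i:i + bundle_size] for i in range(0, len(idx), bundle_size)]
    return np.array([r[b].mean() for b in bundles])

def mvar_weights(rewards, alpha=0.25, omega=2.0):
    """Illustrative left-tail weighting: rewards at or below the empirical
    alpha-quantile get extra weight omega, so updates emphasize hard,
    low-reward problems instead of the overall mean."""
    r = np.asarray(rewards, dtype=float)
    threshold = np.quantile(r, alpha)    # empirical VaR at level alpha
    w = np.where(r <= threshold, 1.0 + omega, 1.0)
    return w / w.mean()                  # normalize to mean 1

# Toy usage: bundles containing mostly-failed, hard problems are upweighted.
rewards = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0, 0.0, 0.2]
bundles = bundled_rewards(rewards, bundle_size=4)
print(bundles)
print(mvar_weights(bundles))
```

In a policy-gradient loop, such weights would multiply the per-bundle advantages, steering updates toward the problems the model currently fails rather than toward the already-easy ones.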