SimKO: Mitigating Probability Over-Concentration in RLVR Training to Optimize pass@K Performance

Core Insights

- The article discusses a limitation of existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for large language models: they improve pass@1, yet their pass@K falls below that of the base model [2][3][12].

Group 1: Problem Analysis

- The loss of exploration capability under RLVR is attributed to the model concentrating probability on a single reasoning path, which sacrifices its ability to explore diverse correct solutions [3][12].
- Current RLVR algorithms, such as GRPO and DAPO, reinforce the probability of correct answers and penalize incorrect ones. This concentrates probability on the rank-1 candidate and inhibits exploration of other potentially correct paths [8][23].
- Entropy is a limited diversity metric: it does not accurately reflect the shape of the probability distribution, so it can lead to misleading conclusions about the model's exploration capability [9][12].

Group 2: Proposed Solution

- The research team introduces SimKO (Simple Pass@K Optimization), a new algorithm designed to improve pass@K performance by addressing probability concentration [4][17].
- SimKO employs an asymmetric gradient adjustment strategy: it applies label smoothing to correct paths and imposes precise penalties on incorrect paths, balancing exploration and exploitation [17][23].
- The algorithm identifies key high-entropy tokens in reasoning paths and applies its updates only at these critical nodes to enhance the model's exploration capability [18][20].

Group 3: Experimental Results

- SimKO was evaluated on multiple mathematical reasoning benchmarks, where it significantly improved pass@K performance while maintaining or slightly improving pass@1 accuracy [21][27].
- Compared with GRPO, SimKO showed a 31.6% increase in pass@1 and a 26.3% increase in pass@128 on in-distribution tasks, and it also performed well on out-of-distribution tasks [27][26].
- The results indicate that SimKO effectively mitigates probability concentration, thereby enhancing the model's exploration ability and improving overall performance metrics [26][27].
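
To make the pass@1 versus pass@K trade-off concrete, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. The article does not state which estimator was used, so this is an assumption, and the per-problem counts in `concentrated` and `diverse` are made-up illustrative numbers, not results from the study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct answers out of n = 128 samples on three problems.
N_SAMPLES = 128
concentrated = [128, 128, 0]  # always right on two problems, never on the third
diverse = [64, 64, 16]        # sometimes right on every problem

for name, counts in [("concentrated", concentrated), ("diverse", diverse)]:
    p1 = mean_pass_at_k(counts, N_SAMPLES, 1)
    p8 = mean_pass_at_k(counts, N_SAMPLES, 8)
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f}")
```

In this toy setup the concentrated model wins on pass@1 but loses on pass@8, which is the pattern the article describes for RLVR-trained models relative to their base models.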
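
The point about entropy being a limited diversity metric can be illustrated with two hand-built next-token distributions. Both distributions are invented for illustration and do not come from the article: one puts most of its mass on the rank-1 token but has a long, thin tail, and the other spreads its mass evenly over four candidates.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_k_probs(probs, k):
    """The k largest probabilities, in descending order."""
    return sorted(probs, reverse=True)[:k]


# Hypothetical distributions over next-token candidates.
peaked_long_tail = [0.7] + [0.3 / 300] * 300  # dominant rank-1 token, thin tail
flat_top4 = [0.25] * 4                        # four equally likely candidates

for name, dist in [("peaked_long_tail", peaked_long_tail), ("flat_top4", flat_top4)]:
    print(name, "entropy:", round(entropy(dist), 3), "top-3:", top_k_probs(dist, 3))
```

The peaked distribution has the higher entropy even though it is far more concentrated on its rank-1 candidate, so ranking the two by entropy alone would suggest the opposite of what the top-K probabilities show.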
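
The asymmetric update described under Group 2 can be sketched for a single token position as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `token_update_weights` and the parameters `k`, `alpha`, `tau`, and `neg_scale` are hypothetical. In particular, the summary does not say what form the "precise penalties" take, so scaling the penalty when the sampled token is the rank-1 candidate is an assumption made here.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_update_weights(probs, sampled, advantage, *, k=3, alpha=0.1,
                         tau=1.0, neg_scale=1.5):
    """Return {token_index: weight} for one token position.

    Each weight multiplies the gradient of that token's log-probability:
    a positive weight pushes its probability up, a negative one pushes it down.
    All default hyperparameter values are illustrative assumptions.
    """
    # Low-entropy positions keep the ordinary update on the sampled token only.
    if entropy(probs) < tau:
        return {sampled: advantage}

    if advantage > 0:
        # Correct path: label smoothing over the top-k candidates, so part of
        # the reward is shared with alternatives instead of only the sampled token.
        top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        weights = {i: advantage * alpha / k for i in top}
        weights[sampled] = weights.get(sampled, 0.0) + advantage * (1 - alpha)
        return weights

    # Incorrect path (or zero advantage): penalize the sampled token, more
    # strongly when it is the rank-1 candidate (assumed form of the penalty).
    rank1 = max(range(len(probs)), key=probs.__getitem__)
    scale = neg_scale if sampled == rank1 else 1.0
    return {sampled: advantage * scale}


high_entropy = [0.4, 0.3, 0.2, 0.1]
print(token_update_weights(high_entropy, sampled=0, advantage=1.0))
print(token_update_weights(high_entropy, sampled=0, advantage=-1.0))
```

Gating on entropy keeps the modified update away from positions where the model is already nearly certain, in line with the summary's statement that updates are applied only at high-entropy tokens.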