Revealed: The Long-Overlooked Critical Flaws in RLVR/GRPO
机器之心 · 2026-01-30 08:49
Core Insights
- The article discusses recent advances in large models, focusing on Reinforcement Learning with Verifiable Rewards (RLVR), which lets models improve through self-exploration against automatically checkable rewards rather than relying on an external learned scorer [2]
- A critical flaw is identified: group-relative advantage estimation is systematically biased, consistently underestimating advantages on difficult prompts and overestimating them on simple ones [3][5]
- This bias is not mere sampling noise; it is inherent to the statistical structure of the group-relative estimator [6][23]

Group-relative Advantage Estimation
- For each prompt, multiple responses are sampled, and their average reward serves as the baseline against which each individual response is evaluated [9][10]
- The expected advantage of a response is defined mathematically as its reward minus the prompt's expected reward [12][14]
- Prompt difficulty is categorized by expected reward: values below 0.5 count as difficult, values above 0.5 as simple [16]

Systematic Bias in Estimation
- A theorem shows that group-relative advantage estimation carries a difficulty-dependent systematic bias, underestimating advantages on difficult prompts and overestimating them on simple ones [23][30]
- Visual analysis shows the bias growing as a prompt's difficulty deviates from 0.5, with smaller group sizes exacerbating it [24][25]
- Worked examples illustrate how the estimation bias can badly misrepresent true advantages, especially on challenging prompts [26][28]

Implications for RLVR Training
- The systematic bias can produce imbalanced gradient signals during training, hindering effective exploration and favoring simple
samples over challenging ones [40]
- The article proposes adaptively adjusting advantage estimates by prompt difficulty: amplify the estimated advantages of difficult prompts to encourage exploration, and suppress those of simple prompts [40][42]
- The proposed HA-DW algorithm dynamically assesses prompt difficulty and adjusts advantage estimation accordingly, improving performance on difficult prompts [41][42]
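One concrete mechanism behind such a bias is easy to verify numerically (a minimal illustration under binary 0/1 rewards, not necessarily the article's full theorem): because the group-mean baseline includes the response's own reward, the expected estimated advantage of a correct answer shrinks from the true 1 − p to (n − 1)(1 − p)/n, an absolute shortfall of (1 − p)/n that is largest precisely for the hardest prompts (small p). A Monte Carlo sketch, with all function names my own:

```python
import random

def group_relative_advantages(rewards):
    """Advantage of each response = its reward minus the group mean
    (the group-relative baseline, without std normalization)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def mean_advantage_of_correct(p, group_size, trials=200_000, seed=0):
    """Monte Carlo estimate of E[estimated advantage | response correct]
    for a prompt whose responses succeed independently with probability p."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(trials):
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(group_size)]
        for r, a in zip(rewards, group_relative_advantages(rewards)):
            if r == 1.0:
                total += a
                count += 1
    return total / count

# Difficult prompt (p = 0.2, groups of 4): the true advantage of a correct
# response is 1 - p = 0.8, but the group estimate averages
# (n-1)(1-p)/n = 0.6 -- the rare positive signal is systematically shrunk.
hard = mean_advantage_of_correct(p=0.2, group_size=4)
print(f"estimated ~{hard:.2f} vs true 0.80")
```

The same algebra shows the mirror effect on simple prompts: a failure's true advantage −p is estimated at −(n − 1)p/n, i.e. overestimated (pulled toward zero) by p/n, which is largest when p is high.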
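The summary does not spell out HA-DW's exact form. The sketch below is a hypothetical rendering of the stated idea only: proxy difficulty by the group's empirical success rate, then amplify advantages for hard prompts and suppress them for easy ones. The function name, the linear scaling rule, and `alpha` are all assumptions, not the article's algorithm:

```python
def difficulty_adjusted_advantages(rewards, alpha=0.5):
    """Hypothetical difficulty-aware rescaling (NOT the article's HA-DW,
    whose exact form the summary does not specify).

    Difficulty proxy: the group's empirical success rate p_hat.
    p_hat < 0.5 -> difficult prompt -> amplify advantages (encourage
    exploration); p_hat > 0.5 -> simple prompt -> suppress them."""
    p_hat = sum(rewards) / len(rewards)
    # Scale factor > 1 for hard prompts, < 1 for easy ones; alpha controls
    # how aggressively the adjustment reacts to estimated difficulty.
    scale = 1.0 + alpha * (0.5 - p_hat) * 2.0
    return [scale * (r - p_hat) for r in rewards]

# Hard prompt, one success in four: p_hat = 0.25, scale = 1.25, so the
# lone correct response's advantage grows from 0.75 to 0.9375.
print(difficulty_adjusted_advantages([1.0, 0.0, 0.0, 0.0]))
```

Any monotone-decreasing scale in p_hat would serve the same purpose; the linear form is just the simplest choice that amplifies below 0.5 and suppresses above it.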