Model-conditioned optimization bias
Every paper these star researchers publish at Meta leaves one fewer to read
量子位 · 2025-11-17 04:52
Core Insights
- The article discusses recent research by Tian Yuandong and his team on the training dynamics of Reinforcement Learning with Verifiable Rewards (RLVR), showing that despite significant performance improvements, only a small fraction of parameters is updated during training [2][4][5].

Group 1: Research Findings
- The study identifies a common misconception about sparse parameter updates in RL training: the sparsity is merely a surface phenomenon, and the deeper mechanism at work is model-conditioned optimization bias [4][10].
- The team introduces the Three-Gate Theory to explain how RL updates are constrained, guided, and filtered, so that updates concentrate in specific parameter regions [6][11].
- The research highlights that RL training achieves high returns with small parameter changes, in contrast to the dense updates seen in supervised fine-tuning (SFT) [8][9].

Group 2: Experimental Results
- Analysis of several models, including the Qwen series and DeepSeek-R1, showed that RL training produced parameter-update sparsity ranging from 36% to 92%, while SFT exhibited sparsity of only 0.6% to 18.8% [9][10].
- The experiments confirm that RLVR and SFT optimize different regions of parameter space, with RL updates showing a strong tendency to avoid high-curvature directions, which are the most sensitive to change [18][20].
- The study also demonstrates that restricting updates to non-principal components and low-magnitude weights matches the theoretical predictions and tracks dense RLVR trajectories more closely [27][28].

Group 3: Implications for Future Research
- The findings suggest that many parameter-efficient fine-tuning (PEFT) methods from the SFT era may not transfer well to RLVR, particularly those built around sparse or low-rank priors [25][26].
- The research indicates that the higher learning rates used in recent LoRA variants can lead to instability and premature collapse, because these methods force updates along the principal directions that RLVR avoids [29].
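To make the sparsity and principal-direction claims above concrete, here is a minimal sketch (in PyTorch) of how one might measure them by diffing two checkpoints: the fraction of weights left unchanged after training, and the fraction of update energy that falls in the top singular subspace of the base weight. This is an illustrative reconstruction, not the team's code; the checkpoint file names, the `atol` tolerance, the layer name, and the choice of `k` principal directions are all assumptions.

```python
import torch

def update_sparsity(base_sd: dict, tuned_sd: dict, atol: float = 0.0) -> float:
    """Fraction of parameters left (numerically) unchanged by training."""
    unchanged, total = 0, 0
    for name, w0 in base_sd.items():
        w1 = tuned_sd.get(name)
        if w1 is None or w0.shape != w1.shape:
            continue  # skip tensors that were added, removed, or reshaped
        delta = (w1.float() - w0.float()).abs()
        unchanged += (delta <= atol).sum().item()
        total += delta.numel()
    return unchanged / max(total, 1)

def principal_alignment(w0: torch.Tensor, w1: torch.Tensor, k: int = 8) -> float:
    """Fraction of the update's energy lying in the span of the base weight's
    top-k left/right singular directions (its 'principal' subspace)."""
    U, S, Vh = torch.linalg.svd(w0.float(), full_matrices=False)
    delta = (w1 - w0).float()
    proj = U[:, :k].T @ delta @ Vh[:k, :].T  # k x k block of delta in that subspace
    return (proj.norm() ** 2 / (delta.norm() ** 2 + 1e-12)).item()

if __name__ == "__main__":
    # Hypothetical checkpoint files, each holding a plain state_dict.
    base = torch.load("base_model.pt", map_location="cpu")
    rlvr = torch.load("rlvr_model.pt", map_location="cpu")
    print(f"update sparsity: {update_sparsity(base, rlvr):.1%}")
    # Alignment check for one 2-D weight matrix (name is illustrative):
    name = "model.layers.0.mlp.down_proj.weight"
    if name in base and name in rlvr:
        print(f"principal-subspace energy: {principal_alignment(base[name], rlvr[name]):.1%}")
```

Under the article's reading, an RLVR run would show high update sparsity and a low principal-subspace energy fraction, while an SFT run would show the opposite; a nonzero `atol` additionally treats tiny updates as unchanged, which is one of the measurement choices any such analysis has to make explicit.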