On-policy Distillation
Understanding the essence of RL learning!
自动驾驶之心· 2025-12-15 00:04
Core Viewpoint
The article examines the limitations of Reinforcement Learning (RL) in enhancing the reasoning capabilities of Large Language Models (LLMs), arguing that RL does not extend a model's inherent capabilities but only improves search efficiency within the boundaries the base model already has [4][5][7].

Group 1: RL Learning Limitations
- A recent paper from Tsinghua's LEAP lab concluded that RL training does not enable LLMs to surpass the reasoning ability of their base models: RL-trained models search more efficiently but do not solve problems the base models cannot [4][5].
- Under the pass@K evaluation (see the estimator sketch after this summary), RL models outperform base models at K=1, but the gap closes as K increases, and at large K the base models eventually pull ahead [4][7].
- RL models show a polarized accuracy distribution: problems are solved either almost always or almost never, suggesting that RL sharpens performance on problems the base model could already solve while giving up on the rest [8][9].

Group 2: Comparison with Distillation Learning
- Unlike RL, distillation (an SFT-style method) can expand a model's capabilities, allowing it to solve problems it previously could not [12].
- The article attributes RL's limitation to the "double-edged sword" of pre-training priors: the prior narrows exploration and reinforces existing solution paths rather than discovering new ones [14][15].
- It suggests that a better balance between exploration and exploitation during training could improve performance without shrinking the exploration range [15].

Group 3: Parameter Update Characteristics
- A paper from Meta argues that RL training produces localized parameter updates, reflecting a consistent optimization bias that limits exploration [18][21].
- Its "three gates" theory describes the constraints RL imposes on updates: updates stay close to the model's original distribution and avoid high-curvature directions in parameter space [21][22][23].
- The apparent sparsity of RL updates is an artifact of low-precision parameter representations filtering out small changes, not evidence that few parameters are actually updated [23].

Group 4: Catastrophic Forgetting and Trade-offs
- SFT training suffers from catastrophic forgetting, which RL training largely avoids, creating a trade-off between learning new skills and retaining old ones [30][31].
- A comparison table shows that while RL cannot learn new capabilities, it does avoid catastrophic forgetting, pointing to a tension between the two objectives [34].
- Recent work proposes a hybrid approach, On-policy Distillation, which combines elements of RL and SFT and may allow new skill acquisition while preventing forgetting (a minimal sketch follows below) [36].
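The pass@K comparison in Group 1 rests on the standard unbiased estimator introduced with HumanEval (Chen et al., 2021). Below is a minimal Python sketch of that estimator; the sample counts in the usage example are purely hypothetical and only illustrate the qualitative crossover described in the article, not any reported results.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021) for a single problem.
    n = total completions sampled, c = number of correct completions,
    k = evaluation budget. Returns the estimated probability that at
    least one of k randomly drawn completions is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset contains a correct one
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustration only (hypothetical counts): an "RL" model that solves one
# problem very reliably and the others never, vs. a "base" model that
# solves all three occasionally. The RL model wins at k=1 but the base
# model overtakes it as k grows.
rl_counts = [(200, 190), (200, 0), (200, 0)]    # (n, c) per problem
base_counts = [(200, 60), (200, 5), (200, 3)]

for k in (1, 8, 64):
    rl = np.mean([pass_at_k(n, c, k) for n, c in rl_counts])
    base = np.mean([pass_at_k(n, c, k) for n, c in base_counts])
    print(f"k={k:3d}  RL pass@k={rl:.3f}  base pass@k={base:.3f}")
```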
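The hybrid mentioned in Group 4, On-policy Distillation, samples completions from the student as in RL but supervises every sampled token with the teacher's distribution as in SFT. Below is a minimal PyTorch sketch of one training step; `student`, `teacher`, and their call signature (token ids in, logits out) are assumptions made for illustration and do not refer to any specific library API or to the exact method in the cited work.

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompts, max_new_tokens, optimizer):
    """One on-policy distillation step (sketch):
    1. The student samples completions (on-policy, as in RL).
    2. The teacher scores every sampled token (dense supervision, as in SFT).
    3. The student minimizes per-token reverse KL to the teacher on its own
       samples, so it is corrected exactly where its own rollouts go wrong.
    Assumed interface: `student` and `teacher` are causal LMs mapping a
    [batch, seq_len] tensor of token ids to logits of shape
    [batch, seq_len, vocab]; `prompts` is [batch, prompt_len]."""
    # 1) Sample completions from the student without tracking gradients.
    with torch.no_grad():
        tokens = prompts
        for _ in range(max_new_tokens):
            logits = student(tokens)[:, -1, :]                     # next-token logits
            next_tok = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_tok], dim=1)

    # 2) Teacher log-probs over the sampled trajectory (teacher is frozen).
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(tokens)[:, :-1, :], dim=-1)

    # 3) Student log-probs with gradients; reverse KL on the generated positions.
    student_logp = F.log_softmax(student(tokens)[:, :-1, :], dim=-1)
    gen = slice(prompts.shape[1] - 1, tokens.shape[1] - 1)          # positions predicting the completion
    p_student = student_logp[:, gen, :].exp()
    reverse_kl = (p_student * (student_logp[:, gen, :] - teacher_logp[:, gen, :])).sum(-1)

    loss = reverse_kl.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design intent matches the article's framing: sampling from the student keeps training on-policy (which is credited with avoiding catastrophic forgetting), while the dense teacher signal supplies the new capabilities that pure RL reward cannot.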