On-Policy Distillation
The Thinking Machines Lab blog has just proposed on-policy distillation, with Qwen name-dropped 38 times
36Kr · 2025-10-28 02:00
Core Insights
- Thinking Machines Lab (TML) has introduced a new training method called on-policy distillation, which combines the on-policy relevance of reinforcement learning (RL), where the model learns from its own samples, with the dense, per-token reward signal of supervised fine-tuning (SFT), achieving superior performance at a lower cost [1][17].

Group 1: Methodology and Applications
- On-policy distillation is particularly effective for small models, improving both their domain performance and their capacity for continual learning [1][17].
- The method is inspired by the Qwen team's research and makes heavy use of the Qwen3 series models in its experiments [3][34].
- The training pipeline consists of three stages: pre-training, mid-training, and post-training, which target general capabilities, domain knowledge, and target behavior respectively [6][7].

Group 2: Advantages of On-Policy Distillation
- Small models trained with on-policy distillation can match or outperform larger general-purpose models in specialized domains, while also offering practical benefits such as local deployment, easier continual training, and lower inference cost [7][17].
- The method provides dense, per-token reward signals, enabling more efficient learning than traditional RL, which only offers sparse feedback at the end of a rollout (see the sketch at the end of this summary) [9][18].

Group 3: Performance and Cost Efficiency
- TML's experiments show that on-policy distillation can reach performance comparable to RL at roughly one-tenth the cost of traditional RL methods [34][41].
- The method is also computationally efficient, requiring 7-10 times fewer gradient steps to reach performance levels similar to RL [58].

Group 4: Continual Learning and Personalization
- On-policy distillation is positioned as a promising tool for continual learning, allowing models to be updated without degrading previously learned behaviors [66][70].
- The approach can effectively personalize models, enabling them to adapt to specific tasks while retaining their core capabilities [42][53].
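To make the "dense reward" contrast concrete, here is a minimal sketch of one way a per-token distillation loss can be computed on text sampled by the student, assuming PyTorch and a Hugging Face-style causal LM interface. The function name, arguments, and the full-vocabulary reverse-KL formulation are illustrative assumptions, not details taken from the TML post.

```python
# Minimal sketch of a dense, per-token distillation loss on student-sampled text.
# Assumes PyTorch and a Hugging Face-style causal LM interface; the names and the
# full-vocabulary reverse-KL formulation are illustrative, not TML's exact code.
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, input_ids, prompt_len):
    """input_ids = prompt + a completion sampled from the *student* (on-policy).
    The frozen teacher scores every position, so each completion token gets its
    own loss term -- a dense signal, unlike one sparse RL reward per rollout."""
    student_logits = student(input_ids).logits[:, :-1]      # position t predicts token t+1
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits[:, :-1]  # teacher is frozen

    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Reverse KL at each position: sum_v p_s(v) * (log p_s(v) - log p_t(v)).
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

    # Penalize only the sampled completion tokens, not the prompt.
    loss = per_token_kl[:, prompt_len - 1:].mean()
    loss.backward()
    return loss.item()
```

In this framing, every completion token contributes its own gradient signal, which is one intuition for why far fewer gradient steps are needed than with a single sequence-level RL reward.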