The Keyword Opening 2026: Self-Distillation, as Large Models Truly Move Toward "Continual Learning"
机器之心·2026-02-10 03:46

Core Insights

The article discusses an emerging consensus among researchers in the large language model (LLM) field: Self-Distillation as a solution to the challenges of continual learning in AI models [3][4].

Group 1: Self-Distillation in Continual Learning
- Traditional supervised fine-tuning (SFT) is criticized for causing "catastrophic forgetting," where acquiring new knowledge leads to a significant drop in existing capabilities [7].
- The proposed Self-Distillation Fine-Tuning (SDFT) method allows models to learn from demonstrations while maintaining their original capabilities, thus addressing the catastrophic forgetting issue [11].
- SDFT has shown superior performance on skill-learning and knowledge-acquisition tasks, achieving higher accuracy on new tasks while significantly reducing catastrophic forgetting [14].

Group 2: Reinforcement Learning via Self-Distillation
- Current reinforcement learning methods often rely on binary feedback, which can lead to severe "credit assignment" problems and stagnation in model evolution [16].
- The Self-Distillation Policy Optimization (SDPO) framework introduces a "rich feedback" environment that transforms vague scalar rewards into dense supervision signals, enhancing learning efficiency [19].
- SDPO demonstrates a significant improvement in sampling efficiency, requiring only about one-third of the attempts of traditional algorithms to achieve the same discovery rate [21].

Group 3: On-Policy Self-Distillation for Large Language Models
- The OPSD framework addresses the challenges large models face on complex reasoning tasks by creating "information asymmetry" within the model to guide self-evolution [23][25].
- OPSD achieves high learning efficiency, outperforming traditional algorithms in token utilization by 4-8x on challenging reasoning benchmarks [27].
- The three papers collectively emphasize leveraging a model's existing capabilities through context construction to achieve self-driven improvement, positioning Self-Distillation as a standard component of the post-training pipeline for large models [27].
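The shared idea behind the three methods above can be sketched with a toy numerical example: rather than pulling the student toward a one-hot demonstration token (standard SFT), the distillation target is the model's own demonstration-conditioned output distribution, so each update stays closer to the base model and forgets less. Everything here is illustrative: the logits, the logit shift standing in for "conditioning on the demonstration," and the single-position setup are all made up, and this is not the actual algorithm from any of the papers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence KL(p || q) between two discrete distributions.
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Frozen base model's next-token logits at one position (toy 4-word vocabulary).
base_logits = np.array([2.0, 1.0, 0.5, -1.0])

# "Teacher": the same model, but conditioned on the demonstration in context.
# The additive shift is a stand-in for that conditioning (hypothetical numbers).
teacher = softmax(base_logits + np.array([0.5, 1.5, -0.5, -0.5]))

# Hard SFT target: the raw demonstration token as a one-hot vector.
sft_target = np.array([0.0, 1.0, 0.0, 0.0])

def step(logits, target, lr=1.0):
    # One gradient step on the logits; the cross-entropy gradient w.r.t.
    # logits is softmax(logits) - target.
    return logits - lr * (softmax(logits) - target)

student_sft = step(base_logits, sft_target)    # imitate the hard token
student_sdft = step(base_logits, teacher)      # imitate the model's own soft target

# How far did each update drift from the base distribution?
drift_sft = kl(softmax(base_logits), softmax(student_sft))
drift_sdft = kl(softmax(base_logits), softmax(student_sdft))
print(drift_sdft < drift_sft)  # the soft self-target drifts less from the base model
```

In this toy setting the self-distilled update moves the student far less from the base distribution than the one-hot SFT update, which is the intuition behind "learning from demonstrations while maintaining original capabilities" — the supervision signal lives on the model's own output manifold rather than overriding it.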
