Is SFT far inferior to RL? A timeless razor principle opens the door to training "lifelong learning" large models
机器之心·2025-09-09 11:46

Core Viewpoint

The article discusses the challenges and recent advances in large models, focusing on the phenomenon of catastrophic forgetting and the advantages of reinforcement learning (RL) over supervised fine-tuning (SFT) in mitigating it [1][3][29].

Group 1: Large Models and Their Challenges

- The era of large models has arrived; they have become a core component of intelligent infrastructure supporting applications such as language processing, visual analysis, and robotics [1].
- Most deployed large models are "static" and lack the capacity for dynamic learning and self-improvement, which is essential for achieving more general artificial intelligence (AGI) [2][3].
- Catastrophic forgetting occurs when a model loses previously learned skills while learning new tasks, posing a significant challenge for long-term learning agents [3].

Group 2: Research Insights on Catastrophic Forgetting

- Researchers have proposed various methods to address catastrophic forgetting, including regularization, experience replay, and parameter tuning [5].
- A recent study from MIT's Improbable AI Lab revealed fundamental patterns of forgetting in large models and the training strategies that govern it, attracting significant attention [6][7].

Group 3: Findings from the Study

- The study compared two common post-training methods, supervised fine-tuning (SFT) and reinforcement learning (RL), and found that RL is less prone to forgetting [8][29].
- A new principle called the "forgetting law" was introduced: the KL divergence between the fine-tuned policy and the base policy is a key predictor of forgetting [10][30].
- The research demonstrated that RL retains prior knowledge better while learning new tasks, whereas SFT often sacrifices old knowledge for new-task performance [15][29].
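The "forgetting law" above can be made concrete with a small numerical sketch: treat each policy as a set of next-token distributions and compare them with KL divergence. The function name, toy vocabulary, and the choice of KL direction below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def policy_kl(base_probs: np.ndarray, tuned_probs: np.ndarray) -> float:
    """Mean KL(tuned || base) over token positions.

    Each row is a next-token distribution over a toy vocabulary.
    The direction of the KL is one common choice; the study may
    measure it differently.
    """
    eps = 1e-12  # avoid log(0)
    kl_per_pos = np.sum(
        tuned_probs * (np.log(tuned_probs + eps) - np.log(base_probs + eps)),
        axis=-1,
    )
    return float(kl_per_pos.mean())

# Hypothetical 3-token vocabulary, two positions.
base = np.array([[0.70, 0.20, 0.10],
                 [0.50, 0.30, 0.20]])
near = np.array([[0.65, 0.25, 0.10],   # small drift from the base policy
                 [0.50, 0.35, 0.15]])
far  = np.array([[0.05, 0.15, 0.80],   # large drift from the base policy
                 [0.10, 0.10, 0.80]])

print(policy_kl(base, near))  # small KL
print(policy_kl(base, far))   # much larger KL
```

Under the forgetting law, the policy with the larger KL to the base model would be predicted to forget more of its prior capabilities, regardless of which training method produced it.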
Group 4: Mechanisms and Theoretical Contributions

- The study identified that the online nature of RL contributes to its KL-divergence minimization, which helps retain prior knowledge [21][30].
- The authors provided a theoretical basis for RL's KL-minimizing behavior, explaining that RL naturally prefers solutions closer to the original model [24][30].
- The findings suggest that future training methods should explicitly aim to minimize KL divergence to achieve continual learning without forgetting [31][32].
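One way to act on the final recommendation, which is a sketch rather than the authors' method, is to add an explicit KL anchor to the fine-tuning objective, as is common in RLHF-style training: the loss becomes new-task cross-entropy plus a penalty for drifting away from the base policy. The coefficient `beta`, the function names, and the KL direction here are all illustrative assumptions.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_regularized_loss(logits, base_logits, target_ids, beta=0.1):
    """New-task cross-entropy plus a KL anchor to the base policy.

    beta trades off new-task fit against staying close to the base
    model; larger beta means less drift and, per the forgetting law,
    less predicted forgetting.
    """
    eps = 1e-12
    p = softmax(logits)
    p_base = softmax(base_logits)
    n = len(target_ids)
    ce = -np.log(p[np.arange(n), target_ids] + eps).mean()
    kl = np.sum(p * (np.log(p + eps) - np.log(p_base + eps)), axis=-1).mean()
    return ce + beta * kl

# Toy example: two positions, 3-token vocabulary.
base_logits = np.array([[2.0, 0.5, 0.1],
                        [1.0, 1.0, 0.2]])
drifted = base_logits + np.array([[-3.0, 0.0, 3.0],
                                  [-3.0, 0.0, 3.0]])
targets = np.array([2, 2])  # hypothetical new-task labels
```

When `logits == base_logits` the KL term vanishes and the loss reduces to plain cross-entropy; as the policy drifts, the penalty grows in proportion to `beta`.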