Achieving RL-level results with SFT? Microsoft and collaborators propose an efficient post-training algorithm
机器之心·2026-03-25 07:44

Core Insights
- The article discusses the roles of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in the post-training phase of large models, highlighting their respective strengths and weaknesses [2]
- A new approach, "Towards On-Policy SFT," is proposed to combine the advantages of SFT and RL by generating on-policy data and training on it efficiently [3]

Group 1: On-Policy Data and Its Measurement
- On-policy data is defined as data generated by the model using its current capabilities, in contrast to off-policy data, which comes from external sources [4]
- Traditional metrics such as Perplexity (PPL) and Log-Likelihood are insufficient for measuring the distribution shift between on-policy and off-policy data because they are confounded by noise from problem difficulty [6]
- The article introduces a new quantification metric, Centered Log-Likelihood (CLL), which separates out this noise and provides a clearer distinction between data sources [7]

Group 2: Challenges of Supervised Fine-Tuning
- SFT operates under the assumption that every token in the training set is absolute truth, imposing severe penalties for prediction errors, which can cause catastrophic forgetting [12][13]
- The article proposes In-Distribution Fine-Tuning (IDFT) as a solution to mitigate the issues caused by rigid fitting and noise in the training data [14][17]

Group 3: Hinted Decoding and Data Transformation
- Hinted Decoding is introduced as a method to convert datasets into on-policy versions by letting the model rewrite examples while preserving its own style [20]
- The approach switches between self-distillation and normal training based on the entropy of the teacher model, which improves the model's distribution metrics [22]

Group 4: Experimental Results
- The proposed methods outperform well-known offline RL algorithms while using significantly fewer resources [25]
- The results indicate that the entropy-based adaptive switching mechanism is crucial for achieving better performance [25]

Group 5: Broader Implications
- The work has potential applications across various fields, including CoT completion and on-policy distillation, indicating relevance beyond the immediate context [28]
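To make the CLL idea in Group 1 concrete, here is a minimal sketch of what a "centered" log-likelihood could look like. The article does not give the formula, so this is an assumption: the function names and the choice of centering the response's mean token log-likelihood by a per-problem baseline (so that intrinsic problem difficulty cancels out) are hypothetical illustrations of the stated intent.

```python
def mean_log_likelihood(token_logprobs):
    """Mean per-token log-likelihood of one response under the model."""
    return sum(token_logprobs) / len(token_logprobs)


def centered_log_likelihood(response_logprobs, baseline_logprobs):
    """Hypothetical CLL sketch: subtract a per-problem baseline so that
    'hard problem, low likelihood everywhere' noise cancels, leaving a
    cleaner signal of how on-policy the response itself is."""
    return mean_log_likelihood(response_logprobs) - mean_log_likelihood(baseline_logprobs)


# Example: a response at -1.0 nats/token on a problem whose baseline sits
# at -2.0 nats/token scores positively, i.e. closer to on-policy than the
# raw log-likelihood alone would suggest.
score = centered_log_likelihood([-1.0, -1.0, -1.0], [-2.0, -2.0])
```

Under this reading, two responses with identical raw log-likelihoods on problems of different difficulty would receive different CLL scores, which is exactly the separation the raw metrics are said to lack.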
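The entropy-based switching in Group 3 can also be sketched. The article only says that training alternates between self-distillation and normal training depending on the teacher model's entropy; the threshold value, the function names, and the direction of the switch (self-distill where the teacher is confident, fall back to the normal SFT target where it is uncertain) are all assumptions for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_loss(teacher_probs, threshold=1.0):
    """Hypothetical switching rule: where the teacher is confident
    (low entropy), train by self-distillation on its distribution;
    where it is uncertain (high entropy), fall back to the normal
    hard-label SFT loss. Threshold and direction are assumptions."""
    if token_entropy(teacher_probs) < threshold:
        return "self_distill"
    return "normal_sft"


# A near-deterministic teacher token distribution triggers self-distillation;
# a flat, uncertain one triggers the normal SFT loss.
confident = choose_loss([0.97, 0.01, 0.01, 0.01])
uncertain = choose_loss([0.25, 0.25, 0.25, 0.25])
```

The appeal of a per-token rule like this is that the mixing happens adaptively inside a single training pass, which is consistent with the article's claim that the adaptive switching, rather than either loss alone, drives the improvement.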
