KL Divergence
Thinking Machine's new research goes viral: combining the strengths of RL and fine-tuning makes small-model training more cost-effective
量子位· 2025-10-28 01:18
Core Insights
- The article discusses new research from Thinking Machine on On-Policy Distillation, a training method that helps small language models master specialized domains [1][4].

Summary by Sections

Methodology
- On-Policy Distillation combines the strengths of two traditional training approaches: reinforcement learning (self-exploration) and supervised fine-tuning (learning directly from reference answers), yielding a more efficient training framework [3][8].
- The model learns by solving problems on its own rollouts while receiving immediate, dense guidance whenever it goes wrong; the research reports a roughly 50-100x improvement in training efficiency [4][5].

Training Phases
- Training is divided into three phases: pre-training (general capabilities), mid-training (domain-specific knowledge), and post-training (shaping target behavior) [9].
- The research focuses on the post-training phase, where the model learns to perform specific tasks effectively [6][9].

Evaluation Metrics
- The method uses per-token reverse KL divergence as its training signal: the student's reward is the negative reverse KL, so minimizing the divergence pulls the student's distribution toward the teacher's on the student's own outputs [12][15] (a minimal sketch of this computation follows this summary).

Experimental Results
- Experiment 1: with On-Policy Distillation, a smaller 8B model reached roughly 70% on a math benchmark at significantly lower computational cost than traditional methods [19][22].
- Experiment 2: the method mitigates "catastrophic forgetting", letting a model acquire new knowledge while retaining its general capabilities [23][25].

Implications
- On-Policy Distillation could enable resource-constrained individuals or small companies to train effective specialized models, broadening access to AI development [5][19].
- The findings point toward lifelong learning in AI systems, balancing new knowledge acquisition with retention of existing skills [26].
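To make the training signal concrete, below is a minimal PyTorch-style sketch of the per-token reverse KL described above. The function name `per_token_reverse_kl`, the tensor shapes, and the use of full-vocabulary log-probabilities are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits, teacher_logits):
    """
    Per-position reverse KL divergence KL(student || teacher), evaluated on a
    trajectory sampled from the student (the "on-policy" part).
    Illustrative shapes: logits are [batch, seq_len, vocab].
    Returns a [batch, seq_len] tensor; the per-token reward is its negative.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # sum_v p_student(v) * (log p_student(v) - log p_teacher(v)) at each position
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

# Sketch of the surrounding loop (assumed, not from the paper):
# 1. The student generates a response to a prompt (on-policy sampling).
# 2. The teacher scores every token of that response.
# 3. The student is updated to minimize the reverse KL, i.e., it receives
#    dense per-token feedback rather than a single sparse episode reward.
```

The dense, per-token signal is what distinguishes this setup from standard RL, where feedback typically arrives only once per completed episode.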
Microsoft VP "holds class" on X with a continuously updated series on everything RL; required reading for LLM practitioners
机器之心· 2025-05-26 01:28
Core Viewpoint
- The article covers the educational series on artificial intelligence that Nando de Freitas has started on X, focusing on reinforcement learning (RL) and its applications in large language models (LLMs) [1][2].

Summary by Sections

Introduction to AI Education
- Nando de Freitas plans to teach AI through a series of posts on X, starting with reinforcement learning and later covering diffusion and flow matching [1][2].

Learning Types
- The boundaries among unsupervised learning, supervised learning, and reinforcement learning are not settled; there is no definitive taxonomy [8][19].
- Supervised learning is framed as basic imitation, which requires high-quality expert data to work well [9].
- Reinforcement learning is framed as selective imitation: an agent can learn from suboptimal experience and still improve its policy [10][11].

Distributed Reinforcement Learning Systems
- Modern distributed RL systems have two main components: Actors, which interact with the environment and collect data, and Learners, which update the policy network from that data [23][24].
- Measuring operation durations and communication bandwidth is emphasized as essential in such systems [24][27].

Offline Reinforcement Learning
- Offline RL is particularly valuable in settings such as LLM post-training, where learning can leverage large amounts of historical data [28][29].

Single-step and Multi-step RL
- Single-step RL concerns a single immediate action, while multi-step RL requires planning over a sequence of interactions [35][39].
- Multi-step RL is harder, especially because of credit assignment: many decisions jointly determine the outcome [40][41].

Policy Gradient and Techniques
- Policy gradient methods are discussed, including baseline subtraction to reduce the variance of the reward signal [49][56].
- KL divergence is highlighted as a way to keep the post-trained policy close to the supervised fine-tuned policy [69].

Importance Sampling and PPO
- Importance sampling corrects the bias of off-policy samples, and Proximal Policy Optimization (PPO) clips the resulting ratios to keep policy updates contained [73][78] (a combined sketch of these ingredients follows this summary).
- Training models like DeepSeek-R1 integrates many of these techniques, illustrating the complexity of modern RL systems [81].

Future Directions
- Freitas plans to extend the discussion from single-step to multi-step RL, pointing to ongoing developments in the field [82].
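To tie together baseline subtraction, importance sampling with PPO-style clipping, and the KL penalty toward a reference (e.g., supervised fine-tuned) policy mentioned above, here is a minimal PyTorch-style sketch. The function name `ppo_style_loss`, the tensor shapes, and the default hyperparameters are illustrative assumptions rather than anything specified in Freitas's posts.

```python
import torch

def ppo_style_loss(logp_new, logp_old, logp_ref, rewards, values,
                   clip_eps=0.2, kl_coef=0.1):
    """
    Illustrative combination of the ingredients discussed above
    (names and hyperparameters are assumptions):
      * baseline subtraction: advantage = reward - value estimate
      * importance sampling ratio between the current policy and the
        data-collecting (old) policy, clipped as in PPO
      * a KL penalty keeping the policy near a reference (e.g., SFT) policy
    All inputs are per-token tensors of shape [batch, seq_len].
    """
    advantages = rewards - values                        # baseline subtraction
    ratio = torch.exp(logp_new - logp_old)               # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()  # PPO clipped objective
    # Crude single-sample approximation of KL(new || ref) on the collected tokens
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```

The clipping keeps each update close to the policy that generated the data, while the KL term plays the role the posts attribute to it: anchoring the post-trained model to its supervised fine-tuned starting point.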