Workflow
Supervised Fine-Tuning (SFT)
Getting RL-level results with SFT? Microsoft and collaborators propose an efficient post-training algorithm
机器之心· 2026-03-25 07:44
Core Insights
- The article discusses the importance of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in the post-training phase of large models, highlighting their respective strengths and weaknesses [2]
- A new approach, "Towards On-Policy SFT," is proposed to combine the advantages of SFT and RL by generating on-policy data and training efficiently [3]

Group 1: On-Policy Data and Its Measurement
- On-policy data is defined as data generated by the model using its current capabilities, contrasting with off-policy data, which is derived from external sources [4]
- Traditional metrics like Perplexity (PPL) and Log-Likelihood are insufficient for measuring the distribution shift between on-policy and off-policy data due to noise from problem difficulty [6]
- The article introduces a new quantification metric, Centered Log-Likelihood (CLL), which separates the noise and provides a clearer distinction between data sources [7]

Group 2: Challenges of Supervised Fine-Tuning
- SFT operates under the assumption that every word in the training set is an absolute truth, leading to severe penalties for prediction errors, which can cause catastrophic forgetting [12][13]
- The article proposes In-Distribution Fine-Tuning (IDFT) as a solution to mitigate the issues caused by rigid fitting and noise in training data [14][17]

Group 3: Hinted Decoding and Data Transformation
- Hinted Decoding is introduced as a method to convert datasets into on-policy versions by allowing the model to rewrite examples while maintaining its style [20]
- The approach involves switching between self-distillation and normal training based on the entropy of the teacher model, which improves the model's distribution metrics; a sketch of such a gate follows below [22]

Group 4: Experimental Results
- The new methods proposed in the article outperform well-known offline RL algorithms while using significantly fewer resources [25]
- The results indicate that the adaptive switching mechanism based on entropy is crucial for achieving better performance [25]

Group 5: Broader Implications
- The work has potential applications across various fields, including CoT completion and on-policy distillation, indicating its relevance beyond the immediate context [28]
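The entropy-based switch between self-distillation and normal SFT is only described at a high level, so the snippet below is a minimal PyTorch illustration of what such a per-token gate could look like. The function name, the threshold value, and the gating direction (distill where the teacher is confident) are assumptions for illustration, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def entropy_gated_loss(student_logits, teacher_logits, target_ids,
                       entropy_threshold=2.0):
    """Illustrative per-token gate between self-distillation and standard SFT.
    Shapes: logits [B, T, V], target_ids [B, T]. Threshold and gating rule are
    assumptions, not the paper's formulation."""
    # Teacher entropy per token: high entropy = teacher is uncertain there.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    teacher_probs = teacher_logp.exp()
    teacher_entropy = -(teacher_probs * teacher_logp).sum(dim=-1)   # [B, T]

    # Self-distillation term: per-token KL(teacher || student).
    student_logp = F.log_softmax(student_logits, dim=-1)
    distill = F.kl_div(student_logp, teacher_probs, reduction="none").sum(dim=-1)

    # Normal SFT term: cross-entropy against the dataset tokens.
    ce = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten(),
                         reduction="none").view_as(teacher_entropy)

    # Distill where the teacher is confident, otherwise fall back to plain SFT.
    use_distill = (teacher_entropy < entropy_threshold).float()
    return (use_distill * distill + (1.0 - use_distill) * ce).mean()
```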
"Lifelong self-learning" AI is here: MIT proposes Self-Distillation Fine-Tuning (SDFT), bidding farewell to catastrophic forgetting
36Ke· 2026-02-02 11:40
Core Insights
- The article discusses a novel approach called Self-Distillation Fine-Tuning (SDFT), developed by a team from MIT, which enables AI models to learn new skills while retaining existing knowledge, achieving near "zero forgetting" capability [1][9]

Group 1: SDFT Methodology
- SDFT addresses the challenges of continual learning by transforming static demonstrations into dynamic on-policy training signals, allowing models to improve on new tasks without degrading existing capabilities [4][7]
- The method exploits the model's own in-context learning ability: the model acts as both "teacher" and "student" during training, and the objective minimizes the divergence between the outputs of the two roles (see the sketch below) [4][7]

Group 2: Experimental Validation
- Experiments showed that SDFT outperformed traditional Supervised Fine-Tuning (SFT) in tasks such as scientific question answering, tool use, and medical reasoning, demonstrating superior in-distribution generalization [8][11]
- In multi-task continual learning scenarios, SDFT allowed a single model to accumulate skills without performance degradation, while SFT exhibited significant interference, with earlier skills declining rapidly when the model moved on to new tasks [8][11]

Group 3: Performance Metrics
- In knowledge acquisition tasks, SDFT achieved 89% accuracy, surpassing SFT's 80%, and performed nearly as well as an ideal retrieval-augmented generation (RAG) system [11]
- SDFT maintained high performance on out-of-distribution problems requiring new knowledge integration, while SFT lagged significantly, indicating that SDFT incorporates new knowledge into internal representations rather than merely memorizing it [11]

Group 4: Advantages and Limitations
- The effectiveness of SDFT grows with model size, as larger models exhibit stronger in-context learning and therefore provide better guidance signals for self-distillation [12][14]
- SDFT incurs roughly 2.5 times the computational cost of traditional supervised fine-tuning because of its generate-then-learn loop, but it often reaches better overall performance in less total training time than multi-stage methods [16]

Group 5: Future Directions
- SDFT's current limitations include dependence on the model's in-context learning ability, potential stylistic artifacts inherited from the teacher role, and difficulty with tasks that require a complete change in generation patterns [18]
- Future work may explore deeper integration of SDFT with reinforcement learning, techniques to further reduce forgetting, and extension to more complex and realistic continual learning scenarios [18]
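As a rough illustration of the teacher/student setup described above, the sketch below lets the same model generate a response with the demonstration in context (teacher role) and then trains it to match that distribution without the demonstration (student role). The prompt layout, decoding settings, and function name are assumptions; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def sdft_step(model, tokenizer, demo, prompt, max_new_tokens=128):
    """One illustrative SDFT update: same weights play teacher (demo in context)
    and student (prompt only); the student matches the teacher's token
    distribution on the teacher-generated response."""
    teacher_input = tokenizer(demo + "\n" + prompt, return_tensors="pt").input_ids
    student_input = tokenizer(prompt, return_tensors="pt").input_ids

    with torch.no_grad():  # teacher pass: same model, richer context
        gen = model.generate(teacher_input, max_new_tokens=max_new_tokens)
        response = gen[:, teacher_input.size(1):]
        teacher_logits = model(torch.cat([teacher_input, response], dim=1)).logits
        teacher_logits = teacher_logits[:, teacher_input.size(1) - 1:-1]  # predicts response tokens

    # Student pass (with gradients): same response, but no demonstration in context.
    student_logits = model(torch.cat([student_input, response], dim=1)).logits
    student_logits = student_logits[:, student_input.size(1) - 1:-1]

    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    loss.backward()
    return loss
```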
Huawei unveils the software-engineering code agent SWE-Lego, pushing SFT training to its limits
机器之心· 2026-01-13 04:08
Core Insights
- The article introduces SWE-Lego from Huawei's research team, a software engineering code model trained solely with supervised fine-tuning (SFT), which achieves state-of-the-art (SOTA) performance without the complexities of reinforcement learning (RL) [2][5][43]

Group 1: Challenges and Motivation
- Software engineering tasks require complex capabilities such as long-sequence reasoning, multi-file operations, and tool use, which existing training methods struggle to provide due to high computational costs and data scarcity [4][9]
- Traditional methods often rely on complex training paradigms, including RL, which raises training complexity and cost and puts them out of reach for smaller teams [5][9]

Group 2: Three Core Components of SWE-Lego
- **Hybrid Dataset Construction**: SWE-Lego's dataset comprises 32,119 high-quality task instances and 18,110 validation trajectories, mixing real-world data from GitHub pull requests with synthetic data generated by injecting bugs into code [14][17]
- **Improved Supervised Fine-Tuning**: SWE-Lego employs two key improvements: step-level error masking, which allows the model to learn only from correct steps (see the sketch below), and difficulty-based curriculum learning, which gradually increases task complexity [26][28]
- **Test-Time Scaling (TTS)**: TTS improves performance at inference by allocating additional computation, comparing serial versus parallel expansion strategies and using generative scoring instead of regression scoring [34][40]

Group 3: Performance Metrics and Results
- SWE-Lego-Qwen3-8B and SWE-Lego-Qwen3-32B achieved performance scores of 42.2% and 52.6% respectively, surpassing many larger closed-source models [5][13]
- The hybrid dataset contributed the largest share of the improvement, 25.6 percentage points, while the improved SFT and TTS contributed 3.8 and 6.2 percentage points respectively, for a total gain of 35.6 percentage points [13][25]

Group 4: Future Directions
- The article concludes that SWE-Lego shows lightweight methods can reach SOTA performance without complex RL or iterative training, underscoring the importance of data quality and strict validation; future work will target larger models, additional programming languages, and real-world software development processes [43]
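Step-level error masking maps naturally onto the standard label-masking convention used in most SFT pipelines. The sketch below shows one plausible implementation, assuming the validation pipeline supplies token spans for each step plus a correctness flag; the names and span format are illustrative, not SWE-Lego's actual code.

```python
import torch

IGNORE_INDEX = -100  # label value skipped by cross_entropy in HF-style trainers

def mask_error_steps(input_ids: torch.Tensor, step_spans, step_is_correct):
    """Illustrative step-level error masking: only tokens in steps judged
    correct keep their labels and therefore contribute to the SFT loss.
    input_ids: 1-D token ids for one trajectory.
    step_spans: [(start, end), ...] token indices; step_is_correct: [bool, ...].
    Both are assumed to come from the trajectory-validation pipeline."""
    labels = input_ids.clone()
    for (start, end), ok in zip(step_spans, step_is_correct):
        if not ok:
            labels[start:end] = IGNORE_INDEX  # drop wrong steps from the loss
    return labels

# torch.nn.functional.cross_entropy(logits.view(-1, vocab), labels.view(-1),
#                                   ignore_index=IGNORE_INDEX) then skips the
# masked positions, so gradients flow only from the correct steps.
```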
SFT or RL: how should VLA models actually be trained?
具身智能之心· 2025-10-28 00:02
Core Insights
- The articles focus on advances in Reinforcement Learning (RL) applied to Vision-Language-Action (VLA) models, highlighting significant improvements in generalization and training efficiency

Group 1: Research Findings
- The first study investigates how RL enhances the generalization of VLA models, addressing the error accumulation and distribution shift caused by supervised fine-tuning (SFT). A new benchmark covering visual, semantic, and execution dimensions shows that RL fine-tuning with Proximal Policy Optimization (PPO) significantly improves semantic understanding and execution robustness while matching SFT on visual generalization [2]
- The second study introduces RLinf-VLA, a framework for large-scale RL training of VLA models. It proposes a novel solution to the challenges of integrating RL and VLA training, achieving up to 2.27x acceleration over baseline methods; the framework supports various VLA architectures and RL algorithms and reaches a 98.11% success rate across 130 LIBERO tasks [3]

Group 2: Practical Applications
- RLinf-VLA distills best practices for applying RL to VLA training and provides a unified interface for multiple VLA architectures and simulators, lowering the barrier to implementing RL in large-scale VLA applications [3]
- The research emphasizes RL's role in enhancing VLA model performance, pointing to a shift toward training methodologies that better exploit RL's strengths [15]
RLinf-VLA: a unified and efficient framework for VLA+RL training
具身智能之心· 2025-10-22 06:02
Core Insights
- The article presents RLinf-VLA, a unified and efficient framework for training Vision-Language-Action (VLA) models with Reinforcement Learning (RL), addressing the limitations of existing models that rely on supervised fine-tuning [2][53]
- The framework significantly enhances training efficiency and generalization, achieving high success rates across simulation tasks and outperforming traditional supervised methods in real-world applications [5][53]

Framework Design
- The RLinf-VLA framework integrates multiple simulators, algorithms, and VLA architectures, optimizing resource allocation through flexible execution modes and system-level enhancements [4][53]
- It supports three GPU allocation strategies, colocated, disaggregated, and hybrid, and lets users switch modes via configuration files, reducing system customization costs [10][11]

Model Compatibility
- The framework supports LoRA for parameter-efficient tuning, reducing memory consumption and accelerating training while maintaining performance [12]
- It is compatible with OpenVLA and its extension OpenVLA-OFT, both of which have shown strong performance on robotic manipulation benchmarks [12][22]

Multi-Simulator Support
- The framework emphasizes the role of simulators in RL, using ManiSkill and LIBERO as its primary simulators to cover diverse tasks [13]
- It provides a unified interface across simulators, simplifying task implementation, and supports multiple RL algorithms, initially PPO and GRPO [13][14]

Algorithm Design
- The framework offers flexible advantage-function and log-probability computations, allowing block-level and action-level definitions to be combined [14][15]
- It supports optimization strategies such as trajectory-length normalization and effective-action masking to improve training stability and performance; a sketch of a masked, length-normalized PPO loss follows below [19][20]

Experimental Results
- The RLinf-VLA framework demonstrated significant performance improvements, with success rates increasing by 45% to 70% across tasks relative to baseline models [22][24]
- On LIBERO tasks the framework achieved an average success rate of 98.11%, showcasing its capability for large-scale multi-task reinforcement learning [28]

High-Efficiency Performance
- Efficiency is evaluated by throughput, with substantial training-speed improvements across different GPU configurations [30][35]
- The hybrid allocation mode outperformed traditional methods, demonstrating the benefit of pipeline overlapping for resource utilization [35][37]

Real-World Deployment
- RLinf-VLA was successfully deployed in real-world environments, showing stronger zero-shot generalization than supervised fine-tuning strategies [51][53]
- The experiments indicate that RL-trained models adapt better to real-world tasks, achieving higher success rates in object manipulation [51]

Conclusion
- The RLinf-VLA framework represents a significant advance in embodied intelligence, providing a robust foundation for future research and development in VLA training [53]
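The advantage and log-probability machinery is described only at the level of options (block-level vs. action-level definitions, trajectory-length normalization, effective-action masking). The sketch below shows how a masked, optionally length-normalized PPO clipped loss could be assembled; the [batch, time] tensor layout and normalization choice are assumptions, not RLinf-VLA's exact implementation.

```python
import torch

def masked_ppo_loss(logp_new, logp_old, advantages, action_mask,
                    clip_eps=0.2, length_normalize=True):
    """PPO clipped surrogate with the two tricks named above: effective-action
    masking and trajectory-length normalization. All tensors are [batch, time];
    action_mask is 1 for valid actions and 0 for padded/invalid ones."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_step = -torch.min(ratio * advantages, clipped * advantages)

    per_step = per_step * action_mask        # zero out padded / invalid actions
    if length_normalize:
        # Average each trajectory over its own number of valid actions.
        valid = action_mask.sum(dim=1).clamp(min=1.0)
        return (per_step.sum(dim=1) / valid).mean()
    # Otherwise pool over all valid actions in the batch.
    return per_step.sum() / action_mask.sum().clamp(min=1.0)
```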
Heard everyone is going all-in on post-training? The best guide is here
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training in large language models (LLMs), highlighting the diminishing returns of scaling laws as model sizes reach hundreds of billions of parameters [2][3][11]

Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models such as OpenAI's o-series, DeepSeek R1, and Google Gemini, marking it as a necessary step toward advanced intelligence [3][11]
- The article surveys post-training methods such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12]

Group 2: Transition from Pre-Training to Post-Training
- The guide traces the evolution from pre-training to instruction fine-tuning: foundation models are trained on large corpora to predict the next token but often lack practical utility in real-world applications [7][8]
- Post-training aims to align model behavior with user expectations, emphasizing quality over quantity: its datasets are typically much smaller but more refined than pre-training data [11][24]

Group 3: Supervised Fine-Tuning (SFT)
- Supervised Fine-Tuning (SFT) transforms a pre-trained model into one that can follow user instructions effectively, relying on high-quality instruction-answer pairs [21][24]
- The quality of the SFT dataset is critical; even a small number of low-quality samples can hurt the model's performance [25][26]

Group 4: Reinforcement Learning Techniques
- Reinforcement Learning (RL) is presented as a complex but effective fine-tuning method, with reward mechanisms such as RLHF, RLAIF, and RLVR used to enhance model performance [39][41]
- The article stresses the role of reward models in RLHF, which are trained on human preference data to guide model outputs; a minimal sketch of the standard pairwise reward-model loss follows below [44][46]

Group 5: Evaluation of Post-Training Models
- Evaluating post-trained models is multifaceted, combining automated and human assessments to capture different quality aspects [57][58]
- Automated evaluations are cost-effective and quick, while human evaluations provide a more subjective quality measure, especially for nuanced tasks [59][60]
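For concreteness, the snippet below shows the standard pairwise (Bradley-Terry) loss commonly used to train RLHF reward models on human preference data. It reflects the usual formulation rather than a recipe quoted from the guide itself, and the usage line is hypothetical.

```python
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise Bradley-Terry objective: push the scalar reward of the chosen
    response above that of the rejected response for the same prompt."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage: `rm` maps token ids to a scalar score per sequence
# (e.g., a linear head on the final hidden state of the language model).
# loss = reward_model_loss(rm(chosen_ids), rm(rejected_ids))
```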
SimpleVLA-RL: breaking through the VLA training bottleneck with end-to-end online RL training
具身智能之心· 2025-09-15 00:04
Core Insights
- The article discusses the SimpleVLA-RL framework, which enhances the training of Vision-Language-Action (VLA) models in robotics through reinforcement learning (RL), addressing the limitations of traditional supervised fine-tuning (SFT) methods [2][4][30]

Group 1: Research Background and Challenges
- VLA models are crucial for integrating visual perception, language understanding, and action generation in robot control, but current training methods face significant challenges, including data scarcity and weak generalization [2][5]
- The breakthrough in large reasoning models suggests that RL can improve the sequential action planning of VLA models, but traditional RL pipelines are limited by manual reward design and the high cost of environment interaction [2][5]

Group 2: Contributions of SimpleVLA-RL
- SimpleVLA-RL is designed specifically for VLA, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training cost and improves scalability [6][9]
- The framework achieves state-of-the-art (SOTA) performance across multiple benchmarks, for example raising LIBERO's average success rate from 91.0% to 99.1% [6][12]
- SimpleVLA-RL is highly data-efficient, achieving a 96.9% average success rate on LIBERO with only one demonstration trajectory, surpassing traditional methods [16][17]

Group 3: Generalization and Real-World Application
- The framework generalizes robustly to unseen tasks, with significant gains across scenarios, indicating that it learns transferable skills rather than overfitting to specific data [22][30]
- SimpleVLA-RL is effective for sim-to-real transfer, improving real-world task success rates from 17.5% to 38.5% and validating its deployment capabilities [7][21]

Group 4: Key Discoveries
- Training revealed a "Pushcut" phenomenon, in which the RL-trained model autonomously develops strategies more efficient than the human demonstrations, showcasing the potential for innovative robot behaviors [24][30]
- The effectiveness of SimpleVLA-RL is contingent on the initial model's capability, with the largest gains observed when starting from a higher baseline success rate [28][29]
A 10,000-word deep dive: the first survey of self-evolving agents, on the road to artificial superintelligence
自动驾驶之心· 2025-09-11 23:33
Core Insights
- The article discusses the transition from static large language models (LLMs) to self-evolving agents capable of continuous learning and adaptation in dynamic environments, paving the way toward artificial superintelligence (ASI) [3][4][46]
- It emphasizes the need for a structured framework to understand and design self-evolving agents, organized around three fundamental questions: what to evolve, when to evolve, and how to evolve [6][46]

Group 1: What to Evolve
- Self-evolving agents can improve components such as models, memory, tools, and architecture over time to enhance performance and adaptability [19][20]
- The evolution of these components is crucial for the agent's ability to handle complex tasks and environments effectively [19][20]

Group 2: When to Evolve
- The survey distinguishes two time modes: intra-test-time self-evolution, which occurs during task execution, and inter-test-time self-evolution, which happens between tasks [22][23]
- Intra-test-time self-evolution allows agents to adapt in real time to specific challenges, while inter-test-time self-evolution leverages accumulated experience to improve future performance [22][23]

Group 3: How to Evolve
- Self-evolution emphasizes a continuous learning process in which agents learn from real-world interactions, seek feedback, and adjust strategies dynamically [26][27]
- Methodologies include reward-based evolution, imitation learning, and population-based approaches, each with distinct feedback types and data sources [29][30]

Group 4: Applications and Evaluation
- Self-evolving agents have significant potential in fields such as programming, education, and healthcare, where continuous adaptation is essential [6][34]
- Evaluating self-evolving agents presents unique challenges, requiring metrics that capture adaptability, knowledge retention, and long-term generalization [34][36]

Group 5: Future Directions
- Key challenges include catastrophic forgetting, knowledge transfer, and ensuring the safety and controllability of self-evolving agents [40][43]
- Future research should focus on scalable architectures, dynamic evaluation methods, and personalized agents that adapt to individual user preferences [38][44]
Large models have started playing Honor of Kings
量子位· 2025-09-02 01:40
Core Insights
- The article discusses the Think-In-Games (TiG) framework, which lets large language models play Honor of Kings while learning in real time, effectively bridging the gap between decision-making and action [1][3][4]

Group 1: TiG Framework Overview
- TiG reframes reinforcement-learning-style decision-making as a language modeling task: the model generates language-guided strategies and optimizes them through online reinforcement learning [3][4]
- The framework teaches large language models macro-level reasoning, focusing on long-term goals and team coordination rather than micro-level actions [6][9]
- The model acts more like a strategy coach than a professional player, expressing decisions as text and selecting macro actions based on the game state [7][9]

Group 2: Training Methodology
- Training follows a multi-stage pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance model capabilities [12][16]
- A relabeling algorithm tags each game state with the most critical macro action, providing a robust signal for subsequent training [9][11]
- Group Relative Policy Optimization (GRPO) is employed to maximize the advantage of generated content while limiting divergence from the reference model; a sketch of the GRPO objective follows below [9][11]

Group 3: Experimental Results
- Combining SFT with GRPO significantly improves performance: Qwen-2.5-32B's accuracy rose from 66.67% to 86.84% after applying GRPO [14][15]
- The Qwen-3-14B model reached 90.91% accuracy after training with SFT and GRPO [2][15]
- TiG is competitive with traditional reinforcement learning methods while significantly reducing data and computational requirements [17]
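The summary names GRPO but not its equations, so the sketch below reconstructs the usual GRPO ingredients: group-relative advantage normalization plus a clipped surrogate with a KL penalty toward the reference model. Hyperparameter values and function names are assumptions for illustration, not values reported for TiG.

```python
import torch

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each sampled response's reward by
    the mean and std of its own group (one group = several responses to the
    same game state / prompt). Shape: [num_groups, group_size]."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True).clamp(min=1e-6)
    return (group_rewards - mean) / std

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate plus a KL penalty toward the reference model, which is
    how TiG is described as limiting divergence; clip_eps and kl_coef are
    assumed values, and the k3 KL estimator is one common choice."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Unbiased estimator of KL(pi_new || pi_ref) from per-token log-probs.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - kl_coef * kl).mean()
```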
Still grinding on end-to-end models? Embodied-R1 takes a different path: SOTA performance with "pointing" + reinforcement learning!
具身智能之心· 2025-09-02 00:03
Core Insights
- The article discusses the development of Embodied-R1, a new model designed to bridge the "seeing-to-doing gap" in robotics, a long-standing challenge in the field [2][32]
- The model introduces a novel intermediate representation called "pointing," which translates complex operational instructions into visual points, enhancing the robot's ability to understand and execute tasks [3][10]

Group 1: Challenges in Robotics
- The seeing-to-doing gap is primarily caused by data scarcity and morphological heterogeneity, which hinder effective knowledge transfer in robotics [2]
- Existing vision-language-action (VLA) models struggle in new environments and often lose zero-shot operational capability [2][10]

Group 2: Embodied-R1 Model Overview
- Embodied-R1 is a 3-billion-parameter model that uses pointing as an intuitive intermediate representation, defining four key capabilities: REG (representational understanding), RRG (spatial region pointing), OFG (functional part pointing), and VTG (visual trajectory generation) [10][12]
- The model leads on 11 spatial reasoning and pointing tasks, achieving a 56.2% success rate in the SIMPLEREnv simulation and 87.5% across eight real-world tasks without fine-tuning [10][27]

Group 3: Training Methodology
- Training follows a two-phase curriculum, focusing first on spatial reasoning and then on embodied pointing, using a dataset of 200,000 samples [15][16]
- Reinforcement fine-tuning (RFT) is introduced to address the "multi-solution dilemma" of pointing tasks: many distinct points are equally correct, so the model is rewarded for a generalized understanding rather than memorizing specific answers; an illustrative verifiable reward is sketched below [17][19]

Group 4: Performance Metrics
- Embodied-R1 outperforms other models across benchmarks, achieving state-of-the-art (SOTA) results on REG, RRG, OFG, and VTG tasks [29][30]
- The model's trajectory generation quality is the best among all compared models, which is crucial for reliable robot execution [29]

Group 5: Robustness and Adaptability
- The model is robust to visual disturbances, maintaining performance under poor lighting, background changes, and other challenging conditions [31]
- This adaptability is attributed to the pointing representation, which strengthens the robustness of the learned policy [31]

Group 6: Conclusion
- Embodied-R1 marks a significant step toward closing the long-standing seeing-to-doing gap and offers a promising path toward more capable and generalizable embodied AI systems [32]
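One way to resolve the multi-solution dilemma during RFT is a verifiable reward that accepts any point inside the target region. The sketch below is a guess at such a reward under the assumption that the acceptable region is available as a binary mask; the article does not specify this interface, so the function and its arguments are hypothetical.

```python
def pointing_reward(pred_point, target_mask):
    """Hypothetical verifiable reward for a pointing task: any predicted pixel
    inside the target region scores 1.0, otherwise 0.0, reflecting the fact
    that pointing admits many equally correct answers.

    pred_point:  (x, y) integer pixel coordinates predicted by the model
    target_mask: 2-D list/array of 0/1 marking the acceptable region
    """
    x, y = pred_point
    h, w = len(target_mask), len(target_mask[0])
    inside = 0 <= y < h and 0 <= x < w and bool(target_mask[y][x])
    return 1.0 if inside else 0.0

# Example: a 3x3 mask whose centre pixel is the only acceptable target.
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
assert pointing_reward((1, 1), mask) == 1.0
assert pointing_reward((0, 2), mask) == 0.0
```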