Supervised Fine-Tuning (SFT)
Heard Everyone Is Going All In on Post-Training? Here Is the Best Guide
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training for large language models (LLMs), highlighting the diminishing returns of scaling laws as model sizes reach hundreds of billions of parameters [2][3][11].
Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models such as OpenAI's o-series, DeepSeek R1, and Google Gemini, marking it as a necessary step toward advanced intelligence [3][11].
- The article introduces a range of post-training methods, including Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12].
Group 2: Transition from Pre-Training to Post-Training
- Foundation models are pre-trained on large corpora to predict the next token, yet often lack practical utility in real-world applications, which motivates instruction fine-tuning [7][8].
- Post-training aims to align model behavior with user expectations, favoring quality over quantity: its datasets are typically much smaller but more carefully curated than pre-training corpora [11][24].
Group 3: Supervised Fine-Tuning (SFT)
- SFT transforms a pre-trained model into one that can follow user instructions effectively, relying on high-quality instruction-answer pairs (a minimal loss sketch follows this summary) [21][24].
- The quality of the SFT dataset is critical: even a small number of low-quality samples can degrade the model's performance [25][26].
Group 4: Reinforcement Learning Techniques
- RL is highlighted as a complex yet effective fine-tuning method, with reward mechanisms such as RLHF, RLAIF, and RLVR employed to raise model performance [39][41].
- The article underlines the role of reward models in RLHF, which are trained on human preference data to guide model outputs [44][46].
Group 5: Evaluation of Post-Training Models
- Evaluating post-trained models is multifaceted, requiring a combination of automated and human assessments to capture different quality dimensions [57][58].
- Automated evaluations are cheap and fast, while human evaluations provide a more subjective quality measure, especially for nuanced tasks [59][60].
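As a concrete anchor for the SFT discussion in Group 3, the usual training objective is plain next-token cross-entropy restricted to the answer tokens of each instruction-answer pair. A minimal sketch, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits` and the common convention of marking prompt positions with `-100`:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Cross-entropy over answer tokens only.

    `labels` copies `input_ids` but sets prompt positions to -100, so the
    instruction part of each instruction-answer pair is excluded from the loss.
    """
    logits = model(input_ids=input_ids).logits[:, :-1, :]  # predict token t+1 from the prefix up to t
    targets = labels[:, 1:]                                # shift labels to align with predictions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                                 # masked prompt positions contribute nothing
    )
```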
SimpleVLA-RL: Breaking the VLA Model Training Bottleneck with End-to-End Online RL Training
具身智能之心· 2025-09-15 00:04
Core Insights
- The article discusses the SimpleVLA-RL framework, which strengthens the training of Vision-Language-Action (VLA) models in robotics with reinforcement learning (RL), addressing the limitations of traditional supervised fine-tuning (SFT) [2][4][30].
Group 1: Research Background and Challenges
- VLA models are crucial for integrating visual perception, language understanding, and action generation in robotic control, but current training methods face significant challenges, including data scarcity and weak generalization [2][5].
- The breakthrough in large reasoning models suggests that RL can improve the sequential action planning of VLA models, yet traditional RL methods are limited by manual reward design and the high cost of environment interaction [2][5].
Group 2: Contributions of SimpleVLA-RL
- SimpleVLA-RL is designed specifically for VLA training, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training cost and improves scalability (a schematic rollout sketch follows this summary) [6][9].
- The framework achieves state-of-the-art (SOTA) performance across multiple benchmarks, with notable gains in success rate, such as LIBERO's average success rate rising from 91.0% to 99.1% [6][12].
- SimpleVLA-RL demonstrates strong data efficiency, reaching a LIBERO average success rate of 96.9% with only one demonstration trajectory, surpassing traditional methods [16][17].
Group 3: Generalization and Real-World Application
- The framework generalizes robustly to unseen tasks, with significant performance improvements across scenarios, indicating that it learns transferable skills rather than overfitting to specific data [22][30].
- SimpleVLA-RL proves effective in sim-to-real transfer, with real-world task success rates improving from 17.5% to 38.5%, validating its deployment capability [7][21].
Group 4: Key Discoveries
- Training surfaced the "Pushcut" phenomenon, in which the RL-trained model autonomously develops strategies more efficient than the human demonstrations, showcasing the potential for novel robotic behaviors [24][30].
- The effectiveness of SimpleVLA-RL is contingent on the initial model's capability, with the largest gains observed when starting from a higher baseline success rate [28][29].
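The interactive trajectory sampling with an outcome-only reward described in Group 2 can be pictured as a rollout loop that scores each completed trajectory by task success alone. A sketch under assumed interfaces (the `env`/`policy` API and the binary reward shape are illustrative, not the authors' code):

```python
def collect_rollouts(env, policy, num_episodes=8, max_steps=200):
    """Sample full action trajectories and attach a sparse success reward."""
    trajectories = []
    for _ in range(num_episodes):
        obs = env.reset()
        episode = []
        for _ in range(max_steps):
            action = policy.act(obs)                  # VLA model maps observation + instruction to an action
            next_obs, done, success = env.step(action)
            episode.append((obs, action))
            obs = next_obs
            if done:
                break
        reward = 1.0 if success else 0.0              # outcome-only reward: no hand-designed shaping terms
        trajectories.append((episode, reward))
    return trajectories
```

In practice such rollouts would be gathered from many environments rendered in parallel and fed to a policy-gradient update.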
A 10,000-Character Deep Dive! The First Survey of Self-Evolving Agents: The Road to Artificial Superintelligence
自动驾驶之心· 2025-09-11 23:33
Core Insights
- The article discusses the transition from static large language models (LLMs) to self-evolving agents capable of continuous learning and adaptation in dynamic environments, paving the way towards artificial superintelligence (ASI) [3][4][46].
- It emphasizes the need for a structured framework to understand and design self-evolving agents, focusing on three fundamental questions: what to evolve, when to evolve, and how to evolve [6][46].
Group 1: What to Evolve
- Self-evolving agents can improve various components such as models, memory, tools, and architecture over time to enhance performance and adaptability [19][20].
- The evolution of these components is crucial for the agent's ability to handle complex tasks and environments effectively [19][20].
Group 2: When to Evolve
- The article categorizes self-evolution into two time modes: intra-test-time self-evolution, which occurs during task execution, and inter-test-time self-evolution, which happens between tasks [22][23].
- Intra-test-time self-evolution allows agents to adapt in real time to specific challenges, while inter-test-time self-evolution leverages accumulated experiences for future performance improvements [22][23].
Group 3: How to Evolve
- Self-evolution emphasizes a continuous learning process where agents learn from real-world interactions, seek feedback, and adjust strategies dynamically [26][27].
- Various methodologies for self-evolution include reward-based evolution, imitation learning, and population-based approaches, each with distinct feedback types and data sources [29][30].
Group 4: Applications and Evaluation
- Self-evolving agents have significant potential in various fields, including programming, education, and healthcare, where continuous adaptation is essential [6][34].
- Evaluating self-evolving agents presents unique challenges, requiring metrics that capture adaptability, knowledge retention, and long-term generalization capabilities [34][36].
Group 5: Future Directions
- The article highlights the importance of addressing challenges such as catastrophic forgetting, knowledge transfer, and ensuring safety and controllability in self-evolving agents [40][43].
- Future research should focus on developing scalable architectures, dynamic evaluation methods, and personalized agents that can adapt to individual user preferences [38][44].
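The Group 2 distinction above between intra- and inter-test-time self-evolution can be made concrete with a toy agent loop: within a task the agent revises its plan from feedback, and between tasks it consolidates the episode into persistent memory. A purely illustrative sketch; the `llm`, `task`, and method names are hypothetical, not from the survey:

```python
class SelfEvolvingAgent:
    def __init__(self, llm, memory=None):
        self.llm = llm
        self.memory = memory if memory is not None else []   # persists across tasks

    def solve(self, task, max_revisions=3):
        plan = self.llm.plan(task, context=self.memory)
        for _ in range(max_revisions):                        # intra-test-time: adapt while executing the task
            result, feedback = task.execute(plan)
            if result.success:
                break
            plan = self.llm.revise(plan, feedback)
        # Inter-test-time: distill the finished episode into reusable experience.
        self.memory.append(self.llm.summarize(task, plan, result))
        return result
```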
Large Models Are Now Playing Honor of Kings
量子位· 2025-09-02 01:40
Core Insights
- The article discusses the Think-In-Games (TiG) framework, which lets large language models play Honor of Kings while learning in real time, effectively bridging the gap between decision-making and action [1][3][4].
Group 1: TiG Framework Overview
- TiG recasts reinforcement-learning-style decision-making as a language modeling task, enabling models to generate language-guided strategies and optimize them through online reinforcement learning [3][4].
- The framework lets large language models learn macro-level reasoning skills, focusing on long-term goals and team coordination rather than only micro-level actions [6][9].
- The model acts more like a strategic coach than a professional player, converting decisions into text and selecting macro actions based on the game state [7][9].
Group 2: Training Methodology
- The training process uses a multi-stage approach combining supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance model capabilities [12][16].
- The research team used a "relabeling algorithm" to ensure each game state is tagged with the most critical macro action, providing a robust signal for subsequent training [9][11].
- The Group Relative Policy Optimization (GRPO) algorithm is employed to maximize the advantage of generated content while limiting divergence from a reference model (a simplified sketch of this objective follows this summary) [9][11].
Group 3: Experimental Results
- The combination of SFT and GRPO significantly improves model performance, with Qwen-2.5-32B's accuracy increasing from 66.67% to 86.84% after applying GRPO [14][15].
- The Qwen-3-14B model reached 90.91% accuracy after training with SFT and GRPO [2][15].
- The TiG framework is competitive with traditional reinforcement learning methods while substantially reducing data and compute requirements [17].
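GRPO, as described in Group 2, samples a group of responses per prompt, standardizes their rewards into group-relative advantages, and applies a clipped policy-gradient update with a penalty that keeps the policy near a reference model. A simplified, sequence-level sketch (shapes, hyperparameters, and the KL estimator follow common open-source practice, not necessarily this paper's exact implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Group-relative policy optimization for one prompt.

    logp_new / logp_old / logp_ref: summed log-probs of each sampled response
    under the current, behavior, and reference policies, shape (G,).
    rewards: scalar reward for each of the G sampled responses, shape (G,).
    """
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the new/old probability ratio.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Penalty term that discourages drifting away from the reference model.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    return -(surrogate - kl_coef * kl).mean()
```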
Still Grinding on End-to-End Models? Embodied-R1 Takes a Different Path: SOTA Performance via "Pointing" + Reinforcement Learning!
具身智能之心· 2025-09-02 00:03
Core Insights
- The article discusses Embodied-R1, a new model designed to bridge the "seeing-to-doing gap" in robotics, a long-standing challenge in the field [2][32].
- The model introduces a novel intermediate representation called "pointing," which translates complex operational instructions into visual points, enhancing the robot's ability to understand and execute tasks [3][10].
Group 1: Challenges in Robotics
- The "seeing-to-doing gap" is primarily caused by data scarcity and morphological heterogeneity, which hinder effective knowledge transfer in robotics [2].
- Existing vision-language-action (VLA) models struggle in new environments, often losing zero-shot operational capabilities [2][10].
Group 2: Embodied-R1 Model Overview
- Embodied-R1 is a 3-billion-parameter model that uses "pointing" as an intuitive intermediate representation, defining four key capabilities: REG (representational understanding), RRG (spatial region pointing), OFG (functional part pointing), and VTG (visual trajectory generation) [10][12].
- The model demonstrates superior performance across 11 spatial reasoning and pointing benchmarks, achieving a 56.2% success rate in the SimplerEnv simulation and 87.5% across eight real-world tasks without fine-tuning [10][27].
Group 3: Training Methodology
- The model follows a two-phase training curriculum, focusing first on spatial reasoning and then on embodied pointing, using a dataset of 200,000 samples [15][16].
- Reinforcement fine-tuning (RFT) is introduced to address the "multi-solution dilemma" in pointing tasks, letting the model develop a generalized understanding rather than memorizing specific answers (see the reward sketch below) [17][19].
Group 4: Performance Metrics
- Embodied-R1 outperforms other models across benchmarks, achieving state-of-the-art (SOTA) results on REG, RRG, OFG, and VTG tasks [29][30].
- The model's trajectory generation quality is the best among all compared models, which is crucial for reliable robot execution [29].
Group 5: Robustness and Adaptability
- The model exhibits strong robustness to visual disturbances, maintaining performance under challenging conditions such as poor lighting and background changes [31].
- This adaptability is attributed to the "pointing" representation, which enhances the robustness of the robot's policy [31].
Group 6: Conclusion
- Embodied-R1 marks a significant step toward closing the long-standing "seeing-to-doing gap" in robotics, offering a promising path to more capable and generalizable embodied AI systems [32].
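The RFT mentioned in Group 3 works because pointing admits a verifiable reward that accepts any of the many correct answers: any point inside the target region scores, so the model cannot simply memorize one coordinate. A hedged sketch of such a reward (the mask-based check and the penalty values are assumptions, not the paper's exact reward):

```python
def pointing_reward(pred_point, target_mask, format_ok=True):
    """Return 1.0 if the predicted (x, y) point lands inside the target region.

    pred_point:  (x, y) pixel coordinates predicted by the model.
    target_mask: boolean H x W array marking every acceptable pixel, so any of
                 the many valid answers earns full reward.
    format_ok:   whether the model's output parsed into a valid point at all.
    """
    if not format_ok:
        return -1.0                                   # small penalty for malformed outputs
    x, y = int(round(pred_point[0])), int(round(pred_point[1]))
    h, w = target_mask.shape
    if 0 <= y < h and 0 <= x < w and bool(target_mask[y, x]):
        return 1.0
    return 0.0
```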
Why Is the Industry So Obsessed with Reinforcement Learning?
自动驾驶之心· 2025-07-13 13:18
Core Viewpoint
- The article discusses a significant research paper that explores the effectiveness of reinforcement learning (RL) compared to supervised fine-tuning (SFT) in training AI models, particularly focusing on the concept of generalization and transferability of knowledge across different tasks [1][5][14].
Group 1: Training Methods
- There are two primary methods for training AI models: imitation (SFT) and exploration (RL) [2][3].
- Imitation learning involves training models to replicate data, while exploration allows models to discover solutions independently, assuming they have a non-random chance of solving problems [3][6].
Group 2: Generalization and Transferability
- The core of the research is the concept of generalization, where SFT may hinder the ability to adapt known knowledge to unknown domains, while RL promotes better transferability [5][7].
- A Transferability Index (TI) was introduced to measure the ability to transfer skills across tasks, revealing that RL-trained models showed positive transfer in various reasoning tasks, while SFT models often exhibited negative transfer in non-reasoning tasks [7][8].
Group 3: Experimental Findings
- The study conducted rigorous experiments comparing RL and SFT models, finding that RL models improved performance in unrelated fields, while SFT models declined in non-mathematical areas despite performing well in mathematical tasks [10][14].
- The results indicated that RL models maintained a more stable internal knowledge structure, allowing them to adapt better to new domains without losing foundational knowledge [10][14].
Group 4: Implications for AI Development
- The findings suggest that while imitation learning has been a preferred method, reinforcement learning offers a promising approach for developing intelligent systems capable of generalizing knowledge across various fields [14][15].
- The research emphasizes that true intelligence in AI involves the ability to apply learned concepts to new situations, akin to human learning processes [14][15].
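The Transferability Index in Group 2 is, in spirit, a measure of how much a gain on the trained domain carries over to other tasks. The summary does not reproduce the paper's exact formula, so the sketch below is one plausible formulation under that assumption: the relative gain on an out-of-domain task normalized by the relative gain on the trained task, positive for helpful transfer and negative for harmful transfer.

```python
def transferability_index(base_target, tuned_target, base_other, tuned_other):
    """Illustrative TI: out-of-domain relative gain normalized by in-domain gain.

    Positive TI means fine-tuning on the target domain (e.g. math) also helped
    the other task; negative TI means it hurt it. The paper's exact definition
    may differ from this sketch.
    """
    target_gain = (tuned_target - base_target) / max(base_target, 1e-8)
    other_gain = (tuned_other - base_other) / max(base_other, 1e-8)
    if abs(target_gain) < 1e-8:
        return 0.0
    return other_gain / target_gain
```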
Doukou Gynecology Model Breaks Through Again: DingTalk Industry Training Platform + Precisely Labeled SFT Data Lift Accuracy from 77.1% to 90.2%
Tai Mei Ti APP· 2025-07-10 07:49
Core Insights
- The article discusses the limitations of general-purpose large language models in clinical scenarios, particularly in providing accurate medical diagnoses, highlighting the need for specialized training methods such as supervised fine-tuning (SFT) [1][2][3].
- The Doukou Gynecology model's accuracy improved significantly, from an initial 77.1% to 90.2%, through targeted SFT [1][3].
Data Quality Control
- The training dataset underwent a rigorous selection process involving systematic data cleaning, checks for consistency between reasoning and conclusions, and verification of the data's logical integrity [2][5].
- Low-quality data, such as samples with clear medical inconsistencies, were excluded to maintain high standards [2].
Model Training Phases
- The first phase built a foundational SFT model from 1,300 meticulously labeled gynecological consultation records, reaching an initial accuracy of 77.1% [3].
- The second phase focused on synthesizing symptom data and refining the model, yielding a final diagnostic accuracy of 90.2% across six major gynecological symptoms [3][6].
Iterative Optimization
- Continuous iterative optimization was implemented: high-quality samples scoring above 8 were added to the training set for further SFT, creating a cycle of training, evaluation, and retraining (sketched below) [10][18].
- Key performance indicators were monitored throughout the process to ensure comprehensive model improvement [10].
Evaluation System
- A dual evaluation system was established, combining automated assessments with manual reviews by medical experts to ensure diagnostic accuracy [11][13].
- The automated evaluation used a high-performance language model to score outputs objectively against a structured rubric [11].
Challenges and Lessons Learned
- Early reliance on manual labeling slowed data accumulation and increased costs, prompting a shift to a more efficient "machine distillation → expert review → post-training evaluation" pipeline [14][15].
- The model's ability to recognize rare diseases was improved through balanced sampling strategies [15].
Future Directions
- The company plans to explore a collaborative training paradigm combining SFT and reinforcement learning (RL) to strengthen clinical reasoning [18].
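The train-evaluate-retrain cycle described under Iterative Optimization amounts to a simple data flywheel: generate candidate answers, score them with an automated judge, and fold only the highest-scoring samples back into the SFT set. In the hedged sketch below, `model`, `judge`, and their methods are hypothetical interfaces; only the score-above-8 filter comes from the article:

```python
def sft_flywheel(model, judge, seed_dataset, unlabeled_cases, rounds=3, threshold=8):
    """Iteratively grow the SFT set with judge-approved model outputs."""
    train_set = list(seed_dataset)
    for _ in range(rounds):
        model.finetune(train_set)              # supervised fine-tuning on the current set
        for case in unlabeled_cases:
            answer = model.generate(case)      # candidate answer ("machine distillation")
            score = judge.score(case, answer)  # automated rubric scoring; expert review would gate this in practice
            if score > threshold:              # keep only samples scoring above 8
                train_set.append((case, answer))
    return model, train_set
```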
Can Grinding Math Problems Actually Hurt Large Models? CMU Evaluates 20+ Models and Flags a Training Trap
量子位· 2025-07-07 06:13
Core Viewpoint
- The article discusses the relationship between mathematical reasoning capabilities of large language models (LLMs) and their ability to transfer these skills to other tasks, highlighting that models trained with reinforcement learning (RL) show better transferability compared to those trained with supervised fine-tuning (SFT) [4][11].
Group 1: Mathematical Reasoning and Transferability
- Research indicates that only models trained with RL can effectively transfer mathematical reasoning skills to other tasks, while SFT models show limited or no transfer [4][11].
- A Transferability Index (TI) is introduced to quantify the extent to which improvements in mathematical reasoning can be applied to other reasoning and non-reasoning tasks [8][9].
- If TI is greater than 0, it indicates a positive transfer effect to other tasks; if less than 0, it indicates negative transfer [9].
Group 2: Experimental Findings
- The study evaluated over 20 models across various tasks, including mathematical reasoning, other reasoning tasks (like medical reasoning), and non-reasoning tasks (like common-sense dialogue) [7].
- Results show that models fine-tuned with RL consistently achieve higher transferability metrics across reasoning and non-reasoning tasks, while SFT models often experience negative transfer in non-reasoning tasks [11].
Group 3: Model Representation and Performance
- PCA analysis reveals that RL fine-tuned models exhibit minimal shifts in representation space, indicating they retain previously learned knowledge while enhancing performance in specific domains [15].
- RL models demonstrate lower KL divergence in reasoning and non-reasoning tasks compared to SFT models, suggesting more stable and precise representation updates [16][18].
- The findings suggest that RL is crucial for achieving transferable reasoning capabilities in LLMs, marking another victory for reinforcement learning in this context [19].
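The drift measurements in Group 3 compare each fine-tuned model against its base model; one common way to quantify such drift is the average per-token KL divergence between the two models' next-token distributions on a shared prompt set. A generic measurement sketch assuming Hugging Face-style causal LMs, not the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kl(base_model, tuned_model, input_ids):
    """Average KL(tuned || base) over next-token distributions for a prompt batch."""
    base_logp = F.log_softmax(base_model(input_ids=input_ids).logits, dim=-1)
    tuned_logp = F.log_softmax(tuned_model(input_ids=input_ids).logits, dim=-1)
    # Per-token KL(tuned || base), summed over the vocabulary, then averaged.
    kl = F.kl_div(base_logp, tuned_logp, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()
```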
Single-Stage LLM Fine-Tuning with Simultaneous Supervision and Reinforcement: Goodbye to "Memorize First, Then Drill," with Gains in Both Reasoning and Generalization | Chinese Academy of Sciences, Meituan, et al.
量子位· 2025-07-02 02:02
Core Viewpoint
- The article introduces the Supervised Reinforcement Fine-Tuning (SRFT) method, which combines supervised fine-tuning (SFT) and reinforcement learning (RL) in a single-stage approach to enhance the reasoning performance of large language models (LLMs) [1][22].
Group 1: Methodology
- SRFT employs a dual strategy design to effectively utilize demonstration data, incorporating both SFT for coarse-grained behavior policy approximation and RL for fine-grained policy refinement [23][24].
- The method introduces an entropy-aware adaptive weighting mechanism to balance the influence of SFT and RL, ensuring stable training dynamics [29][44].
- SRFT achieves a significant improvement in training efficiency, speeding up the process by 2.28 times compared to traditional sequential methods [21][44].
Group 2: Performance Results
- SRFT demonstrates an average accuracy of 59.1% across five mathematical reasoning tasks, outperforming the zero-RL baseline by 9.0% [4][47].
- In out-of-distribution tasks, SRFT achieves an average accuracy of 62.5%, surpassing the best baseline by 10.9% [4][47].
- The method shows superior generalization capabilities, with consistent performance improvements across various benchmarks [47][48].
Group 3: Training Dynamics
- The training dynamics of SRFT reveal a more stable and efficient learning process, with a gradual increase in response length indicating a deeper reasoning process [48].
- SRFT maintains a more stable entropy during training, allowing for continued exploration, unlike pure RL which exhibits rapid entropy decline [20][48].
- The analysis of training trajectories indicates that SRFT effectively balances knowledge acquisition and self-exploration without excessive deviation from the initial model [15][45].
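The single-stage combination in Group 1 boils down to optimizing one weighted sum of a supervised loss on demonstrations and an RL loss on self-sampled rollouts, with the weight driven by the policy's entropy. The summary does not give the exact weighting, so the sketch below only shows the shape of such an objective; the schedule mapping entropy to a weight is an assumption:

```python
import torch

def entropy_weight(policy_entropy, max_entropy):
    """Map policy entropy to a blending weight in [0, 1].

    Illustrative schedule only: SRFT ties the weight to entropy, but the exact
    functional form (and its direction) is not reproduced here.
    """
    return torch.clamp(policy_entropy / max_entropy, 0.0, 1.0)

def srft_style_loss(sft_loss, rl_loss, w_sft):
    """Single-stage objective: one weighted sum of the SFT and RL losses,
    optimized jointly instead of in separate 'SFT first, RL later' stages."""
    return w_sft * sft_loss + (1.0 - w_sft) * rl_loss
```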
Is SFT Doing More Harm Than Good? New Research: Going Straight to Reinforcement Learning Raises the Ceiling on Multimodal Reasoning
机器之心· 2025-06-01 03:30
Core Insights
- The article discusses the limitations of the "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" paradigm for developing large vision-language models (LVLMs), arguing that SFT may hinder learning and produce superficial reasoning paths, while RL promotes genuine multimodal reasoning [3][11][21].
Group 1: Research Findings
- A study from the University of California, Santa Cruz, and the University of Texas at Dallas finds that SFT can obstruct learning, often yielding "pseudo-reasoning paths" that lack depth [3][11].
- The research team built the VLAA-Thinking dataset to systematically investigate the roles of SFT and RL in multimodal reasoning, highlighting the distinct contribution of each method [4][8].
- The findings indicate that while SFT improves performance on standard tasks, it falls short on complex reasoning, causing a 47% relative performance decline in a 7B model [11][13].
Group 2: Data and Methodology
- The VLAA-Thinking dataset comprises 203,182 samples, with 126,413 for SFT and 25,195 for RL, designed to provide high-quality reasoning chains [5][6].
- The research used a six-stage data processing pipeline to transfer reasoning capabilities from text-only models to LVLMs [6][8].
- A mixed reward function was designed within the GRPO framework to optimize RL in visual contexts, combining different reward types for different problem categories (a hedged sketch follows this summary) [8][19].
Group 3: Performance Analysis
- The study finds that SFT's imitative reasoning patterns can limit the exploration space during the RL phase, suggesting that learning directly from reward signals is more effective [15][26].
- Models trained with GRPO alone outperform those that first underwent SFT, with the VLAA-Thinker-Qwen2.5-VL-3B model ranking first on the Open LMM reasoning leaderboard for 4B-scale models, improving the previous record by 1.8% [15][31].
- The analysis reveals that response length and reward scores do not correlate strongly with performance, challenging previous assumptions about their relationship [24][26].
Group 4: Implications for Future Research
- The findings suggest that SFT is currently incompatible with GRPO for multimodal reasoning, potentially damaging the performance of both base and instruction-tuned LVLMs [21][22].
- The research emphasizes the need for high-quality instruction tuning in RL settings: better instruction tuning leads to stronger reasoning capabilities after RL training [31].
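The mixed reward in Group 2 can be read as a dispatcher: a format check plus a correctness check whose verifier depends on the problem category. A hedged sketch along those lines; the category names, weights, and the \boxed{...} convention are assumptions for illustration, not the paper's exact reward design:

```python
import re

def mixed_reward(problem_type, response, reference):
    """Combine a format reward with a category-specific accuracy reward."""
    # Format reward: the response must expose its final answer as \boxed{...}.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    format_reward = 0.5 if match else 0.0
    if match is None:
        return format_reward

    answer = match.group(1).strip()
    if problem_type == "math":
        accuracy = 1.0 if answer == str(reference).strip() else 0.0        # exact-match verifier
    elif problem_type == "multiple_choice":
        accuracy = 1.0 if answer.upper() == str(reference).upper() else 0.0
    else:                                                                  # open-ended: soft token overlap as a stand-in verifier
        ref_tokens = set(str(reference).lower().split())
        ans_tokens = set(answer.lower().split())
        accuracy = len(ref_tokens & ans_tokens) / max(len(ref_tokens), 1)
    return format_reward + accuracy
```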