Supervised Fine-Tuning (SFT)
Why Is the Industry So Obsessed with Reinforcement Learning?
自动驾驶之心· 2025-07-13 13:18
Core Viewpoint
- The article discusses a significant research paper that compares the effectiveness of reinforcement learning (RL) with supervised fine-tuning (SFT) for training AI models, focusing on generalization and the transferability of knowledge across tasks [1][5][14].

Group 1: Training Methods
- There are two primary methods for training AI models: imitation (SFT) and exploration (RL) [2][3].
- Imitation learning trains models to replicate data, while exploration lets models discover solutions independently, assuming they have a non-random chance of solving problems (a toy sketch contrasting these two signals appears after this summary) [3][6].

Group 2: Generalization and Transferability
- The core of the research is generalization: SFT may hinder the ability to adapt known knowledge to unknown domains, while RL promotes better transferability [5][7].
- A Transferability Index (TI) was introduced to measure the ability to transfer skills across tasks, revealing that RL-trained models showed positive transfer on various reasoning tasks, while SFT models often exhibited negative transfer on non-reasoning tasks [7][8].

Group 3: Experimental Findings
- The study ran rigorous experiments comparing RL and SFT models, finding that RL models improved performance in unrelated fields, while SFT models declined in non-mathematical areas despite performing well on mathematical tasks [10][14].
- The results indicated that RL models maintained a more stable internal knowledge structure, allowing them to adapt to new domains without losing foundational knowledge [10][14].

Group 4: Implications for AI Development
- The findings suggest that while imitation learning has been the preferred method, reinforcement learning offers a promising approach for developing intelligent systems capable of generalizing knowledge across fields [14][15].
- The research emphasizes that true intelligence in AI involves applying learned concepts to new situations, akin to human learning [14][15].
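To make the imitation-versus-exploration contrast concrete, here is a toy sketch that is not drawn from the paper: an SFT-style update pushes probability mass directly onto a demonstrated answer, while a REINFORCE-style RL update weights the log-probability gradient of whatever the model sampled by a scalar reward. The four-way answer space, learning rate, and reward definition are all illustrative assumptions.

```python
# Toy contrast between the two training signals described above, not the
# paper's implementation: "imitation" (SFT) maximizes the log-likelihood of a
# demonstrated answer, while "exploration" (RL) only sees a scalar reward for
# whatever the model samples itself.
import numpy as np

rng = np.random.default_rng(0)
CORRECT = 2                               # hypothetical index of the correct answer

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def sft_step(logits, lr=0.5):
    """Imitation: push probability mass directly onto the demonstrated answer."""
    probs = softmax(logits)
    grad = -probs
    grad[CORRECT] += 1.0                  # gradient of log p(correct) w.r.t. logits
    return logits + lr * grad

def rl_step(logits, lr=0.5):
    """Exploration: sample an answer, get reward 1 only if it happens to be correct."""
    probs = softmax(logits)
    a = rng.choice(len(probs), p=probs)
    reward = 1.0 if a == CORRECT else 0.0
    grad = -probs
    grad[a] += 1.0                        # gradient of log p(a) w.r.t. logits
    return logits + lr * reward * grad    # REINFORCE: reward-weighted log-prob gradient

logits_sft = np.zeros(4)                  # toy "policy" over 4 candidate answers
logits_rl = np.zeros(4)
for _ in range(50):
    logits_sft = sft_step(logits_sft)
    logits_rl = rl_step(logits_rl)
print("after 50 imitation steps:  ", np.round(softmax(logits_sft), 2))
print("after 50 exploration steps:", np.round(softmax(logits_rl), 2))
```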
Another Breakthrough for the Doukou Gynecology Large Model: DingTalk's Industry Training Platform + Precisely Labeled SFT Data Lift Accuracy from 77.1% to 90.2%
Tai Mei Ti APP· 2025-07-10 07:49
Core Insights
- The article discusses the limitations of general large language models in clinical scenarios, particularly in providing accurate medical diagnoses, highlighting the need for specialized training methods such as supervised fine-tuning (SFT) [1][2][3].
- The performance of the Doukou Gynecology model improved significantly, from an initial accuracy of 77.1% to 90.2%, through targeted SFT [1][3].

Data Quality Control
- The training dataset underwent a rigorous selection process involving systematic data cleaning, checks for consistency between reasoning and results, and verification of the logical integrity of the data [2][5].
- Low-quality data, such as samples with clear medical inconsistencies, were excluded to maintain high standards [2].

Model Training Phases
- The first phase built a foundational SFT model from 1,300 meticulously labeled gynecological consultation records, achieving an initial accuracy of 77.1% [3].
- The second phase focused on synthesizing symptom data and refining the model, reaching a final diagnostic accuracy of 90.2% across six major gynecological symptoms [3][6].

Iterative Optimization
- Continuous iterative optimization was implemented: high-quality samples scoring above 8 were added to the training set for further SFT, creating a cycle of training, evaluation, and retraining (a minimal sketch of this loop appears after this summary) [10][18].
- Key performance indicators were monitored throughout the process to ensure comprehensive model improvement [10].

Evaluation System
- A dual evaluation system was established, combining automated assessment with manual review by medical experts to ensure diagnostic accuracy [11][13].
- The automated evaluation used a high-performance language model to score outputs objectively against a structured framework [11].

Challenges and Lessons Learned
- Initial reliance on manual labeling slowed data accumulation and increased costs, prompting a shift to a more efficient "machine distillation → expert review → post-training evaluation" pipeline [14][15].
- The model's ability to recognize rare diseases was enhanced through balanced sampling strategies [15].

Future Directions
- The company plans to explore a collaborative training paradigm combining SFT and reinforcement learning (RL) to enhance clinical reasoning capabilities [18].
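The train → evaluate → filter → retrain cycle mentioned under "Iterative Optimization" can be sketched as a simple control loop. This is not Doukou's or DingTalk's actual pipeline: `generate_answers`, `judge_score`, and `run_sft` are hypothetical stand-ins for model inference, the LLM-based automated scorer, and the fine-tuning job, and the 0-10 scale with an acceptance threshold of 8 follows the description above.

```python
# Minimal sketch of the iterative-optimization loop described above, with the
# heavy pieces stubbed out so the control flow runs end to end.
import random
from dataclasses import dataclass

SCORE_THRESHOLD = 8          # samples scoring above 8 are recycled into the SFT set
ROUNDS = 3

@dataclass
class Sample:
    question: str
    answer: str
    score: float = 0.0

def generate_answers(questions):                 # placeholder for model inference
    return [Sample(q, f"draft answer to: {q}") for q in questions]

def judge_score(sample):                         # placeholder for the LLM judge (0-10)
    return random.uniform(5, 10)

def run_sft(train_set):                          # placeholder for a fine-tuning run
    print(f"  SFT on {len(train_set)} samples")

train_set = [Sample("seed question", "expert-written answer", 10.0)]
unlabeled = [f"gyn consultation #{i}" for i in range(20)]

for r in range(ROUNDS):
    print(f"round {r + 1}")
    candidates = generate_answers(unlabeled)
    for s in candidates:
        s.score = judge_score(s)
    accepted = [s for s in candidates if s.score > SCORE_THRESHOLD]
    train_set.extend(accepted)                   # only high-scoring samples are kept
    unlabeled = [q for q in unlabeled if q not in {s.question for s in accepted}]
    run_sft(train_set)
```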
Does Drilling Math Problems Actually Hurt Large Models? CMU Evaluates 20+ Models and Flags a Training Pitfall
量子位· 2025-07-07 06:13
Core Viewpoint
- The article discusses the relationship between the mathematical reasoning capabilities of large language models (LLMs) and their ability to transfer these skills to other tasks, finding that models trained with reinforcement learning (RL) transfer better than those trained with supervised fine-tuning (SFT) [4][11].

Group 1: Mathematical Reasoning and Transferability
- Research indicates that only models trained with RL effectively transfer mathematical reasoning skills to other tasks, while SFT models show limited or no transfer [4][11].
- A Transferability Index (TI) is introduced to quantify how far improvements in mathematical reasoning carry over to other reasoning and non-reasoning tasks (a hedged sketch of such an index appears after this summary) [8][9].
- A TI greater than 0 indicates positive transfer to other tasks; a TI less than 0 indicates negative transfer [9].

Group 2: Experimental Findings
- The study evaluated over 20 models across mathematical reasoning, other reasoning tasks (such as medical reasoning), and non-reasoning tasks (such as common-sense dialogue) [7].
- Results show that RL-fine-tuned models consistently achieve higher transferability metrics across reasoning and non-reasoning tasks, while SFT models often experience negative transfer on non-reasoning tasks [11].

Group 3: Model Representation and Performance
- PCA analysis reveals that RL-fine-tuned models exhibit minimal shifts in representation space, indicating that they retain previously learned knowledge while improving in specific domains [15].
- RL models show lower KL divergence on reasoning and non-reasoning tasks than SFT models, suggesting more stable and precise representation updates [16][18].
- The findings suggest that RL is crucial for achieving transferable reasoning capabilities in LLMs, marking another win for reinforcement learning in this context [19].
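The digest does not reproduce the paper's exact Transferability Index formula, so the sketch below assumes the simplest form consistent with the sign convention above: the relative performance change on a transfer task after math-only fine-tuning. The numbers in the usage example are hypothetical.

```python
# Hedged sketch of a Transferability Index in the spirit described above; the
# actual TI definition from the paper may differ.
def transferability_index(acc_before: float, acc_after: float) -> float:
    """Relative gain on a non-math task after fine-tuning on math data.
    TI > 0: positive transfer; TI < 0: negative transfer."""
    return (acc_after - acc_before) / acc_before

# Hypothetical accuracies, for illustration only.
print(transferability_index(0.62, 0.66))   # RL-tuned model:  ~ +0.06 -> positive transfer
print(transferability_index(0.62, 0.55))   # SFT-tuned model: ~ -0.11 -> negative transfer
```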
Single-Stage LLM Fine-Tuning with Simultaneous Supervision and Reinforcement: No More "Memorize First, Drill Later", with Gains in Both Reasoning and Generalization | CAS & Meituan et al.
量子位· 2025-07-02 02:02
Core Viewpoint
- The article introduces Supervised Reinforcement Fine-Tuning (SRFT), which combines supervised fine-tuning (SFT) and reinforcement learning (RL) in a single-stage approach to enhance the reasoning performance of large language models (LLMs) [1][22].

Group 1: Methodology
- SRFT employs a dual-strategy design to make effective use of demonstration data, using SFT for coarse-grained behavior-policy approximation and RL for fine-grained policy refinement [23][24].
- The method introduces an entropy-aware adaptive weighting mechanism to balance the influence of SFT and RL and keep training dynamics stable (a minimal sketch of such a weighting appears after this summary) [29][44].
- SRFT significantly improves training efficiency, speeding training up by 2.28x compared with traditional sequential methods [21][44].

Group 2: Performance Results
- SRFT achieves an average accuracy of 59.1% across five mathematical reasoning tasks, outperforming the zero-RL baseline by 9.0% [4][47].
- On out-of-distribution tasks, SRFT achieves an average accuracy of 62.5%, surpassing the best baseline by 10.9% [4][47].
- The method shows superior generalization, with consistent performance improvements across benchmarks [47][48].

Group 3: Training Dynamics
- SRFT's training dynamics reveal a more stable and efficient learning process, with a gradual increase in response length indicating deeper reasoning [48].
- SRFT maintains more stable entropy during training, allowing continued exploration, unlike pure RL, whose entropy declines rapidly [20][48].
- Analysis of training trajectories indicates that SRFT balances knowledge acquisition and self-exploration without deviating excessively from the initial model [15][45].
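The entropy-aware adaptive weighting is described only at a high level, so the sketch below is an assumption about one plausible form rather than SRFT's actual mechanism: the SFT term is scaled by the normalized entropy of the policy, so a still-uncertain model leans on demonstrations while a confident model leans on reward-driven refinement.

```python
# Minimal sketch of an entropy-aware weighting between an SFT loss and an RL
# loss; the weighting function and direction are assumptions, not the SRFT spec.
import math

def normalized_entropy(probs):
    """Entropy of a categorical distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def srft_style_loss(sft_loss, rl_loss, token_probs):
    w = normalized_entropy(token_probs)      # adaptive weight in [0, 1]
    return w * sft_loss + (1.0 - w) * rl_loss

# Toy usage: a confident policy shifts weight toward the RL term.
uncertain = [0.25, 0.25, 0.25, 0.25]
confident = [0.94, 0.02, 0.02, 0.02]
print(srft_style_loss(sft_loss=1.2, rl_loss=0.8, token_probs=uncertain))  # 1.2 (all SFT)
print(srft_style_loss(sft_loss=1.2, rl_loss=0.8, token_probs=confident))  # ~0.89 (mostly RL)
```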
Is SFT Doing More Harm Than Good? New Study: Going Straight to Reinforcement Learning Gives a Higher Ceiling for Multimodal Reasoning
机器之心· 2025-06-01 03:30
Core Insights
- The article discusses the limitations of the "supervised fine-tuning (SFT) + reinforcement learning (RL)" paradigm for developing large vision-language models (LVLMs), suggesting that SFT may hinder learning and lead to superficial reasoning paths, while RL promotes genuine multimodal reasoning [3][11][21].

Group 1: Research Findings
- A study from the University of California, Santa Cruz, and the University of Texas at Dallas finds that SFT can obstruct learning, often producing "pseudo-reasoning paths" that lack depth [3][11].
- The research team created the VLAA-Thinking dataset to systematically investigate the roles of SFT and RL in multimodal reasoning, highlighting the distinct contribution of each method [4][8].
- The findings indicate that while SFT improves performance on standard tasks, it falls short on complex reasoning, leading to a 47% relative performance decline in a 7B model [11][13].

Group 2: Data and Methodology
- The VLAA-Thinking dataset comprises 203,182 samples, with 126,413 used for SFT and 25,195 for RL, designed to provide high-quality reasoning chains [5][6].
- The research employed a six-stage data-processing workflow to transfer reasoning capabilities from pure-text models to LVLMs [6][8].
- A mixed reward function was designed within the GRPO framework to optimize RL in visual contexts, combining different reward types for different problem categories (a hedged sketch appears after this summary) [8][19].

Group 3: Performance Analysis
- The study found that SFT's imitative reasoning patterns can limit the exploration space during the RL phase, suggesting that learning directly from reward signals is more effective [15][26].
- Models trained solely with GRPO outperformed those that first underwent SFT; the VLAA-Thinker-Qwen2.5-VL-3B model ranked first on the Open LMM reasoning leaderboard for 4B-class models, beating the previous record by 1.8% [15][31].
- The analysis revealed that response length and reward scores do not correlate significantly with performance, challenging previous assumptions about their relationship [24][26].

Group 4: Implications for Future Research
- The findings suggest that SFT is currently incompatible with GRPO for multimodal reasoning and can hurt the performance of both base and instruction-tuned LVLMs [21][22].
- The research emphasizes the need for high-quality instruction tuning, indicating that better instruction tuning leads to stronger reasoning capabilities after RL training [31].
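The mixed reward function is described only qualitatively, so the following is a hedged sketch of the general idea rather than the VLAA-Thinking implementation: a small format reward for wrapping reasoning and answer in expected tags, plus a correctness reward whose check depends on the problem category. The tag names, categories, and weights are assumptions.

```python
# Sketch of a GRPO-style mixed reward: format component + category-dependent
# correctness component. Illustrative only; not the VLAA-Thinking reward code.
import re

def format_reward(output: str) -> float:
    """1.0 if reasoning and answer are wrapped in the expected tags, else 0.0."""
    ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.S))
    return 1.0 if ok else 0.0

def accuracy_reward(output: str, problem: dict) -> float:
    m = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if not m:
        return 0.0
    pred = m.group(1).strip()
    if problem["type"] == "numeric":                      # e.g. math or counting questions
        try:
            return 1.0 if abs(float(pred) - float(problem["answer"])) < 1e-6 else 0.0
        except ValueError:
            return 0.0
    return 1.0 if pred.lower() == str(problem["answer"]).lower() else 0.0  # e.g. MCQ

def mixed_reward(output: str, problem: dict, w_fmt: float = 0.2) -> float:
    return w_fmt * format_reward(output) + (1.0 - w_fmt) * accuracy_reward(output, problem)

sample = "<think>the bar chart peaks at 42</think><answer>42</answer>"
print(mixed_reward(sample, {"type": "numeric", "answer": 42}))   # 1.0
```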
Industry Breakthrough in Multimodal Generalized Reasoning: OPPO Research Institute & HKUST (Guangzhou) Propose OThink-MR1
量子位· 2025-03-30 02:37
Core Viewpoint
- The article introduces OThink-MR1, a technology developed by researchers from OPPO Research Institute and the Hong Kong University of Science and Technology (Guangzhou) that enhances the generalized reasoning capabilities of multimodal language models through dynamic reinforcement learning [1][2][29].

Group 1: Technology Overview
- OThink-MR1 extends reinforcement learning to multimodal language models, enabling them to better handle complex tasks and new scenarios [1][2].
- The technology addresses the limitations of existing multimodal models that rely primarily on supervised fine-tuning (SFT), which hinders the development of general reasoning abilities [4][5].
- OThink-MR1 has two core components: a dynamic KL divergence strategy (GRPO-D) and a carefully designed reward model, which together significantly improve learning efficiency and reasoning capability [8].

Group 2: Dynamic KL Divergence Strategy
- The dynamic KL divergence strategy balances exploration of new strategies against exploitation of existing experience, adapting as training progresses (a minimal sketch appears after this summary) [10][11].
- This prevents the model from getting stuck in local optima: exploration is encouraged in the early stages, and the emphasis gradually shifts toward leveraging accumulated knowledge [12].

Group 3: Reward Model
- The reward model in OThink-MR1 provides two types of reward, a validation-accuracy reward and a format reward, to guide the model's learning [13][14].
- These rewards help the model understand its strengths and weaknesses, promoting targeted learning [15].

Group 4: Experimental Validation
- The first experiment showed that adding format rewards significantly improved performance on geometric reasoning tasks, highlighting that both content and format matter in evaluation [17].
- The second experiment tested cross-task evaluation, showing that the GRPO-D-trained model excelled on diverse tasks, unlike models trained with SFT [21][23].
- The third experiment showed that OThink-MR1's GRPO-D outperformed traditional SFT in same-task evaluation, indicating its effectiveness in enhancing model capability [28].

Group 5: Future Implications
- OThink-MR1 represents a significant advance for multimodal language models, showcasing the potential of dynamic reinforcement learning to enhance reasoning and generalization [29].
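The dynamic KL divergence strategy (GRPO-D) is described as shifting from exploration early in training toward exploitation later. One way to realize that, sketched below as an assumption rather than OPPO's implementation, is to anneal the KL-penalty coefficient that ties the policy to its reference model from a small value to a larger one over training; the linear schedule and coefficient range are illustrative.

```python
# Minimal sketch of a "dynamic KL" coefficient schedule in the spirit of GRPO-D,
# not the OThink-MR1 implementation: a small KL penalty early makes it cheap to
# explore away from the reference policy, and a larger penalty later favors
# exploiting what has been learned.

def dynamic_kl_coeff(step: int, total_steps: int,
                     beta_min: float = 0.01, beta_max: float = 0.1) -> float:
    progress = min(step / total_steps, 1.0)
    return beta_min + (beta_max - beta_min) * progress

def grpo_d_style_objective(advantage: float, logprob_ratio: float,
                           kl_to_ref: float, step: int, total_steps: int) -> float:
    """Simplified per-sample surrogate (clipping omitted): maximize advantage-weighted
    ratio minus a dynamically weighted KL penalty to the reference model."""
    beta = dynamic_kl_coeff(step, total_steps)
    return advantage * logprob_ratio - beta * kl_to_ref

for step in (0, 500, 1000):
    print(step, round(dynamic_kl_coeff(step, total_steps=1000), 3))
# 0 -> 0.01 (exploration-friendly), 1000 -> 0.1 (exploitation-leaning)
print(grpo_d_style_objective(advantage=1.0, logprob_ratio=1.05, kl_to_ref=0.2,
                             step=0, total_steps=1000))
```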