DAPO - filings, earnings calls, financial reports, news

DAPO


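NeurIPS 2025 high-scoring paper | Strengthening reasoning LLMs with discriminative supervised learning to solve difficulty bias and entropy collapse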

机器之心· 2025-10-26 07:00

Core Insights
- The article introduces Discriminative Constrained Optimization (DisCO), a framework for improving large reasoning models (LRMs) that addresses inherent limitations of Group Relative Policy Optimization (GRPO), particularly in binary reward settings [3][4][6][32].

Summary by Sections

Introduction to DisCO
- DisCO is proposed as a solution to the difficulty bias and entropy instability found in GRPO and its variants, and it allows advanced discriminative learning techniques to be applied to data imbalance problems [4][6][32].

Advantages of DisCO
- DisCO significantly outperforms GRPO and its improved versions, achieving an average gain of 7% over GRPO and 6% over DAPO across six benchmark tasks with a 1.5-billion-parameter model [4][22].
- Notably, DisCO with a maximum response length of 8k outperforms GRPO with a maximum response length of 32k [4].

Methodology
- The framework eliminates difficulty bias by adopting a discriminative optimization objective, which maximizes the score of correct answers while minimizing that of incorrect ones [6][11].
- It employs non-clipped scoring functions and a constrained optimization approach to stabilize training dynamics and address entropy instability [6][19][28].

Experimental Results
- DisCO consistently demonstrates superior performance across various models, including a 3.5% improvement over GRPO in 7-billion-parameter experiments [22].
- The training dynamics of DisCO show a steady increase in training rewards and stable generation entropy, in contrast to the instability observed in GRPO and its variants [27][28].

Ablation Studies
- The analysis of individual components within DisCO shows that each contributes significantly to overall performance, with the non-clipped scoring functions being particularly critical [30].

Future Directions
- While the current focus is on binary rewards, the authors suggest that future research could apply DisCO to non-binary reward scenarios, potentially using novel scoring functions from supervised learning [32].
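The discriminative objective and the constraint described in the Methodology section can be illustrated with a small sketch. This is not the paper's implementation: the use of a per-response log-likelihood as the score, the squared-hinge form of the penalty, and the names `discriminative_objective`, `constraint_penalty`, `delta`, and `beta` are all assumptions made for illustration.

```python
def discriminative_objective(scores, rewards):
    """Mean score of correct responses minus mean score of incorrect ones.

    scores:  one unclipped score per sampled response (assumed here to be a
             log-likelihood under the current policy).
    rewards: binary rewards, 1 for a correct answer and 0 for an incorrect one.
    """
    pos = [s for s, r in zip(scores, rewards) if r == 1]
    neg = [s for s, r in zip(scores, rewards) if r == 0]
    if not pos or not neg:
        # No contrast between correct and incorrect answers for this prompt.
        return 0.0
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def constraint_penalty(kl, delta, beta):
    """Hypothetical squared-hinge penalty: zero while kl <= delta."""
    return beta * max(0.0, kl - delta) ** 2


# Four sampled answers to one question, two correct and two incorrect.
scores = [-0.5, -2.0, -1.0, -3.0]
rewards = [1, 0, 1, 0]
objective = discriminative_objective(scores, rewards)
loss = -objective + constraint_penalty(kl=1.5, delta=1.0, beta=10.0)
```

Because the objective averages over correct and incorrect answers separately, its scale does not depend on how many answers in the group happen to be correct, which is one way to read the claim about removing difficulty bias.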

Explainer: Deconstructing LLM post-training, the past and present of GRPO and its successors

36Kr · 2025-09-01 04:38

Group 1
- The article traces the evolution of post-training methods for large language models, focusing on the GRPO algorithm as a significant advance in reinforcement learning paradigms [2][46].
- GRPO has emerged as a general-purpose reinforcement learning algorithm applicable to a wide range of post-training tasks, with notable improvements over previous methods like PPO [2][48].
- The article discusses the importance of post-training in enhancing the adaptability and flexibility of models, addressing the limitations of pre-training alone [5][46].

Group 2
- The article highlights the transition from PPO to GRPO, emphasizing the reduction in computational cost and memory requirements that makes GRPO a more efficient alternative [18][14].
- GRPO estimates advantages against a baseline computed from the rewards of a group of responses sampled for the same prompt, eliminating the need for a separate value function [16][14].
- Despite its advantages, GRPO still faces stability issues, prompting the development of improved algorithms such as DAPO and GSPO [19][48].

Group 3
- DAPO, developed by ByteDance and Tsinghua AIR, builds on GRPO by introducing enhancements such as Clip-Higher and dynamic sampling to improve training efficiency [20][21].
- GSPO shifts the focus from token-level to sequence-level importance sampling, which enhances training stability [28][30].
- GFPO addresses limitations of GRPO by allowing the simultaneous optimization of multiple response attributes, improving overall model performance [33][34].
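As a rough illustration of how GRPO can do without a value function, the sketch below scores each response relative to the other responses sampled for the same prompt. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions for illustration; real implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each response relative to its own sampling group.

    rewards: scalar rewards for several responses to the same prompt.
    The group mean plays the role of the baseline that PPO would get
    from a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response gets the same reward, all advantages are zero and the
# group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```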


Explainer: Deconstructing LLM post-training, the past and present of GRPO and its successors

机器之心· 2025-09-01 02:49

Core Viewpoint
- The article discusses the evolution and significance of the Group Relative Policy Optimization (GRPO) algorithm in the context of large language models and reinforcement learning, highlighting its advantages and limitations compared to previous methods like Proximal Policy Optimization (PPO) [4][38].

Summary by Sections

Development of Large Language Models
- The rapid advancement of large language models has led to the emergence of various post-training methods, with GRPO being a notable innovation that enhances reinforcement learning paradigms [3][5].

Post-Training and Reinforcement Learning
- Post-training is crucial for refining models' capabilities in specific domains, enhancing adaptability and flexibility to meet diverse application needs [12][11].
- Reinforcement learning, particularly reinforcement learning from human feedback (RLHF), plays a vital role in the post-training phase, aiming to optimize model outputs based on user preferences [14][19].

GRPO and Its Advantages
- GRPO eliminates the need for a separate critic model, significantly reducing memory and computational costs compared to PPO, which requires both a policy network and a value network [30][35].
- GRPO uses the rewards of a group of responses sampled for the same prompt as the baseline for evaluating each response, which simplifies the training process [34][35].

Comparison of GRPO and PPO
- GRPO offers substantial improvements in memory requirements and training speed, making it a more efficient choice for large language model training [37].
- Despite its advantages, GRPO still faces stability issues similar to those of PPO, particularly in smaller-scale reinforcement learning tasks [39].

Recent Innovations: DAPO, GSPO, and GFPO
- DAPO introduces enhancements to GRPO, such as Clip-Higher and dynamic sampling, to address practical challenges encountered during training [41][42].
- GSPO shifts the focus from token-level to sequence-level importance sampling, significantly improving training stability [48][49].
- GFPO allows for simultaneous optimization of multiple response attributes, addressing limitations of GRPO related to scalar feedback and multi-round reasoning tasks [61][63].

Conclusion
- The evolution of post-training methods, from PPO to GRPO and beyond, shows a clear trajectory in optimizing large language models, with GRPO serving as a pivotal point for further advances in the field [81][82].
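The difference between token-level and sequence-level importance sampling mentioned for GSPO can be sketched as follows. The length-normalised form used here (the geometric mean of the per-token ratios) is written from the general description of GSPO and should be read as an assumption; the function names are hypothetical.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token, as in token-level methods."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """A single ratio per response: exp of the mean per-token log-ratio."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Per-token log-probabilities of one response under the new and old policy.
new_logps = [-1.0, -2.0, -0.5]
old_logps = [-1.2, -1.5, -0.6]

per_token = token_level_ratios(new_logps, old_logps)
per_sequence = sequence_level_ratio(new_logps, old_logps)
```

A single outlier token can push one token-level ratio far from 1, while the sequence-level ratio averages it with the rest of the response, which is one intuition for the stability claim.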


Making reinforcement learning lightning-fast: FlashRL delivers ultra-fast rollouts with a single command, now fully open source

机器之心· 2025-08-12 09:51

Core Viewpoint
- The article discusses the development and implementation of FlashRL, an open-source reinforcement learning solution that uses quantized rollouts without sacrificing downstream performance, addressing the rollout-training mismatch through Truncated Importance Sampling (TIS) [4][16][37].

Group 1: DAPO and Rollout Challenges
- DAPO, developed by Tsinghua AIR and ByteDance, is an open-source SOTA system for large-scale LLM reinforcement learning, achieving a score of 50 on the AIME 2024 benchmark with the Qwen2.5-32B model [1].
- The research team identified rollout generation as a major bottleneck in reinforcement learning training, consuming approximately 70% of total training time [3].
- Applying 8-bit quantization during rollout generation, combined with TIS, significantly accelerates the process while maintaining downstream performance [3][4].

Group 2: FlashRL Implementation
- FlashRL is the first open-source reinforcement learning implementation that applies INT8/FP8 during the rollout phase while matching BF16 in downstream performance [4][15].
- TIS mitigates the rollout-training mismatch, allowing quantized rollout training to reach performance comparable to BF16 rollout training, and even to surpass naive BF16 rollout training [16][37].
- FlashRL supports online quantization and has been integrated with existing inference engines such as vLLM, extending them to handle models whose parameters are updated during training [22].

Group 3: Performance and Acceleration
- FlashRL's INT8 rollout can provide up to a 1.7x throughput improvement while retaining the advantages of reinforcement learning [23].
- In standard environments, the acceleration from 8-bit quantization is more pronounced in larger models, with a speedup of up to 1.75x for the 32B model compared to BF16 [29].
- In memory-constrained environments, INT8 quantization can yield a more than 3x speedup in generation, highlighting its potential for larger models [34].

Group 4: Validation and Usage
- The effectiveness of FlashRL was validated by training the DAPO-32B model, demonstrating that INT8 rollout significantly improves training speed without compromising accuracy on the AIME benchmark [36][37].
- FlashRL can be enabled with a single command, allowing users to integrate it into their RL training without code modifications [41].
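The idea behind Truncated Importance Sampling can be illustrated in a few lines: when the rollout engine (for example, a quantized model) and the training model assign different probabilities to the same token, the gradient term is reweighted by their ratio, and that ratio is capped. The sketch below is not FlashRL's code; the function names and the cap value of 2.0 are assumptions chosen for illustration.

```python
import math


def tis_weight(train_logp, rollout_logp, cap=2.0):
    """Importance ratio pi_train / pi_rollout for one token, truncated at `cap`."""
    return min(math.exp(train_logp - rollout_logp), cap)


def tis_weighted_advantages(train_logps, rollout_logps, advantages, cap=2.0):
    """Scale each token's advantage by its truncated importance weight."""
    return [
        tis_weight(t, r, cap) * a
        for t, r, a in zip(train_logps, rollout_logps, advantages)
    ]


# Token log-probabilities from the training model and the rollout engine.
train_logps = [-1.0, 0.0, -2.0]
rollout_logps = [-1.0, -5.0, -1.0]
weighted = tis_weighted_advantages(train_logps, rollout_logps, [1.0, 1.0, 1.0])
```

The second token shows why truncation matters: its raw ratio is exp(5), roughly 148, which would dominate the update if it were not capped.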


Is the GRPO used by DeepSeek really that special? A long-form analysis of four notable papers

机器之心· 2025-05-24 03:13

Core Insights
- The article discusses recent advances in reasoning models, focusing on GRPO and its improved variants and highlighting the rapid evolution of reinforcement learning for reasoning [1][2][3].

Group 1: Key Papers and Models
- Kimi k1.5 is a newly released reasoning model that employs reinforcement learning techniques and emphasizes long context extension and improved policy optimization [10][17].
- OpenReasonerZero is the first complete reproduction of reinforcement learning training on a base model, showing significant results [34][36].
- DAPO explores improvements to GRPO to better adapt it to reasoning training, presenting a large-scale open-source LLM reinforcement learning system [48][54].

Group 2: GRPO and Its Characteristics
- GRPO is closely related to PPO (Proximal Policy Optimization) and shares similarities with RLOO (REINFORCE Leave One Out); the article also notes that many leading research works do not use GRPO [11][12][9].
- The core takeaway is that current RL algorithms are highly similar in implementation, with GRPO being popular but not fundamentally revolutionary [15][6].
- GRPO includes clever modifications aimed specifically at reasoning training rather than traditional RLHF scenarios, focusing on generating multiple answers for reasoning tasks [13][12].

Group 3: Training Techniques and Strategies
- Kimi k1.5's training involves supervised fine-tuning (SFT) and emphasizes behavior patterns such as planning, evaluation, reflection, and exploration [23][24].
- The training methods include a sequencing strategy that starts with simpler tasks and gradually increases complexity, akin to human learning [27][28].
- The paper discusses the importance of data distribution and prompt quality in ensuring effective reinforcement learning [22][41].

Group 4: DAPO Improvements
- DAPO introduces two distinct clipping hyperparameters to improve learning dynamics and efficiency [54][60].
- It also uses dynamic sampling, removing samples with flat rewards from the batch to improve learning speed [63].
- A token-level loss, rather than a per-response loss, is proposed to better manage learning dynamics and avoid issues with long responses [64][66].

Group 5: Dr. GRPO Modifications
- Dr. GRPO aims to improve learning dynamics by modifying GRPO to achieve stronger performance with shorter generated lengths [76][79].
- The modifications include normalizing advantages across all tokens in a response, which helps in managing the learning signal effectively [80][81].
- The paper highlights the importance of high-quality data engineering in absorbing the effects of these changes, emphasizing the need for a balanced distribution of problem difficulty [82][89].
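The three DAPO changes listed in Group 4 can each be sketched in a few lines. This is a simplified illustration rather than DAPO's implementation: the function names are hypothetical, and the default values `eps_low=0.2` and `eps_high=0.28` are used here only as example settings.

```python
def decoupled_clip_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with separate lower and upper clip ranges."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards):
    """Dynamic sampling filter: drop groups whose rewards are all identical."""
    return len(set(rewards)) > 1


def token_level_mean(per_token_terms):
    """Average over all tokens in the batch, so long responses weigh more."""
    flat = [t for response in per_token_terms for t in response]
    return sum(flat) / len(flat)


def per_response_mean(per_token_terms):
    """Average each response first, then average the responses."""
    means = [sum(r) / len(r) for r in per_token_terms]
    return sum(means) / len(means)


# One short response (1 token) and one long response (3 tokens).
terms = [[1.0], [0.0, 0.0, 0.0]]
token_avg = token_level_mean(terms)        # every token counts equally
response_avg = per_response_mean(terms)    # every response counts equally
```

With the per-response average, each token of the long response carries a third of the weight of the single token in the short one; the token-level average removes that dilution, which is the issue with long responses that the summary refers to.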