Is the GRPO Used by DeepSeek Really That Special? A Long-Form Analysis of Four Standout Papers
机器之心·2025-05-24 03:13

Core Insights
- The article surveys recent advances in reasoning models, focusing on GRPO and its improved variants, and on how quickly reinforcement learning for reasoning is evolving [1][2][3].

Group 1: Key Papers and Models
- Kimi k1.5 is a recently released reasoning model trained with reinforcement learning; it emphasizes long-context extension and improved policy optimization [10][17].
- Open-Reasoner-Zero is the first complete open reproduction of reinforcement learning training directly on a base model, with strong reported results [34][36].
- DAPO explores modifications to GRPO that better suit reasoning training and presents a large-scale, open-source LLM reinforcement learning system [48][54].

Group 2: GRPO and Its Characteristics
- GRPO is closely related to PPO (Proximal Policy Optimization) and shares strong similarities with RLOO (REINFORCE Leave-One-Out); notably, many leading research works do not use GRPO at all [11][12][9].
- The core takeaway is that current RL algorithms are highly similar in implementation; GRPO is popular but not fundamentally revolutionary [15][6].
- GRPO's clever modifications target reasoning training rather than traditional RLHF: multiple answers are sampled per prompt and judged relative to one another (see the advantage sketch after this summary) [13][12].

Group 3: Training Techniques and Strategies
- Kimi k1.5's training builds on supervised fine-tuning (SFT) and emphasizes behavior patterns such as planning, evaluation, reflection, and exploration [23][24].
- Its recipe includes a curriculum-style sequencing strategy that starts with simpler tasks and gradually increases difficulty, akin to how humans learn (see the curriculum sketch below) [27][28].
- The paper also stresses that the data distribution and the quality of prompts are critical for effective reinforcement learning [22][41].

Group 4: DAPO Improvements
- DAPO decouples the clipping range into two separate hyperparameters to improve the model's learning dynamics and efficiency (see the clipping sketch below) [54][60].
- It also applies dynamic sampling, removing prompts whose sampled answers all receive the same flat reward from the batch to speed up learning (see the filtering sketch below) [63].
- It proposes a token-level loss instead of a per-response loss to better manage learning dynamics and avoid problems with long responses (see the loss-aggregation sketch below) [64][66].

Group 5: Dr. GRPO Modifications
- Dr. GRPO modifies GRPO to improve learning dynamics, achieving stronger performance with shorter generated responses [76][79].
- The modifications change how advantages are normalized across the tokens of a response, which keeps the learning signal well behaved (see the Dr. GRPO sketch below) [80][81].
- The paper also highlights that high-quality data engineering absorbs much of the effect of these changes and emphasizes the need for a balanced distribution of problem difficulty [82][89].
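The group-relative advantage is the piece of GRPO most relevant to Group 2. Below is a minimal sketch, assuming one binary correctness reward per sampled answer and a small epsilon for numerical stability; it illustrates the general recipe rather than any paper's exact implementation.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    `rewards` holds one scalar reward per sampled answer to the same prompt
    (shape [G]). Instead of a learned value function as in PPO, each answer's
    advantage is its reward standardized against the group's mean and std.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one reasoning prompt, rewarded 1 if correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_group_advantages(rewards))  # correct answers receive positive advantage
```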
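Group 3 describes an easy-to-hard sequencing of training problems. The toy curriculum sampler below is only a sketch of that idea; the `difficulty` field and the staging scheme are assumptions for illustration, not details taken from the Kimi k1.5 paper.

```python
import random

def curriculum_batches(problems, num_stages: int = 3, batch_size: int = 16, seed: int = 0):
    """Yield batches from an easy-to-hard curriculum.

    `problems` is assumed to be a list of dicts with a numeric "difficulty"
    key. Early stages draw only from the easiest slice of the pool; later
    stages widen the pool until every problem is eligible.
    """
    rng = random.Random(seed)
    ordered = sorted(problems, key=lambda p: p["difficulty"])
    for stage in range(1, num_stages + 1):
        pool = ordered[: max(1, len(ordered) * stage // num_stages)]  # slice copy
        rng.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield pool[i : i + batch_size]
```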
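The decoupled clipping mentioned in Group 4 can be written as an ordinary PPO-style surrogate with separate lower and upper bounds. A sketch follows; the default values of `eps_low` and `eps_high` are illustrative placeholders, not numbers quoted from the DAPO paper.

```python
import torch

def clipped_surrogate(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style surrogate with decoupled clipping bounds (the "clip-higher" idea).

    `ratio` is pi_theta / pi_old per token and `advantage` is the per-token
    advantage. A looser upper bound (1 + eps_high) than lower bound (1 - eps_low)
    lets low-probability tokens gain probability more aggressively, which is the
    exploration-preserving effect attributed to DAPO.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return torch.minimum(unclipped, clipped)  # maximize this (negate for a loss)
```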
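Dynamic sampling, as summarized in Group 4, amounts to dropping prompts whose sampled answers cannot produce a learning signal. A minimal sketch, assuming each group carries its list of scalar rewards:

```python
def keep_informative_groups(groups):
    """Drop prompt groups whose rewards are flat (all equal).

    With group-relative advantages, a prompt whose sampled answers are all
    correct or all wrong yields zero advantage for every answer and therefore
    no gradient; filtering such groups keeps every batch informative.
    `groups` is assumed to be a list of dicts like
    {"prompt": ..., "answers": [...], "rewards": [1.0, 0.0, ...]}.
    """
    return [g for g in groups if max(g["rewards"]) != min(g["rewards"])]
```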
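The difference between a per-response loss and a token-level loss lies only in how masked token losses are averaged. A sketch under the assumption of a [batch, seq_len] loss tensor and a 0/1 completion mask:

```python
import torch

def sample_level_loss(token_losses: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-response aggregation: mean over tokens within each response, then mean
    over responses. Every response carries equal weight, so tokens in long
    responses are individually down-weighted."""
    per_response = (token_losses * mask).sum(dim=1) / mask.sum(dim=1)
    return per_response.mean()

def token_level_loss(token_losses: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Token-level aggregation (the DAPO-style choice): mean over every generated
    token in the batch, so each token contributes equally regardless of how long
    its response is."""
    return (token_losses * mask).sum() / mask.sum()
```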
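The Dr. GRPO changes in Group 5 are usually described as two normalization tweaks to the GRPO loss. The sketch below reflects that common description (dropping the per-group std division and replacing per-response length normalization with a constant budget) and should be read as an approximation, not the paper's exact code.

```python
import torch

def grpo_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Original GRPO: center by the group mean and divide by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    # Dr. GRPO (as commonly described): keep only the mean-centering and drop
    # the std division, which can inflate the signal on very easy or very hard prompts.
    return rewards - rewards.mean()

def dr_grpo_loss(token_losses: torch.Tensor, mask: torch.Tensor, max_len: int) -> torch.Tensor:
    # Replace per-response length normalization with a constant budget
    # (e.g. the maximum generation length), so longer responses are not
    # implicitly down-weighted per token.
    return ((token_losses * mask).sum(dim=1) / max_len).mean()
```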