DeepSeek-V3.2 Devours Tokens: Turns Out It Was Backstabbed by GRPO
36Kr· 2025-12-04 10:38
Core Insights
- The release of DeepSeek-V3.2 has drawn significant industry attention, highlighting both its capabilities and its weak spots, particularly token efficiency and output verbosity [1][2][5].

Token Efficiency
- DeepSeek-V3.2 Speciale consumes tokens inefficiently, requiring 77,000 tokens for complex tasks where Gemini needs 20,000, i.e., more than three times the token usage for output of similar quality [1][5].
- Users note that raising Speciale's generation speed from roughly 30 tokens per second to around 100 tokens per second would significantly improve usability and experience [5].

Output Quality
- The Speciale version has been criticized for producing long, verbose outputs that are often wrong, a problem attributed to inherent flaws in the GRPO algorithm [2][14].
- DeepSeek's technical report acknowledges the increased inference-time token consumption: the Speciale version used 86 million tokens in benchmark tests, up from 62 million for the previous version [7][14].

Algorithmic Issues
- GRPO, long a standard in reinforcement learning, is identified as a source of bias toward longer, incorrect responses: under its length bias, shorter correct responses receive larger updates while longer incorrect responses face weaker penalties [18][21].
- The difficulty bias has been optimized in DeepSeek-V3.2, but the length bias remains and likely contributes to the excessive token consumption observed in the Speciale version [18][21].
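The length bias can be made concrete with a toy calculation. The sketch below uses hypothetical numbers; `grpo_token_weights` is an illustrative helper, not DeepSeek's code. It applies the standard GRPO group-relative advantage together with per-response length normalization, which is where the bias arises:

```python
import numpy as np

def grpo_token_weights(rewards, lengths):
    """Per-token update magnitude under GRPO: the group-relative advantage
    A_i = (r_i - mean(r)) / std(r) is spread over the |o_i| tokens of
    response i, so each token receives roughly A_i / |o_i|."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)        # group-relative advantage
    return adv / np.asarray(lengths, dtype=float)  # per-token weight

# Two correct answers (reward 1) and two wrong ones (reward 0),
# at lengths of 100 vs 1000 tokens.
w = grpo_token_weights(rewards=[1, 1, 0, 0], lengths=[100, 1000, 100, 1000])

# Length bias: each token of the short correct answer is reinforced 10x
# more strongly than each token of the long correct one, while the long
# wrong answer's per-token penalty is 10x weaker than the short wrong one's.
assert np.isclose(abs(w[0]), 10 * abs(w[1]))
assert np.isclose(abs(w[3]), abs(w[2]) / 10)
```

In other words, length normalization dilutes the penalty on every token of a long wrong answer, so "long and wrong" trajectories are punished only weakly.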
DeepSeek-V3.2 Devours Tokens: Turns Out It Was Backstabbed by GRPO
机器之心· 2025-12-04 08:18
Core Insights
- The article discusses the release of DeepSeek-V3.2, highlighting its performance issues, particularly token consumption and output verbosity, which have raised concerns among users and researchers [1][2][6].

Token Consumption and Efficiency
- DeepSeek-V3.2 Speciale uses tokens inefficiently, consuming 77,000 tokens on tasks where Gemini needs only 20,000, i.e., more than three times the expenditure for results of similar quality [1][6].
- Generation speed is approximately 30 tokens per second; users note that an increase to around 100 tokens per second would significantly improve usability and experience [6].

Output Quality and Verbosity
- The Speciale version tends to produce long, verbose outputs that are often wrong, which is attributed to inherent flaws in the GRPO algorithm [2][15].
- In benchmark tests the model posts a median score of 76.38, with a median difference of 11.07% versus other models, indicating a notable efficiency gap [7].

Comparison with Other Models
- During inference on benchmarks, Speciale's token consumption is reported to be significantly higher than its predecessor's: 86 million tokens versus 62 million [7][10].
- Its performance metrics lag competitors such as Gemini-3.0 Pro in output token latency and efficiency [10][12].

Algorithmic Limitations
- GRPO, which underpins DeepSeek's training, has been criticized for introducing biases toward longer and often incorrect responses, a problem that persists in the latest model [16][20].
- Length bias causes the model to generate longer responses even when they are wrong, and has been identified as a primary driver of Speciale's high token consumption [20][23].
Future Directions
- The developers acknowledge improved token efficiency as a critical direction for future research, aiming to balance performance and cost in subsequent model iterations [14][23].
A Bug Found in DeepSeek-V3.2: It Burns Tokens Like Crazy and Answers May Still Be Wrong; Researchers Say the Old GRPO Problem Was Never Solved
36Kr· 2025-12-04 02:21
Core Insights
- DeepSeek-V3.2 has gained significant attention but still exhibits bugs, particularly in token efficiency, a longstanding issue [1][4].

Group 1: Performance Issues
- The Speciale version of DeepSeek-V3.2 consumes far more tokens on complex tasks, requiring 77,000 tokens where Gemini needs 20,000 for the same problem [4].
- The model has a "length bias": longer incorrect answers are penalized less, which encourages verbose but wrong responses [8][11].

Group 2: Algorithmic Biases
- GRPO carries two hidden biases, length bias and difficulty bias. Length bias favors longer incorrect answers, while difficulty bias makes the model focus excessively on overly simple or overly difficult questions, neglecting the medium-difficulty questions that are crucial for skill improvement [8][9].
- Zichen Liu, the core author of the research, noted that while the new advantage calculation has corrected the difficulty bias, the length bias remains unaddressed [10][11].

Group 3: Token Efficiency and Cost
- DeepSeek's official report acknowledges that token efficiency is still a challenge for V3.2, as the new models must generate longer trajectories to match the output quality of Gemini-3.0-Pro [14].
- Despite the high token consumption, DeepSeek-V3.2 is priced at only 1/24th of GPT-5, which keeps its cost relatively acceptable [14].
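The difficulty bias mentioned above comes from GRPO dividing each group's advantages by the group's reward standard deviation; the "new advantage calculation" fixes of the Dr. GRPO style drop that division. A small sketch with hypothetical reward groups (illustrative only, not DeepSeek's implementation):

```python
import numpy as np

def group_advantages(rewards, normalize_std=True):
    """Group-relative advantages for one question's sampled answers.
    GRPO divides by the group's reward std; Dr. GRPO-style fixes drop it."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()
    if normalize_std:
        adv = adv / (r.std() + 1e-4)
    return adv

# A very easy question (7 of 8 samples correct) vs a medium one (4 of 8).
easy   = [1, 1, 1, 1, 1, 1, 1, 0]
medium = [1, 1, 1, 1, 0, 0, 0, 0]

# With std normalization the easy question dominates: its reward std is
# small (~0.33 vs 0.5), so its advantages are scaled up the most.
assert np.abs(group_advantages(easy)).max() > np.abs(group_advantages(medium)).max()

# Without it, per-sample magnitudes depend only on distance from the mean,
# so easy and hard questions no longer get amplified updates.
assert np.abs(group_advantages(medium, normalize_std=False)).max() == 0.5
```

Nearly-all-correct (or nearly-all-wrong) groups have tiny reward variance, so dividing by the std inflates their updates, which is exactly the over-focus on very easy and very hard questions described above.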
A Bug Found in DeepSeek-V3.2: It Burns Tokens Like Crazy and Answers May Still Be Wrong; Researchers Say the Old GRPO Problem Was Never Solved
量子位· 2025-12-03 09:05
Core Viewpoint
- DeepSeek-V3.2 has gained significant attention but has been found to have issues, particularly heavy token consumption on complex tasks, leading to longer and potentially incorrect answers [1][4][5].

Group 1: Token Consumption Issues
- DeepSeek-V3.2's Speciale version consumes more tokens than competitors, using 77,000 tokens on certain tasks where Gemini uses only 20,000 [5].
- The model's reliance on the GRPO algorithm introduces a "length bias": longer incorrect answers are penalized less, resulting in "long and wrong" responses [10][11].

Group 2: Hidden Biases in the GRPO Algorithm
- GRPO carries two hidden biases, length bias and difficulty bias. Length bias favors longer incorrect answers, while difficulty bias makes the model focus excessively on overly simple or overly difficult questions, neglecting the medium-difficulty questions crucial for skill improvement [10][12].
- Despite attempts to address these biases, the length bias remains a challenge, as acknowledged in DeepSeek's technical report [13][15].

Group 3: Cost and Resource Considerations
- DeepSeek-V3.2's output cost is only 1/24 that of GPT-5, which may make it acceptable despite its token-efficiency issues [17].
- The model's 128K context length has not been extended in a long time, possibly owing to limited GPU resources [18].
Multimodal Large-Model Reinforcement Learning Training Framework: An EasyR1 Code Walkthrough (GRPO)
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article walks through the EasyR1 framework for multi-modal reinforcement learning, focusing on its implementation and configuration for training models such as Qwen2.5-VL [1][4][6].

Group 1: Framework Overview
- EasyR1 is derived from the verl framework, which targets language-based reinforcement learning [1][6].
- The code version referenced dates from roughly June 10, reflecting ongoing updates and improvements [1].

Group 2: Configuration Details
- The configuration file is organized into four main categories: data, algorithm, worker, and trainer, each with its own parameters [6][11].
- Data configurations include paths for training and validation files, maximum prompt and response lengths, and batch sizes for training iterations [9][10].
- Algorithm configurations specify the advantage estimator, discount factors, and KL-divergence settings [11][13].

Group 3: Training Workflow
- Training is launched from a main script that sets up the data loaders and starts the training loop [42][43].
- Each step prepares data, generates sequences, and computes rewards, with specific attention to balancing batch sizes across distributed processes [46][50][64].
- The article emphasizes handling multi-modal data so that the training process accommodates various input types [65][66].

Group 4: Data Handling
- The dataset must include specific keys such as problem, answer, and images, formatted as JSON for compatibility with the loading functions [40][41].
- Data loading supports multiple file formats and is designed to form a seamless training pipeline [32][41].

Group 5: Model Update Mechanism
- The article details the mechanism for updating the actor model: how policy loss is computed and how gradients are managed during training [82][86].
- It highlights the significance of KL divergence in the training process, particularly in relation to the reference model [71][80].
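The policy-loss-plus-KL update described above can be sketched in a few lines. This is a simplified, generic version of the loss used in verl-style GRPO trainers, not EasyR1's actual code; `grpo_policy_loss` and its arguments are illustrative, and the KL term uses the nonnegative "k3" estimator common in such stacks:

```python
import numpy as np

def grpo_policy_loss(logp, logp_old, logp_ref, adv, clip_eps=0.2, kl_coef=0.01):
    """PPO-style clipped surrogate per token, plus a KL penalty that pulls
    the policy toward the frozen reference model."""
    ratio = np.exp(logp - logp_old)                 # importance sampling ratio
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # k3 KL estimator: exp(ref - pi) - (ref - pi) - 1, always >= 0.
    kl = np.exp(logp_ref - logp) - (logp_ref - logp) - 1.0
    return -(surrogate - kl_coef * kl).mean()

# Toy per-token log-probs for one 16-token response.
rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.1, 16)
logp = logp_old + 0.05          # policy drifted slightly since the rollout
logp_ref = logp_old             # reference = pre-RL model
loss = grpo_policy_loss(logp, logp_old, logp_ref, adv=np.ones(16))
assert np.isfinite(loss) and loss < 0   # positive advantages => negative loss
```

Clipping bounds how far one update can move the policy from the rollout distribution, while the KL term keeps it anchored to the reference model, which is the role the walkthrough attributes to the reference model in EasyR1.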
Is the GRPO Used by DeepSeek Really That Special? A 10,000-Word Analysis of Four Excellent Papers
机器之心· 2025-05-24 03:13
Core Insights
- The article surveys recent advances in reasoning models, focusing on GRPO and its improved variants, and highlights the rapid evolution of AI at the intersection of reinforcement learning and reasoning [1][2][3].

Group 1: Key Papers and Models
- Kimi k1.5 is a newly released reasoning model that employs reinforcement learning, emphasizing long-context extension and improved policy optimization [10][17].
- OpenReasonerZero is the first complete reproduction of reinforcement-learning training on a base model, showing significant results [34][36].
- DAPO explores improvements to GRPO to better suit reasoning training, presented as a large-scale open-source LLM reinforcement-learning system [48][54].

Group 2: GRPO and Its Characteristics
- GRPO is closely related to PPO (Proximal Policy Optimization) and resembles RLOO (REINFORCE Leave-One-Out); notably, many leading research efforts do not use GRPO at all [9][11][12].
- The core takeaway is that current RL algorithms are highly similar in implementation: GRPO is popular but not fundamentally revolutionary [6][15].
- GRPO's clever modifications target reasoning training rather than traditional RLHF scenarios, centering on generating multiple answers per reasoning task [12][13].

Group 3: Training Techniques and Strategies
- Kimi k1.5's training involves supervised fine-tuning (SFT) and emphasizes behavior patterns such as planning, evaluation, reflection, and exploration [23][24].
- Its curriculum starts with simpler tasks and gradually increases complexity, akin to human learning [27][28].
- The paper stresses the importance of data distribution and prompt quality for effective reinforcement learning [22][41].
Group 4: DAPO Improvements
- DAPO introduces two distinct clipping hyperparameters to enhance the model's learning dynamics and efficiency [54][60].
- It employs dynamic sampling, removing samples with flat rewards from the batch to improve learning speed [63].
- It proposes token-level loss rather than per-response loss to better manage learning dynamics and avoid issues with long responses [64][66].

Group 5: Dr. GRPO Modifications
- Dr. GRPO modifies GRPO to achieve stronger performance with shorter generated lengths [76][79].
- The modifications normalize advantages across all tokens in a response, which helps manage the learning signal effectively [80][81].
- The paper highlights the importance of high-quality data engineering in absorbing the effects of these changes, emphasizing a balanced distribution of problem difficulty [82][89].
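The difference between GRPO's per-response averaging and DAPO's token-level loss comes down to how per-token losses are pooled across a group of sampled responses. A minimal sketch with hypothetical numbers (`aggregate_loss` is an illustrative helper, not code from either paper):

```python
import numpy as np

def aggregate_loss(token_losses, scheme="grpo"):
    """Pool per-token losses for a group of sampled responses.
    token_losses: list of 1-D arrays, one array per response.
    - "grpo":  average within each response first, then across responses,
               so every response counts equally regardless of length.
    - "token": DAPO-style token-level loss, pooling all tokens together,
               so long responses contribute proportionally more terms."""
    if scheme == "grpo":
        return float(np.mean([t.mean() for t in token_losses]))
    if scheme == "token":
        return float(np.concatenate(token_losses).mean())
    raise ValueError(f"unknown scheme: {scheme}")

short = np.full(10, 2.0)   # short response with high per-token loss
long_ = np.full(90, 0.0)   # long response with zero per-token loss

# GRPO averages per response first: (2.0 + 0.0) / 2 = 1.0
assert aggregate_loss([short, long_], "grpo") == 1.0
# Token-level pooling weights by token count: 20 / 100 = 0.2
assert aggregate_loss([short, long_], "token") == 0.2
```

Under per-response averaging, each token of a long response carries a smaller share of the loss than each token of a short one; token-level pooling gives every token equal weight, which is DAPO's stated motivation for avoiding pathologies with long responses.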