DeepSeek Founder Liang Wenfeng Responds to Skepticism in Nature: R1 Really Was Trained for $294,000
Xin Lang Cai Jing · 2025-09-19 00:03
Core Insights
- DeepSeek-R1 has made a significant impact in the AI field by being featured on the cover of Nature, highlighting its approach of strengthening the reasoning capabilities of large language models (LLMs) through reinforcement learning (RL) [1][3][5].

Group 1: Achievements and Recognition
- The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" was published in January and has now been recognized on the cover of Nature, a leading journal [3].
- After its open-source release, DeepSeek-R1 became the most popular model on Hugging Face, with more than 10.9 million downloads [5].
- The training cost of DeepSeek-R1 was remarkably low at $294,000, far below the costs reported by competitors such as OpenAI and Google [6][7].

Group 2: Training Methodology
- DeepSeek-R1 uses an RL framework that constrains only the task format and rewards the model solely on the correctness of the final answer, allowing reasoning capabilities to develop more organically [10]. A minimal reward sketch follows this summary.
- During training, the model's reasoning accuracy rose from 15.6% to 77.9%, and reached a peak of 86.7% when combined with "self-consistent decoding" [10].

Group 3: Self-Evolution and Advanced Strategies
- The model exhibited self-evolution behaviors, such as generating progressively longer reasoning text and adopting advanced strategies like self-reflection and systematic exploration of alternative solutions [12][14].
- A notable "Aha Moment" was observed when the model began using the word "wait" markedly more often, signaling a shift in its reasoning approach [15][17].

Group 4: Future Development Plans
- To address the limitations of DeepSeek-R1, a multi-stage refinement plan was initiated: a cold start on high-quality conversational data, followed by multiple rounds of RL and supervised fine-tuning [18][19].
- After this multi-stage training, the model's performance improved by 17%-25% on various benchmarks [21].

Group 5: Algorithm and Reward System
- DeepSeek employs the GRPO (Group Relative Policy Optimization) algorithm, which scores a group of sampled answers against one another rather than evaluating a single best answer, reducing resource consumption while maintaining training stability [23][24]. A sketch of the group-relative advantage computation appears after this summary.
- A dual reward system combines rule-based rewards for reasoning tasks with model-based rewards for general tasks, keeping the model aligned with human preferences without degrading its reasoning capabilities [25][26].

Group 6: Challenges and Limitations
- Despite its advances, DeepSeek-R1 still struggles with structured outputs and tool usage, and it is sensitive to prompts, which limits its effectiveness in complex scenarios [35][37].
- Reward hacking remains a risk, particularly on subjective tasks, where weak reward signals could undermine the model's performance [37].
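The rule-based reward described in Groups 2 and 5 can be illustrated with a short sketch. The article only states that reasoning tasks are rewarded on task format and final-answer correctness; the function names, the <think>/<answer> tag layout, the regex checks, and the equal weighting below are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>...</think><answer>...</answer>
    layout, else 0.0. The exact template is an assumption for illustration."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Combine the format and correctness signals; the 1:1 weighting is an assumption."""
    return format_reward(completion) + accuracy_reward(completion, reference_answer)
```

For general (non-reasoning) tasks the article says a model-based reward is used instead, so a full system would dispatch between this rule-based score and a learned reward model depending on the task type.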
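The GRPO idea summarized in Group 5, scoring a group of sampled answers against each other instead of judging a single best answer with a separate value model, can be sketched as a group-relative advantage computation: each sample's reward is standardized against the mean and standard deviation of its own group. The function name, the epsilon term, and the toy rewards below are assumptions for illustration.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Convert the rewards of one group of answers (all sampled for the same prompt)
    into advantages by standardizing against the group's own statistics.
    The group itself serves as the baseline, so no learned value model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four answers sampled for one prompt, two of them correct (reward 1.0).
# Correct answers receive positive advantages and incorrect ones negative,
# which the subsequent policy-gradient update amplifies or suppresses.
if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))  # approx. [1.0, -1.0, 1.0, -1.0]
```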