Group Relative Policy Optimization (GRPO)
Training cost of $294,000: DeepSeek-R1 lands on the cover of Nature as the first mainstream large model to pass peer review at an authoritative journal, drawing praise
36Kr · 2025-09-18 07:55
Core Insights
- DeepSeek-R1's research results have been published in Nature, marking it as the first mainstream large model to undergo peer review by a reputable journal, which has sparked significant discussion in the academic community [1][14][17]
- The training cost of DeepSeek-R1 is reported to be only $294,000, significantly lower than the tens of millions typical for leading models, although approximately $6 million was invested in the foundational LLM [1][2][17]

Training Costs
- The training costs for DeepSeek-R1 break down as follows:
  - DeepSeek-R1-Zero: $202,000
  - SFT data creation: $10,000
  - DeepSeek-R1: $82,000
  - Total: $294,000
- Training utilized 648 H800 GPUs, running for approximately 198 hours for DeepSeek-R1-Zero and around 80 hours for DeepSeek-R1 [2]

Reinforcement Learning and Reasoning Capabilities
- The model employs Group Relative Policy Optimization (GRPO) to enhance reasoning capabilities without traditional supervised fine-tuning, allowing for more exploratory learning [3][4]
- DeepSeek-R1-Zero demonstrates complex reasoning behaviors, generating longer responses that incorporate verification and exploration of different solutions [4][6]

Performance Metrics
- DeepSeek-R1-Zero achieved a pass@1 score of 77.9% on the AIME 2024 math competition, improving to 86.7% with self-consistency decoding strategies, surpassing average human performance [6][8]
- The model also excelled in programming competitions and on graduate-level questions in biology, physics, and chemistry, validating the effectiveness of reinforcement learning in enhancing reasoning capabilities [6]

Development Pipeline
- The development of DeepSeek-R1 involved multiple stages, starting from data collection based on human-like dialogue through reinforcement learning and sampling, ultimately enhancing the model's utility and safety [9][11]
- Experimental results indicate significant improvements in instruction execution across the development stages,
with DeepSeek-R1 outperforming its predecessors in benchmark tests [11][13]

Industry Impact
- The peer review of DeepSeek-R1 is seen as a positive trend for AI research, promoting the transparency and standardization that many mainstream AI models have lacked [14][16][17]
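The reported cost breakdown can be sanity-checked with simple arithmetic; the figures below are taken directly from the article, and the sum should match the quoted total.

```python
# Arithmetic check of the training-cost breakdown reported for DeepSeek-R1.
costs = {
    "DeepSeek-R1-Zero": 202_000,   # USD
    "SFT data creation": 10_000,   # USD
    "DeepSeek-R1": 82_000,         # USD
}
total = sum(costs.values())
print(total)  # 294000, matching the quoted $294,000 total
```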
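The core idea behind the GRPO method mentioned above is to replace a learned value baseline with rewards normalized within a group of sampled responses. A minimal illustrative sketch, assuming a simple correct/incorrect reward signal (the group size and reward values here are invented, not DeepSeek's actual setup):

```python
# Sketch of group-relative advantage estimation, the central idea of GRPO:
# each sampled response is scored against the other responses in its group,
# so no separate value network is needed.

def group_relative_advantages(rewards):
    """Normalize each reward against the group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

# One prompt, a hypothetical group of G = 4 sampled answers scored by a
# rule-based verifier (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # [1.0, -1.0, -1.0, 1.0]
```

Correct answers receive positive advantages and incorrect ones negative, which is what lets the policy learn from outcome rewards alone, without supervised fine-tuning.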
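The two evaluation figures above correspond to single-sample accuracy (pass@1) and a self-consistency decode, which takes a majority vote over several sampled answers. A minimal sketch with hypothetical answer strings (the real benchmark is AIME 2024):

```python
# Sketch of pass@1 vs. self-consistency (majority-vote) evaluation.
from collections import Counter

def pass_at_1(samples, correct):
    """Fraction of individual samples whose final answer is correct."""
    return sum(1 for s in samples if s == correct) / len(samples)

def majority_vote(samples):
    """Self-consistency: return the most frequent final answer."""
    return Counter(samples).most_common(1)[0][0]

samples = ["204", "113", "204", "204", "17"]  # hypothetical final answers
print(pass_at_1(samples, "204"))  # 0.6
print(majority_vote(samples))     # 204
```

Majority voting can be right even when most individual samples would fail, which is why the self-consistency score (86.7%) exceeds the pass@1 score (77.9%).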