Inference-Time Scaling
Why Hasn't DeepSeek-R2 Been Released Yet?
猿大侠· 2025-06-27 14:57
Core Viewpoint
- The release of DeepSeek-R2 has been delayed due to CEO Liang Wenfeng's dissatisfaction with its performance and a shortage of Nvidia H20 chips, which are critical for its development [1][2][4].

Group 1: Development Timeline
- Anticipation for R2 began after the release of the DeepSeek-V3 model in December last year, which was considered a benchmark for cost-performance [5].
- Initial expectations suggested that R2 would launch in April, following the upgrade of V3 on March 24 [11].
- Despite the release of a paper on inference scaling in April, there has been no official update on R2's launch [12][16].

Group 2: Technical Specifications
- R1's training used 30,000 H20 chips, 10,000 H800 chips, and 10,000 H100 chips, indicating the scale of computational resources R2 is likely to require [3].
- Leaked parameters for R2 suggested it would have 1.2 trillion parameters and use 5.2 petabytes of training data, raising questions about its hardware requirements (a rough estimate is sketched after this summary) [17].

Group 3: Community Reactions
- Following news of the delay, community responses varied: some expressed belief that the delay would be worthwhile, while others speculated that R2 might wait for the release of V4 [26][28].
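The hardware question raised by those leaked figures can be illustrated with a rough back-of-envelope calculation. This is only a sketch under stated assumptions: the 1.2-trillion-parameter count is an unconfirmed leak, the 96 GB per-card memory is an assumed H20-class figure rather than something from the article, and everything beyond raw weight storage (KV cache, activations, optimizer state) is ignored.

```python
import math

# Back-of-envelope estimate of the memory footprint implied by the leaked
# (and unconfirmed) 1.2-trillion-parameter figure. The 96 GB per card is an
# assumption about H20-class HBM capacity, not a number from the article,
# and KV cache / activations / optimizer state are ignored entirely.

PARAMS = 1.2e12                                # rumored total parameter count
BYTES_PER_PARAM = {"fp16/bf16": 2, "fp8": 1}   # storage cost per parameter
GPU_MEM_GB = 96                                # assumed memory per accelerator

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    min_gpus = math.ceil(weights_gb / GPU_MEM_GB)
    print(f"{precision}: ~{weights_gb / 1000:.1f} TB of weights, "
          f"at least {min_gpus} cards just to hold them")
```

Even under these optimistic assumptions, holding the weights alone would require dozens of H20-class cards, which is consistent with the article's framing of the H20 shortage as a blocker.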
Right after DeepSeek publishes a new paper, Altman follows up: GPT-5 is just a few months away
量子位· 2025-04-05 04:45
Core Viewpoint
- The article covers recent developments in AI, focusing on DeepSeek's new research paper on inference-time scaling and OpenAI's announcement of the release timeline for its upcoming models [2][4][12].

Group 1: OpenAI's Model Release Updates
- OpenAI plans to release o3 and o4-mini in a few weeks, with GPT-5 expected in a few months and promising better performance than initially anticipated [3][4].
- The delay in GPT-5's release is attributed to the difficulty of integrating all components smoothly, as OpenAI wants to ensure sufficient capacity to meet expected demand [6][8].

Group 2: DeepSeek's Research Contributions
- DeepSeek, in collaboration with Tsinghua University, introduced a method called SPCT (Self-Principled Critique Tuning) aimed at enhancing reward modeling in reinforcement learning [10][12].
- The research addresses limitations of existing reward models, particularly their limited flexibility and accuracy on complex tasks [14][16].
- SPCT consists of three core technical points:
  1. A Generative Reward Model (GRM) that outputs critiques instead of scalar values, allowing flexible inputs and inference-time scaling [20][21].
  2. Online reinforcement learning that dynamically generates high-quality principles and critiques, improving reward quality [22].
  3. Inference-time scaling that samples diverse principles and critiques to expand the reward space (see the sketch after this summary) [23][24].

Group 3: Performance Metrics
- DeepSeek's GRM-27B model significantly outperformed baseline methods across benchmarks, with Reward Bench accuracy rising from 86.0% to 90.4% through inference-time scaling [27][28].
- The results indicate that inference-time scaling is effective for general reward modeling, surpassing training-time scaling [28].
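To make point 3 above concrete, here is a minimal sketch of inference-time scaling for a generative reward model, under the assumption that each sampled critique assigns one score per candidate response and that samples are aggregated by simple voting. `generate_critique` and `toy_critique` are hypothetical stand-ins, not DeepSeek's GRM API, and the sketch omits SPCT's training side (the online RL that teaches the model to produce good principles and critiques).

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

# A critique here is free-form text plus one integer score per candidate response.
Critique = Tuple[str, List[int]]

def scale_at_inference(
    generate_critique: Callable[[str, List[str]], Critique],
    query: str,
    responses: List[str],
    num_samples: int = 8,
) -> int:
    """Sample several critiques and return the index of the response that wins most often."""
    votes: Counter = Counter()
    for _ in range(num_samples):
        _text, scores = generate_critique(query, responses)  # one sampled critique
        votes[max(range(len(responses)), key=lambda i: scores[i])] += 1
    return votes.most_common(1)[0][0]

# Toy stand-in for the reward model so the sketch runs end to end.
def toy_critique(query: str, responses: List[str]) -> Critique:
    scores = [len(r) + random.randint(-3, 3) for r in responses]  # noisy length heuristic
    return ("longer answers look more complete", scores)

if __name__ == "__main__":
    best = scale_at_inference(toy_critique, "What is 2 + 2?",
                              ["4", "It is 4, because 2 + 2 = 4."])
    print("preferred response index:", best)
```

The design point being illustrated is that raising `num_samples` trades extra inference compute for a more reliable reward signal, which is the mechanism behind the Reward Bench gain from 86.0% to 90.4% reported above.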