DeepSeek drops a new paper, and Altman immediately follows up: GPT-5 is just a few months away
量子位·2025-04-05 04:45

Core Viewpoint - The article covers two recent AI developments: DeepSeek's new research paper on inference-time scaling for reward modeling, and OpenAI's updated release timeline for its upcoming models [2][4][12].

Group 1: OpenAI's Model Release Updates
- OpenAI plans to release o3 and o4-mini within a few weeks, with GPT-5 expected a few months later and promising better performance than initially anticipated [3][4].
- The delay in GPT-5's release is attributed to the difficulty of integrating all components effectively; OpenAI also wants to ensure sufficient capacity to meet expected demand [6][8].

Group 2: DeepSeek's Research Contributions
- DeepSeek, in collaboration with Tsinghua University, introduced SPCT (Self-Principled Critique Tuning), a method for enhancing reward modeling in reinforcement learning [10][12].
- The research addresses limitations of existing reward models, particularly their lack of flexibility and accuracy on complex tasks [14][16].
- SPCT consists of three core technical points:
  1. A Generative Reward Model (GRM) that generates textual critiques instead of scalar scores, allowing flexible inputs and inference-time scaling [20][21].
  2. Online reinforcement learning that dynamically generates high-quality principles and critiques, improving reward quality [22].
  3. Inference-time scaling that samples diverse principles and critiques and aggregates them to expand the reward space [23][24].

Group 3: Performance Metrics
- DeepSeek's GRM-27B model significantly outperformed baseline methods across benchmarks, with Reward Bench accuracy rising from 86.0% to 90.4% under inference-time scaling [27][28].
- The results indicate that inference-time scaling is effective for general reward modeling and can surpass training-time scaling [28].
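The inference-time scaling idea described above can be sketched in a few lines: instead of asking the reward model once, sample several independent critiques of the same response and aggregate their scores into a finer-grained reward. This is a minimal illustration, not DeepSeek's implementation; `sample_critique` is a hypothetical stand-in for one GRM generation pass (which in the paper produces principles and a textual critique before a score is extracted), and simple score summation stands in for the paper's voting/aggregation step.

```python
import random
import zlib

def sample_critique(response: str, seed: int) -> int:
    """Hypothetical stand-in for one GRM sampling pass.

    In the actual method, the model generates principles and a
    critique, then a discrete score is parsed from the text. Here we
    just derive a deterministic pseudo-random score in 1..10 from the
    (response, seed) pair so the sketch is runnable.
    """
    rng = random.Random(zlib.crc32(f"{seed}:{response}".encode()))
    return rng.randint(1, 10)

def scaled_reward(response: str, k: int = 8) -> int:
    """Inference-time scaling: draw k independent critiques and
    aggregate their scores (here by summing), yielding a reward with
    finer granularity than any single sample."""
    scores = [sample_critique(response, seed=i) for i in range(k)]
    return sum(scores)
```

With larger `k`, the aggregated reward averages out the noise of any single critique, which is the intuition behind accuracy improving as more inference-time samples are drawn.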