Clearing out the backlog: DeepSeek abruptly completes its R1 technical report, publicly detailing the training path for the first time
36Kr · 2026-01-09 03:12

Core Insights
- DeepSeek has released an updated version of its research paper on the R1 model, adding 64 pages of technical detail that substantially expand the original content [4][25]
- The new version emphasizes the implementation details of the R1 model, presenting its training process as a systematic, step-by-step pipeline [4][6]

Summary by Sections

Paper Update
- The updated paper has grown from 22 pages to 86 pages, giving a comprehensive view of the R1 model's training and operational details [4][25]
- The new version breaks the training process into four main steps: cold start, reasoning-oriented reinforcement learning (RL), rejection sampling and fine-tuning, and alignment-oriented RL [6][9]

Training Process
- The cold-start phase performs supervised fine-tuning (SFT) on thousands of chain-of-thought (CoT) examples [6]
- The reasoning-oriented RL phase strengthens the model's capabilities and introduces a language-consistency reward to curb mixed-language output [6]
- The rejection-sampling and fine-tuning phase combines reasoning data with general data to improve the model's writing and reasoning abilities [6]
- The alignment-oriented RL phase refines the model's helpfulness and safety so that it aligns more closely with human preferences [6]

Safety Measures
- DeepSeek has built a risk-control system to improve the safety of the R1 model, including a dataset of 106,000 prompts used to evaluate model responses against predefined safety criteria [9][10]
- The safety reward model is trained point-wise to distinguish safe from unsafe responses, with training hyperparameters matching those of the helpfulness reward model [9]
- The risk-control system operates through two main processes: filtering of potentially risky dialogues and model-based risk review [9][10]

Performance Metrics
- The introduction of the risk-control system has markedly improved the model's safety performance, with R1 achieving benchmark scores comparable to leading models [14]
- DeepSeek has built an internal safety-evaluation dataset organized into four main categories and 28 subcategories, totaling 1,120 questions [19]

Team Stability
- The core contributors to the DeepSeek team have largely stayed: only five of the more than 100 authors have left, indicating strong retention in a competitive AI industry [21][24]
- Notably, one previously departed author has returned to the team, a positive dynamic compared with other companies in the sector [24]
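The language-consistency reward mentioned under Training Process can be pictured as a script-ratio heuristic: score a response by how much of it stays in the expected language. The sketch below is an illustrative assumption, not the paper's implementation; the function name and the ASCII-vs-non-ASCII test are hypothetical stand-ins for real language identification.

```python
def language_consistency_reward(text: str, target: str = "en") -> float:
    """Toy language-consistency reward: the fraction of alphabetic
    characters drawn from the target script. A real implementation
    would use proper language identification; this ASCII-vs-non-ASCII
    split is only an illustrative assumption."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    if target == "en":
        matches = sum(ch.isascii() for ch in letters)
    else:
        matches = sum(not ch.isascii() for ch in letters)
    return matches / len(letters)
```

A fully English answer scores 1.0 and a mixed-language answer scores lower, so adding a term like this to the RL reward would push the policy toward single-language output.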
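The rejection-sampling and fine-tuning step described above can be sketched as: draw several candidate responses per prompt, score them with a reward model, and keep only the best as supervised fine-tuning data. In this sketch, `generate` and `reward` are hypothetical placeholders for the policy model and reward model, not DeepSeek's actual API.

```python
def rejection_sample(prompt, generate, reward, k=8, keep=2):
    """Draw k candidate responses for a prompt, score each with a
    reward function, and keep the top `keep` as (prompt, response)
    pairs for a later SFT round. `generate` and `reward` are
    placeholder callables (assumptions for illustration)."""
    candidates = [generate(prompt) for _ in range(k)]
    best = sorted(candidates, key=reward, reverse=True)[:keep]
    return [(prompt, resp) for resp in best]
```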
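"Point-wise" training of the safety reward model means each response is labeled safe or unsafe on its own, rather than being compared pair-wise against an alternative response. A minimal sketch of such an objective is binary cross-entropy on a scalar safety score; the specific loss form here is an assumption for illustration, not a detail from the report.

```python
import math

def pointwise_safety_loss(score: float, is_safe: int) -> float:
    """Point-wise objective sketch: the reward model emits one scalar
    score per (prompt, response) pair; a sigmoid maps it to a safety
    probability and binary cross-entropy is taken against the 0/1
    label. Unlike pair-wise preference losses, no second response
    is needed."""
    p = 1.0 / (1.0 + math.exp(-score))
    eps = 1e-12  # guard against log(0)
    return -(is_safe * math.log(p + eps) + (1 - is_safe) * math.log(1 - p + eps))
```

The loss falls as the score rises for a safe example and as it falls for an unsafe one, so minimizing it teaches the model to separate the two classes on a single scalar axis.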
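The two processes of the risk-control system, filtering of potentially risky dialogues and model-based risk review, suggest a cheap-filter-then-model pipeline. The sketch below is a guess at that shape under stated assumptions: the function names, the `refuse`/`allow` outcomes, and the threshold are all hypothetical, not DeepSeek's actual system.

```python
def risk_control(prompt: str, response: str, keyword_filter, risk_model,
                 threshold: float = 0.5) -> str:
    """Two-stage sketch: a fast rule-based filter screens the dialogue
    first; anything it flags, or that the model-based reviewer scores
    at or above the threshold, is refused. Every component here is a
    hypothetical stand-in."""
    if keyword_filter(prompt) or keyword_filter(response):
        return "refuse"  # stage 1: potential-risk dialogue filtering
    if risk_model(prompt, response) >= threshold:
        return "refuse"  # stage 2: model-based risk review
    return "allow"
```

Running the cheap filter first keeps the expensive model review off the common path, which is the usual rationale for layering the two stages this way.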