Post-training Scaling Law
DeepSeek-R1, Kimi 1.5, and Similar Strong-Reasoning Models: A Development Analysis
Peking University · 2025-03-05 10:54
Investment Rating
- The report does not explicitly state an investment rating for the industry or company.

Core Insights
- DeepSeek-R1 introduces a new paradigm of strong reasoning under reinforcement learning (RL), showcasing significant advancements in reasoning capabilities and long-text processing [4][7]
- The model demonstrates exceptional performance on complex tasks, marking a milestone in the open-source community's competition with closed-source models such as OpenAI's o1 series [7]
- The report highlights the potential of RL-driven models to enhance reasoning abilities without relying on human-annotated supervised fine-tuning [21][56]

Summary by Sections

Technical Comparison
- The report compares STaR-based methods with RL-based methods, emphasizing the advantages of RL in reasoning tasks [3]
- It details the innovative RL algorithms used, such as GRPO, which improve training efficiency and reduce computational costs [49][50]

DeepSeek-R1 Analysis
- DeepSeek-R1 Zero is built entirely on RL without supervised fine-tuning, showcasing its ability to develop reasoning capabilities autonomously [13][21]
- The model posts strong benchmark results, achieving 79.8% on AIME 2024 and 97.3% on MATH-500, comparable to OpenAI's models [7][15]

Insights and Takeaways
- The report emphasizes the importance of a robust base model, DeepSeek-V3, a 671-billion-parameter model trained on 14.8 trillion high-quality tokens, which underpins its significant reasoning capabilities [45][56]
- The use of rule-based rewards in training helps avoid reward hacking, enabling automated verification and annotation of reasoning tasks [17][22]

Future Directions
- The report discusses the potential for further advances in RL-driven models, suggesting that future training will increasingly focus on RL while still incorporating some supervised fine-tuning [56]
- It highlights the need for models to maintain high reasoning performance while ensuring safety and usability across diverse applications [59]

Economic and Social Benefits
- The exploration of low-cost, high-quality language models is expected to reshape industry dynamics, leading to increased competition and innovation [59]
- The report notes that capital-market volatility is a short-term phenomenon driven by rapid advances in AI technology, which will give way to a long-term arms race in computational resources [59]
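The GRPO algorithm cited above replaces a learned value critic with a group-relative baseline: several completions are sampled per prompt, and each completion's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation, assuming standard z-score normalization (function name and edge-case handling are illustrative, not from the report):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the spirit of GRPO:
    normalize each sampled completion's reward by the group
    mean and standard deviation, so no separate value network
    (critic) is needed as a baseline."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std == 0:
        # All completions scored identically: no learning signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std
```

This is one reason GRPO reduces training cost relative to PPO-style setups: the baseline comes for free from the sampled group rather than from a second model of comparable size.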
2025 Report: DeepSeek-R1 & Kimi 1.5 and Similar Strong-Reasoning Models, a Development Analysis
Peking University · 2025-03-04 01:35
Investment Rating
- The report does not explicitly provide an investment rating for the industry or company discussed.

Core Insights
- DeepSeek-R1 introduces a new paradigm of strong reasoning under reinforcement learning (RL), showcasing significant advancements in reasoning capabilities and long-text processing [4][7]
- The model demonstrates exceptional performance on complex tasks, marking a milestone in the open-source community's competition with closed-source models such as OpenAI's o1 series [7]
- The report emphasizes the importance of RL in enhancing model capabilities, particularly in mathematical reasoning and coding tasks, with DeepSeek-R1 achieving notable scores on various benchmarks [7][59]

Summary by Sections

Technical Comparison
- The report discusses the technical advances of DeepSeek-R1, including its architecture and the innovative RL algorithms employed, such as GRPO [3][4]
- It compares performance metrics against other models, highlighting DeepSeek-R1's superior capabilities on various reasoning tasks [6]

Insights and Takeaways
- The model's ability to self-iterate and enhance its reasoning capabilities through RL is emphasized, showcasing its potential for autonomous learning without reliance on supervised fine-tuning [21][56]
- The report outlines the significance of rule-based rewards in the training process, which help avoid the reward hacking commonly seen in traditional RL setups [16][23]

Future Directions
- The report suggests future work on enhancing model safety and usability, particularly in generating coherent and clear reasoning outputs [30][59]
- It highlights the potential for further advances in multi-modal reasoning and the integration of synthetic data to overcome data reproduction challenges [30][59]

Economic and Social Benefits
- The exploration of low-cost, high-quality language models is discussed, emphasizing the shift from model size toward computational resources and synthetic data as the drivers of expanding capabilities [59]
- The report notes the potential for increased market activity and innovation driven by accessible AI technologies, which could lead to a more diverse application landscape [59]
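Both summaries credit rule-based rewards with avoiding reward hacking: because the reward is computed by a deterministic verifier rather than a learned reward model, the policy cannot exploit a model's blind spots. A toy sketch of such a verifier, assuming an accuracy reward for a checkable final answer plus a small format reward for structured reasoning tags (the `<think>` tag, `\boxed{}` convention, and the 0.1 weight are illustrative assumptions, not details from the report):

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: deterministic checks only, no
    learned reward model, so there is nothing for the policy
    to game beyond actually satisfying the rules."""
    # Format reward: did the model wrap its reasoning in tags?
    fmt_ok = bool(re.search(r"<think>.*?</think>", completion, re.S))
    # Accuracy reward: does the final boxed answer match exactly?
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    accuracy = 1.0 if m and m.group(1).strip() == gold_answer.strip() else 0.0
    return accuracy + (0.1 if fmt_ok else 0.0)
```

Rewards of this shape are only available for automatically verifiable domains (math with a known answer, code with unit tests), which is consistent with the report's focus on mathematical reasoning and coding tasks.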