Reasoning Capability
Liang Wenfeng Publishes a Nature Cover Paper Revealing the Science Behind DeepSeek-R1: Reinforcement Learning Incentivizes Reasoning in Large Models
生物世界· 2025-09-18 01:44
Core Viewpoint
- DeepSeek-R1 is a reasoning model that substantially reduces computational cost while strengthening the reasoning abilities of large language models (LLMs) through pure reinforcement learning [1][2].

Group 1: Model Development and Training
- DeepSeek-R1 was launched by a Hangzhou-based startup on January 20, 2025, and drew global attention for its strong reasoning capabilities and low compute requirements [1].
- Training DeepSeek-R1 cost only $294,000, far below the tens of millions of dollars often spent on comparable models [2].
- The model relies on a pure reinforcement learning approach that minimizes dependence on human-annotated reasoning paths, allowing it to explore reasoning strategies more autonomously [6][10].

Group 2: Performance and Capabilities
- DeepSeek-R1-Zero, the precursor to DeepSeek-R1, showed large gains on reasoning tasks, raising its average pass@1 score on the 2024 American Invitational Mathematics Examination (AIME) from 15.6% to 77.9% (a minimal sketch of how pass@1 is estimated follows this summary) [17].
- The model also performed strongly on programming competitions and graduate-level problems in biology, physics, and chemistry, showing its versatility [19].
- Advanced reasoning behaviors such as self-verification and reflection emerged spontaneously during the reinforcement learning process rather than being explicitly taught [29].

Group 3: Challenges and Limitations
- DeepSeek-R1-Zero suffers from poor readability and language mixing, particularly when it switches between English and Chinese within a response [21].
- Because training focused on reasoning tasks, performance in broader domains such as writing and open-domain Q&A remains limited [22].
- Enhanced reasoning also carries ethical risks, including vulnerability to jailbreak attacks and the generation of dangerous content [27][28].
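The pass@1 figures quoted above are typically estimated by sampling several answers per problem and averaging the per-problem success rate. The sketch below is a minimal illustration of that metric, not code from the paper; the `sample_answers` and `is_correct` callables are assumed stand-ins for a model call and an answer checker.

```python
from typing import Callable, Sequence

def estimate_pass_at_1(problems: Sequence[str],
                       sample_answers: Callable[[str, int], list],
                       is_correct: Callable[[str, str], bool],
                       k: int = 16) -> float:
    """Unbiased pass@1 estimate: average per-problem success rate over k samples."""
    rates = []
    for problem in problems:
        answers = sample_answers(problem, k)                    # k independent model samples
        rates.append(sum(is_correct(problem, a) for a in answers) / k)
    return sum(rates) / len(rates)
```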
Revealed: How Did OpenAI Develop Its Reasoning Models?
Hua Er Jie Jian Wen· 2025-08-04 07:02
Core Insights
- OpenAI's path toward general AI agents began unexpectedly with a focus on mathematics, which laid the groundwork for its reasoning capabilities [2][3].
- The success of ChatGPT was a surprising outcome of this initially low-profile foundational work, which ultimately attracted significant consumer interest [2][3].
- CEO Sam Altman envisions a future in which users simply state their needs and AI autonomously completes the task, highlighting the potential benefits of AI agents [3].

Group 1: Mathematical Foundations
- The early focus on mathematics mattered because math serves as a testbed for logical reasoning: a model that can solve hard math problems demonstrates foundational reasoning ability [2][3].
- OpenAI's model recently won a gold medal at the International Mathematical Olympiad (IMO), showing the effectiveness of reasoning capabilities developed through mathematical challenges [3].

Group 2: Breakthrough Innovations
- In 2023, OpenAI achieved a significant leap in reasoning through the effort known as "Strawberry," which combined large language models, reinforcement learning, and test-time computation [4][5].
- This combination led to the "Chain-of-Thought" method, in which models lay out their reasoning process rather than returning only a final answer (a hedged prompting sketch follows this summary) [6].

Group 3: Nature of AI Reasoning
- OpenAI researchers take a pragmatic view of AI reasoning, judging models by how effectively they complete complex tasks rather than by whether they follow human-like reasoning processes [7].
- A bottom-up research culture that prioritizes breakthrough ideas over short-term product gains enabled significant investment in reasoning models [7].

Group 4: Future Directions
- Current AI agents handle well-defined tasks but struggle with more subjective ones, pointing to a need for better training methods in those areas [8].
- OpenAI is exploring new universal reinforcement learning techniques that let models learn skills that are difficult to verify, as demonstrated by its IMO gold-medal model [8].

Group 5: Competitive Landscape
- Once the clear industry leader, OpenAI now faces strong competition from Google, Anthropic, xAI, and Meta, raising questions about whether it can stay ahead in the race toward advanced AI agents [9].
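The "Chain-of-Thought plus test-time computation" recipe described above can be approximated at the prompting level by asking for step-by-step reasoning and aggregating several sampled answers (self-consistency voting). This is a hedged sketch under that assumption, not OpenAI's internal method; `generate` and `extract_answer` are placeholder callables for a model call and an answer parser.

```python
from collections import Counter
from typing import Callable

def chain_of_thought_vote(question: str,
                          generate: Callable[[str], str],
                          extract_answer: Callable[[str], str],
                          n_samples: int = 8) -> str:
    """Sample several step-by-step solutions and return the most common final answer."""
    prompt = f"{question}\n\nThink step by step, then state the final answer on the last line."
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```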
OpenAI Researcher Noam Brown: Mid-training Is the New Pre-training
海外独角兽· 2025-07-02 11:03
Core Insights
- The article discusses the emergence of reasoning capabilities in AI models, highlighting a shift from mere pattern matching to complex cognitive reasoning, which is essential for scientific discovery and decision-making [4][5].

Group 1: Reasoning as an Emergent Capability
- Reasoning is an emergent ability: models benefit from it only once pre-training reaches a certain level [5][11].
- The "fast thinking versus slow thinking" analogy explains the relationship between non-reasoning and reasoning models: the former corresponds to intuitive responses, the latter to deliberate reasoning [8][11].
- Performance on multi-modal tasks depends on the ability to integrate complex information with logical reasoning [12][13].

Group 2: Need for a Universal Reasoning Paradigm
- Achieving superintelligence requires a universal reasoning paradigm; merely scaling pre-training is insufficient [20][21].
- OpenAI's leadership recognized the need to shift toward reasoning paradigms and reinforcement learning, and allocated significant resources to these areas [21][24].

Group 3: Efficient Data Utilization through Reinforcement Learning
- Reinforcement learning can raise data efficiency, which matters as data becomes scarcer than computational power [25].
- Current machine-learning models need far more samples than humans to learn new concepts, underscoring the need for better sample efficiency [25][26].

Group 4: Non-Consensus Views on Reasoning Ability
- Reasoning is not limited to tasks with clear reward functions; it can also excel in subjective fields where results are harder to quantify [33].
- Aligning AI with user preferences is critical, and reasoning capabilities can help achieve this alignment while mitigating ethical risks [34][35].

Group 5: Bottlenecks in Test-Time Compute Development
- Test-time compute faces cost limits similar to those hit when scaling pre-training, where larger models drive exponentially rising costs (a back-of-the-envelope cost sketch follows this summary) [36].
- Hard limits on how long a model can take to respond slow experimental iteration and reduce research efficiency [37][38].

Group 6: Mid-Training as a New Pre-Training Phase
- Mid-training is a phase that adds new capabilities to a model before pre-training is complete, improving generalization and practicality [40][41].
- OpenAI has adopted mid-training strategies in its model training pipeline to improve alignment and safety [41][42].

Group 7: Insights from The Bitter Lesson for Multi-Agent Systems
- Long-term collaboration and competition among AI agents in multi-agent systems may give rise to an "AI civilization" [44].
- Noam's team is pursuing a principled research path that contrasts with traditional heuristic-based approaches to multi-agent research [45][46].
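To make the test-time-compute cost bottleneck concrete, here is a back-of-the-envelope sketch. All numbers (problem count, samples, tokens per sample, price per token) are made-up placeholders, not figures from the article.

```python
def test_time_cost_usd(n_problems: int, samples_per_problem: int,
                       tokens_per_sample: int, usd_per_million_tokens: float) -> float:
    """Rough dollar cost of one evaluation sweep that scales up test-time compute."""
    total_tokens = n_problems * samples_per_problem * tokens_per_sample
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Placeholder numbers only: 500 problems, 64 samples each, 8,000 reasoning tokens per sample,
# at a hypothetical $10 per million generated tokens, comes to $2,560 for a single sweep.
print(test_time_cost_usd(500, 64, 8_000, 10.0))
```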
Everything About the Models Made Public, Outperforming DeepSeek-R1: NVIDIA Open-Sources the Llama-Nemotron Family
机器之心· 2025-05-06 08:04
Core Viewpoint
- With the rapid development of large models, reasoning ability has become a key indicator of model intelligence, and inference efficiency has become a critical limiting factor for model deployment and performance [2][3].

Group 1: Model Overview
- NVIDIA has launched the Llama-Nemotron series, an open family of large models built for efficient reasoning, combining strong inference capability with an enterprise-friendly open license [3][5].
- The series spans three sizes, Nano (8B), Super (49B), and Ultra (253B), plus an independent long-context variant, UltraLong (8B) [4][5].
- These are the first open-source models to support dynamic inference switching, letting users toggle between a standard chat mode and a reasoning mode for more flexible interaction [6].

Group 2: Model Training and Optimization
- The Llama-Nemotron models go through a multi-stage post-training process, using supervised fine-tuning and reinforcement learning to improve both reasoning and non-reasoning tasks [9].
- The Puzzle framework performs efficient inference optimization, transforming large language models into hardware-efficient variants while preserving performance [12][15].
- LN-Super and LN-Ultra deliver large throughput gains; LN-Super achieves a 5x increase in inference throughput over Llama 3.3-70B-Instruct [19].

Group 3: Performance Metrics
- LN-Ultra posts strong results on key benchmarks, scoring 88.1 on MMLU and 80.4 on MATH500, surpassing its predecessors [25][24].
- The models are designed to meet specific deployment constraints, such as supporting up to 3 million cached tokens in FP8 precision for LN-Ultra [21].

Group 4: Reinforcement Learning and Instruction Following
- A "detailed thinking on/off" instruction mechanism gives flexible control over reasoning depth and response style, improving user interaction (a hedged sketch of the toggle follows this summary) [27].
- Large-scale reinforcement learning allows LN-Ultra to exceed the capabilities of its teacher model [31][39].
- Training LN-Ultra consumed roughly 140,000 H100 GPU hours, focused on optimizing reasoning and instruction-following ability [32][41].
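A minimal sketch of how the "detailed thinking on/off" toggle is typically driven through the system prompt, based on the released model cards. The exact wording and chat template may differ for a given deployment, so treat this as an assumption to verify against the checkpoint you use.

```python
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat request that toggles Llama-Nemotron-style reasoning via the system prompt."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning mode for a hard math question, standard chat mode for a quick summary.
math_messages = build_messages("Prove that the square root of 2 is irrational.", reasoning=True)
chat_messages = build_messages("Summarize this paragraph in one sentence.", reasoning=False)
```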
Accumulating Insights from Papers for Reproducing R1
理想TOP2· 2025-04-30 13:04
Core Viewpoint
- The article surveys advances in reinforcement learning (RL) techniques for large language models (LLMs), emphasizing that better algorithms, reward design, and training strategies are needed to improve reasoning capability and model performance.

Group 1: Algorithm Improvements
- Current algorithms leave substantial room for improvement. Dr. GRPO addresses GRPO's response-length bias and problem-difficulty bias, yielding better token efficiency and reasoning performance (a minimal sketch of the group-relative advantage follows this summary) [3][4].
- DAPO targets entropy collapse and sample-efficiency issues in GRPO and PPO, improving training stability and efficiency through techniques such as Clip-Higher and dynamic sampling [6].

Group 2: Training Strategies
- Larger training batch sizes (e.g., TBS = 1024) improve training efficiency and stability, and on-policy strategies are better suited than off-policy ones for model exploration [6].
- More rollouts per prompt (e.g., n = 64) improve training outcomes and encourage longer responses; a dynamic annealing schedule for the KL penalty is recommended to balance exploration and stability [6].

Group 3: Reward Design
- Flaws in early reward design led to various reward-hacking behaviors; a refined reward system that combines format rewards and answer rewards is needed to constrain model behavior and prevent cheating [6].
- The relationship between response length and reasoning ability is not causal: longer responses may give more room to explore but do not directly improve reasoning performance [6].

Group 4: Generalization and Learning
- RL promotes cross-task generalization more effectively than supervised fine-tuning (SFT), suggesting that reasoning can be a general capability elicited by specific tasks [7][9].
- Combining rule-based rewards with reward-model-based rewards is beneficial, especially for tasks without a single clear answer, improving learning and mitigating reward hacking [9].
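A minimal sketch of the group-relative advantage that GRPO-family methods compute for one prompt's rollouts, with a flag contrasting the original std-normalized form against the mean-only baseline that Dr. GRPO argues for. Clipping, the KL penalty, and per-token length handling are omitted, and names are illustrative rather than taken from any specific implementation.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, std_normalize: bool = True) -> np.ndarray:
    """Group-relative advantages for one prompt's sampled responses.

    rewards: shape (n_rollouts,), one scalar reward per sampled response.
    GRPO divides by the group's std; the Dr. GRPO variant drops that division
    (and the per-token length normalization in the loss) to avoid biasing updates
    toward low-variance (too easy or too hard) prompts and long outputs.
    """
    advantages = rewards - rewards.mean()           # mean-only baseline
    if std_normalize:                               # original GRPO-style scaling
        advantages = advantages / (rewards.std() + 1e-8)
    return advantages

# Example: 8 rollouts for one prompt, rule-based reward (format bonus + answer correctness).
rewards = np.array([1.1, 0.1, 1.1, 1.0, 0.0, 0.1, 1.0, 0.0])
print(group_advantages(rewards))                        # GRPO-style, std-normalized
print(group_advantages(rewards, std_normalize=False))   # Dr. GRPO-style, mean-only
```

Dropping the std division removes the difficulty bias, since prompts where nearly all rollouts succeed or fail otherwise have their advantages inflated; the length-normalization fix lives in the token-level loss and is outside this scalar sketch.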
Key Brain Regions Influencing Reasoning Ability Identified
Ke Ji Ri Bao· 2025-04-20 23:51
Core Findings
- Researchers from University College London identified key brain regions essential for logical thinking and problem-solving, enhancing understanding of human reasoning capabilities [1].

Group 1: Research Methodology
- The study used "lesion-deficit mapping," an effective method for localizing brain functions, covering 247 patients with unilateral focal brain damage in the left or right frontal lobes alongside a control group of 81 healthy individuals [1].

Group 2: Testing and Results
- Two new tests were developed to assess reasoning ability: a verbal analogy reasoning task and a non-verbal deductive reasoning task. Patients with right frontal lobe damage performed significantly worse, making approximately 15% more errors than other patients and healthy individuals [2].
- The right frontal brain network involved in reasoning is closely related to the network critical for fluid intelligence, suggesting a shared brain region plays a key role in both reasoning and fluid intelligence [2].
GPT-5 Takes Shape; OpenAI's and Manus's Experience Building Agents; Large Chinese Companies Expand Compute Investment | AI Monthly Report
晚点LatePost· 2025-03-08 12:17
Key global AI trends for February 2025.
By He Qianming

In the February 2025 AI monthly report, you will find:
- Silicon Valley giants' new consensus: reasoning capability is part of the large model
- OpenAI's and Manus's experience building agents
- DeepSeek pushes large Chinese companies to increase compute spending; Alibaba and ByteDance together will exceed 200 billion this year
- Three AI companies sold for more than 100 million and 23 AI companies that raised over $50 million
- OpenAI hires experts at $100 an hour to produce data that improves model capabilities

Starting with this issue, we invite researchers, founders, and investors to contribute first-hand commentary and insight on each month's AI trends and landmark events. The LatePost AI monthly report picks out the AI signals most worth knowing each month. Below is our fourth AI monthly report; feel free to add important trends we missed in the comments.

Technology | An early form of GPT-5 emerges, and a new industry consensus takes shape
The shockwave from DeepSeek continues to spread, and the world's large-model companies are locked in a melee: whether it is Grok 3, which Musk trained on more than 100,000 GPUs, GPT-4.5, on which OpenAI may have spent $1 billion in training, or Anthropic's latest reasoning-integrated model, Claude 3 ...