Reasoning Ability
Everything About the Models Made Public, Outperforming DeepSeek-R1: NVIDIA Open-Sources the Llama-Nemotron Family
机器之心· 2025-05-06 08:04
Core Viewpoint
- The rapid development of large models has made reasoning ability a key indicator of model intelligence, with inference efficiency becoming a critical limiting factor for model deployment and performance [2][3].

Group 1: Model Overview
- NVIDIA has launched the Llama-Nemotron series, an open family of large models designed for efficient reasoning, featuring strong inference capabilities and an enterprise-friendly open license [3][5].
- The series includes three model sizes: Nano (8B), Super (49B), and Ultra (253B), along with an independent long-context variant, UltraLong (8B) [4][5].
- The models are the first open-source models to support dynamic inference switching, allowing users to toggle between standard chat mode and reasoning mode at inference time [6] (a minimal sketch of the toggle follows this summary).

Group 2: Model Training and Optimization
- The Llama-Nemotron models use a multi-stage post-training process combining supervised fine-tuning and reinforcement learning to improve performance on both reasoning and non-reasoning tasks [9].
- The Puzzle framework is used for efficient inference optimization, transforming large language models into hardware-efficient variants while maintaining performance [12][15].
- LN-Super and LN-Ultra achieve significant throughput improvements, with LN-Super showing a 5x increase in inference throughput over Llama 3.3-70B-Instruct [19].

Group 3: Performance Metrics
- LN-Ultra delivers superior results on key benchmarks, scoring 88.1 on MMLU and 80.4 on MATH500, surpassing its predecessors [25][24].
- The models are designed to meet specific deployment constraints; LN-Ultra, for example, supports up to 3 million cached tokens at FP8 precision [21].

Group 4: Reinforcement Learning and Instruction Following
- A "detailed thinking on/off" instruction mechanism gives users control over reasoning depth and response style, improving interaction flexibility [27].
- Large-scale reinforcement learning further improves LN-Ultra, allowing it to exceed the capabilities of its teacher model [31][39].
- Training LN-Ultra took approximately 140,000 H100 GPU hours, focused on optimizing reasoning capability and instruction following [32][41].
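To make the chat/reasoning toggle concrete, here is a minimal client-side sketch. It assumes an OpenAI-compatible endpoint serving a Llama-Nemotron checkpoint; the base URL and model name are placeholders, and the "detailed thinking on/off" system-prompt strings follow the published Nemotron model cards, while everything else here is an illustrative assumption rather than NVIDIA's reference usage.

```python
# Minimal sketch of Llama-Nemotron's chat/reasoning toggle, assuming an
# OpenAI-compatible endpoint (e.g. a local vLLM server); the base URL and
# model name below are placeholders, not official values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(question: str, reasoning: bool) -> str:
    # The model cards document a system-prompt switch: "detailed thinking on"
    # enables reasoning mode, "detailed thinking off" gives standard chat.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/Llama-Nemotron-Nano-8B",  # placeholder checkpoint name
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Reasoning mode typically emits an explicit thinking trace before the final
# answer; chat mode answers directly.
print(ask("How many primes are there below 30?", reasoning=True))
print(ask("How many primes are there below 30?", reasoning=False))
```

Because the switch lives in the system prompt rather than in a separate checkpoint, a single deployment can serve both latency-sensitive chat traffic and slower reasoning-heavy requests.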
Accumulating Insights from Papers for Reproducing R1
理想TOP2· 2025-04-30 13:04
Core Viewpoint
- The article distills insights from recent papers on reinforcement learning (RL) for large language models (LLMs), emphasizing the need for better algorithms, reward design, and training strategies to improve reasoning capabilities and model performance.

Group 1: Algorithm Improvements
- Current algorithms still have significant room for improvement. Dr. GRPO fixes GRPO's response-length bias and problem-difficulty bias, yielding better token efficiency and reasoning performance [3][4] (a minimal sketch contrasting the two advantage computations follows this summary).
- DAPO addresses entropy collapse and sample-efficiency issues in GRPO and PPO, improving training stability and efficiency through techniques such as Clip-Higher and dynamic sampling [6].

Group 2: Training Strategies
- Larger training batch sizes (e.g., TBS = 1024) improve training efficiency and stability, and on-policy strategies are more conducive to model exploration than off-policy ones [6].
- More rollouts per prompt (e.g., n = 64) improve training outcomes and encourage longer responses; a dynamically annealed KL penalty is recommended to balance exploration and stability [6].

Group 3: Reward Design
- Early reward-design flaws led to various reward-hacking behaviors, so a refined reward system combining format rewards and answer rewards is needed to constrain model behavior and prevent cheating [6] (a sketch of such a rule-based reward also follows this summary).
- The link between response length and reasoning ability is correlational, not causal: longer responses give more room for exploration but do not by themselves improve reasoning performance [6].

Group 4: Generalization and Learning
- RL promotes cross-task generalization more effectively than supervised fine-tuning (SFT), suggesting reasoning can be a general capability elicited by specific tasks [7][9].
- Combining rule-based rewards with reward-model-based rewards is beneficial, especially on tasks without clear-cut answers, improving learning and mitigating reward hacking [9].
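The sketch below contrasts the GRPO and Dr. GRPO advantage computations for a group of rollouts with scalar rewards. The function names and the constant normalizer are illustrative assumptions, not code from either paper; the point is only where the two biases enter and how Dr. GRPO removes them.

```python
# Minimal sketch contrasting GRPO and Dr. GRPO advantage/normalization,
# assuming scalar rewards for a group of sampled responses to one prompt.
# Names are illustrative; this is not code from either paper.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO standardizes rewards within the group; the std term introduces
    # a difficulty bias (prompts with low reward variance get their
    # advantages inflated).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # Dr. GRPO drops the std normalization (and, in the loss, the
    # per-response length normalization), removing both biases.
    return rewards - rewards.mean()

def token_loss_weight(response_len: int, grpo: bool) -> float:
    # GRPO divides each response's token losses by its own length, which
    # penalizes long wrong answers less per token and so encourages length
    # growth; Dr. GRPO uses a constant normalizer shared across the group.
    MAX_LEN = 4096  # illustrative constant normalizer
    return 1.0 / response_len if grpo else 1.0 / MAX_LEN

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g., binary answer rewards, n = 4
print(grpo_advantages(rewards))     # [ 1. -1. -1.  1.]
print(dr_grpo_advantages(rewards))  # [ 0.5 -0.5 -0.5  0.5]
```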
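And as a hedged illustration of the format-plus-answer reward design mentioned above, here is a rule-based reward of the kind commonly used in R1 reproductions. The tag scheme, the exact-match grading, and the 0.2/0.8 weighting are all illustrative assumptions, not a specific paper's implementation.

```python
# Minimal sketch of a rule-based reward combining a format reward and an
# answer reward; the tag scheme and weights are illustrative assumptions.
import re

def format_reward(completion: str) -> float:
    # Reward responses that wrap their reasoning in <think>...</think>
    # followed by a final answer, constraining the output format.
    pattern = r"^<think>.*?</think>\s*.+$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def answer_reward(completion: str, ground_truth: str) -> float:
    # Reward only exact final-answer matches, so the model cannot game the
    # grader with verbose or partially correct output.
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Weighting correctness above formatting is an assumption here.
    return (0.2 * format_reward(completion)
            + 0.8 * answer_reward(completion, ground_truth))

demo = "<think>17 times 24 is 408.</think>\n408"
print(total_reward(demo, "408"))  # 1.0
```

Keeping both terms matters: an answer-only reward invites unparseable outputs, while a format-only reward invites well-formed nonsense; combining them is one simple way to close off those hacking routes.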
Key Brain Regions Affecting Reasoning Ability Identified
Ke Ji Ri Bao· 2025-04-20 23:51
Core Findings
- Researchers from University College London identified key brain regions essential for logical thinking and problem-solving, deepening our understanding of human reasoning [1].

Group 1: Research Methodology
- The study used "lesion-deficit mapping," an effective method for localizing brain function, covering 247 patients with unilateral focal damage to the left or right frontal lobe alongside a control group of 81 healthy individuals [1].

Group 2: Testing and Results
- Two new tests were developed to assess reasoning ability: a verbal analogical reasoning task and a non-verbal deductive reasoning task. Patients with right frontal lobe damage performed significantly worse, making roughly 15% more errors than other patients and healthy controls [2].
- The right frontal network involved in reasoning closely overlaps the network critical for fluid intelligence, suggesting a shared set of brain regions plays a key role in both [2].
GPT-5 Takes Shape; OpenAI's and Manus's Experience Building Agents; Big Chinese Companies Expand Compute Investment | AI Monthly Report
晚点LatePost· 2025-03-08 12:17
Key global AI trends for February 2025.

By 贺乾明

In the February 2025 AI monthly report, you will see:

- Silicon Valley giants' new consensus: reasoning ability is part of the large model itself
- OpenAI's and Manus's experience building Agents
- DeepSeek pushes large Chinese companies to increase compute spending; Alibaba and ByteDance combined will exceed 200 billion yuan this year alone
- 3 AI companies that sold for over 100 million yuan each, and 23 AI companies that raised more than $50 million
- OpenAI hires experts at $100 an hour to produce data that improves model capabilities

Starting with this issue, we invite researchers, founders, and investors to contribute first-hand commentary and insight on each month's AI trends and landmark events. The LatePost AI monthly report selects the AI signals most worth your attention each month. This is our 4th AI monthly report; feel free to add important trends we missed in the comments.

Technology | GPT-5 takes shape, and a new industry consensus emerges

The shockwave from DeepSeek continues to spread, and large-model companies worldwide are locked in a melee: whether it is Grok 3, which Musk trained on more than 100,000 GPUs; GPT-4.5, on which OpenAI may have spent $1 billion in training; or Anthropic's latest model that blends in reasoning capability, Claude 3 ...