Reinforcement Learning
Challenging reinforcement learning's post-training dominance! A new unsupervised method needs just 1 example and 10 optimization steps
量子位· 2025-06-01 03:40
Contributed by the Ubiquant team; 量子位 | 公众号 QbitAI. No annotated data, no elaborate reward design, and results within just 10 steps: "entropy minimization" may be a better fit than reinforcement learning for rapidly upgrading large language models. Reinforcement learning (RL) has achieved great success in fine-tuning large language models (LLMs) in recent years, but high data-annotation costs, complex reward design, and long training cycles have become bottlenecks to RL's further adoption. The Ubiquant research team proposes an extremely simple yet effective unsupervised method, one-shot entropy minimization (EM): using only a single unlabeled example, it markedly improves LLM performance within 10 training steps, even surpassing RL methods that use thousands of examples. I. From RL to EM: the fine-tuning dilemma and a new approach for LLMs. Today's large language models, pre-trained on massive data, exhibit astonishing general capabilities. To reach top-tier performance on specific, complex reasoning tasks (such as mathematics, physics, or programming), however, the mainstream post-training approach is reinforcement learning, in particular reinforcement learning with verifiable rewards (RLVR). Although RL-based fine-tuning has made notable progress in raising model performance, the process suffers a series of clear drawbacks that make it costly and cumbersome. By contrast, entropy minimization (EM) offers ...
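The entropy-minimization objective described above is simple enough to sketch directly. The following is a minimal pure-Python illustration of the idea; a real implementation would operate on framework tensors over a full vocabulary, and the function names and toy shapes here are ours, not the paper's:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_entropy(logits):
    """Shannon entropy H = -sum_v p(v) * log p(v) of one next-token distribution."""
    p = softmax(logits)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def em_loss(per_token_logits, response_mask):
    """Entropy-minimization loss: mean entropy over the model's own response
    tokens (prompt positions masked out). Minimizing it sharpens the model's
    predictive distribution using no labels and no reward signal."""
    weighted = [token_entropy(l) * m for l, m in zip(per_token_logits, response_mask)]
    denom = sum(response_mask)
    return sum(weighted) / denom if denom else 0.0
```

Because the loss depends only on the model's own output distribution, a single unlabeled prompt suffices to compute it, which is what makes the one-shot, ten-step setting possible.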
Witnessing history! DeepSeek climbs to world's No. 2 AI lab, R1 tops the open-source throne, and the whole internet clamors for R2
程序员的那些事· 2025-06-01 02:04
Core Viewpoint
- DeepSeek has officially announced the completion of the R1-0528 upgrade, which significantly enhances model performance, making R1 a leading open-source AI model and lifting DeepSeek to second place among AI laboratories globally [1][9][46].

Performance Enhancements
- The upgraded DeepSeek-R1-0528 model exhibits performance comparable to top models like o3 and Gemini 2.5 Pro in various benchmark tests, particularly in mathematics, programming, and general logic [2][15].
- The model's accuracy in complex reasoning tasks has improved significantly, with AIME 2025 test accuracy rising from 70% to 87.5% [16].
- In benchmark tests, DeepSeek-R1-0528 achieved notable scores, such as 91.4% in AIME 2024 and 87.5% in AIME 2025 [17].

Reduction in Hallucination Rate
- The hallucination rate of DeepSeek-R1-0528 has been reduced by 45%-50% compared to its predecessor, addressing previous concerns about high hallucination rates [20][24].
- This improvement allows the model to provide more accurate and reliable results in tasks such as summarization and reading comprehension [25][26].

Enhanced Functionality
- DeepSeek-R1-0528 supports tool calls, enabling it to summarize articles by fetching content from links, and achieves competitive scores on Tau-Bench [31].
- The model's front-end code generation capabilities have been enhanced, allowing for the rapid creation of applications with comprehensive features [33].

Distillation of Qwen3-8B
- Alongside the R1 upgrade, DeepSeek has distilled the R1-0528 model's reasoning chains into a new version, DeepSeek-R1-0528-Qwen3-8B, which shows strong performance in mathematical tests, surpassing the base Qwen3-8B [6][37].
- Despite having significantly fewer parameters, the distilled Qwen3-8B model demonstrates competitive performance, indicating the effectiveness of the distillation process [38].
Industry Positioning
- Following the R1 upgrade, DeepSeek has been recognized as the second-ranked AI laboratory globally, surpassing competitors like xAI, Meta, and Anthropic [44][46].
- The model's intelligence index score has increased from 60 to 68, reflecting an advance comparable in size to OpenAI's recent improvements [46][47].
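Distilling a reasoning chain in this way reduces, at the data level, to ordinary supervised fine-tuning on teacher traces. A hypothetical sketch of how one such training example might be packed; the `<think>` delimiters mirror R1's public output format, while the function and field names are our own:

```python
def build_distillation_example(prompt, teacher_reasoning, teacher_answer):
    """Pack one teacher trace into a plain SFT example. The student model is
    then trained with ordinary next-token cross-entropy on `target`, so it
    imitates the larger model's full reasoning chain rather than only its
    final answer."""
    target = f"<think>{teacher_reasoning}</think>\n{teacher_answer}"
    return {"input": prompt, "target": target}
```

This is why a much smaller student can close part of the gap: the supervision signal carries the teacher's intermediate steps, not just the final labels.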
From scorer to thinker: RM-R1 reshapes model value judgment with reasoning
机器之心· 2025-05-31 04:00
"Knowing that it is so, and also knowing why it is so." This Confucian maxim holds that true understanding lies not only in the result but in the reasoning behind it. In the post-training stage of large language models today, reward models carry the important responsibility of bridging model behavior and human values; yet existing models typically output only a score, with little ability to explain its basis. A reward without reasoning "knows that it is so but not why": it is hard to trust and hard to use as guidance for better learning.

A research team at the University of Illinois Urbana-Champaign proposes the RM-R1 framework, which reframes reward modeling as a reasoning task and introduces Reasoning Reward Models (ReasRMs). RM-R1 focuses on integrating reasoning capabilities into the reward model so that it can evaluate and score model outputs more accurately and thus align better with human preferences. By generating structured evaluation criteria and explicit reasoning traces, RM-R1 improves both the interpretability and the performance of the reward model.

The paper validates three core findings:
1. Scale brings gains: as the model grows and compute increases, RM-R1's reasoning-chain training becomes more effective, with performance improving almost linearly.
2. Naively reusing old RL recipes does not work: to make the model "reason well", one must precisely categorize problem types and apply targeted distillation of the reasoning process; only then do the gains genuinely generalize.
3. Reasoning is more general than directly emitting an answer: compared with conventional direct supervision, RM-R1's reasoning ability is more robust ...
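As a concrete illustration of the "reason first, then judge" recipe: a generative reward model can be prompted to write criteria and step-by-step reasoning before a final machine-readable verdict, which is then parsed from the text rather than taken from a raw scalar head. The prompt wording and the `[[A]]`/`[[B]]` verdict format below are illustrative assumptions, not RM-R1's actual template:

```python
import re

JUDGE_PROMPT = (
    "You are an impartial judge. First write evaluation criteria, then reason "
    "step by step about each candidate answer against those criteria, and "
    "finish with one line of the form: Verdict: [[A]] or Verdict: [[B]]."
)

def parse_verdict(judgement: str):
    """Extract the preferred candidate from a reasoning trace.
    Returns 'A' or 'B', or None when no well-formed verdict is present,
    so malformed judgements can be resampled instead of silently scored."""
    m = re.search(r"Verdict:\s*\[\[([AB])\]\]", judgement)
    return m.group(1) if m else None
```

The interpretability gain is structural: the score arrives attached to the criteria and reasoning that produced it, which a plain scalar reward head cannot provide.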
Stanford accidentally used AI to generate super-strong CUDA kernels, optimized even better than human experts! Doubling native PyTorch performance, with Chinese researchers as lead authors
量子位· 2025-05-31 03:34
Core Insights
- AI-generated kernels unexpectedly outperform those optimized by human experts, showing significant performance improvements in deep learning operations [1][2][4]

Performance Metrics
- AI-optimized kernels achieved up to 400% performance improvement over native PyTorch in common deep learning operations [2]
- Specific metrics (throughput relative to the PyTorch reference):
  - Matrix multiplication (Matmul, FP32): 101.3% of torch.matmul
  - 2D convolution (Conv2D): 179.9% of torch.nn.Conv2d
  - Softmax: 111.8% of torch.softmax
  - Layer normalization (LayerNorm): 484.4% of torch.nn.LayerNorm
  - Conv2D + ReLU + MaxPool combination: 290.1% of the PyTorch reference implementation and 189.0% of the torch.compile() reference implementation [6]

Research Methodology
- The research team initially aimed to generate synthetic data for training kernel-generation models but discovered that the synthetic-data pipeline itself could produce high-performance kernels [3][40]
- The optimization process inserted a language-reasoning step between iterations, encouraging a more diverse search process [9][10]
- The team employed a multi-branch exploration strategy, evolving multiple implementations from each idea and selecting the best-performing kernel for subsequent rounds [16][19]

Implementation Details
- Kernels were written in pure CUDA-C without relying on libraries like CUTLASS and Triton [13]
- The optimization approach diverged from traditional sequential modification, instead using natural language to generate optimization ideas before translating them into code [14][15]
- The generated kernels used advanced optimizations and hardware features previously considered difficult to implement automatically [41]

Future Prospects
- The research team expressed optimism about future developments, noting that their initial goal of merely generating functional kernels has evolved into achieving significant performance improvements [47][48]
- They highlighted ongoing optimization efforts, particularly in FP16 Matmul and FP16 Flash Attention, with current performance at 52% and 9% of torch.matmul and torch.nn.functional.scaled_dot_product_attention, respectively [46]
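The search procedure described above (natural-language ideas, several implementations per idea, keep the fastest correct kernel) can be sketched as a small driver loop. Everything here is a schematic reconstruction: the three callables stand in for the LLM and profiler calls, and none of the names come from the paper:

```python
def search_kernels(seed_kernel, benchmark, propose_ideas, implement,
                   rounds=5, branches=4):
    """Best-first search over kernel candidates.

    propose_ideas(kernel)   -> list of natural-language idea strings (LLM call)
    implement(kernel, idea) -> candidate kernel source (LLM call)
    benchmark(kernel)       -> runtime, or None if the kernel is incorrect
    """
    best, best_time = seed_kernel, benchmark(seed_kernel)
    for _ in range(rounds):
        candidates = []
        for idea in propose_ideas(best):
            for _ in range(branches):       # multi-branch: several tries per idea
                cand = implement(best, idea)
                t = benchmark(cand)
                if t is not None:           # discard incorrect kernels outright
                    candidates.append((t, cand))
        if candidates:
            t, cand = min(candidates)       # fastest correct candidate this round
            if t < best_time:
                best, best_time = cand, t
    return best, best_time
```

Separating idea generation from code generation is the design point the article emphasizes: reasoning in natural language first keeps the search from collapsing into minor sequential edits of one implementation.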
No GPU needed, and the large model digests one tough college-math problem every 2 seconds! This is Huawei's strength
雷峰网· 2025-05-30 09:48
Core Viewpoint
- Huawei defines the benchmark for domestic large-model training through technological innovation, achieving breakthroughs in computing-power utilization and post-training throughput [1][4].

Group 1: Technological Innovations
- Huawei's "Ascend + Pangu Ultra MoE" combination has unlocked a fully controllable training loop for domestic computing power and models, achieving industry-leading performance in cluster training systems [4][5].
- In the pre-training phase, the Ascend Atlas 800T A2 cluster's model flops utilization (MFU) rose to 41%, while the post-training phase achieved a throughput of 35K tokens/s on a single CloudMatrix 384 super node [5][36].
- Huawei disclosed key technologies in its technical report, highlighting the efficient integration of sparse-MoE reinforcement-learning post-training frameworks [6][7].

Group 2: Challenges in Current Training Processes
- Six main challenges were identified in current MoE pre-training and reinforcement-learning post-training, including difficult parallel-strategy configuration, communication bottlenecks, uneven system load distribution, excessive operator-scheduling overhead, complex training-process management, and limits to large-scale expansion [10][11].

Group 3: Solutions to Enhance Training Efficiency
- Huawei proposed a complete end-to-end solution to these challenges, focusing on raising training-cluster utilization through intelligent parallel-strategy selection, deep integration of computation and communication, and global dynamic load balancing [12][14].
- The first strategy optimized parallel configurations, arriving at a deployment of 16-way pipeline parallelism, 8-way tensor parallelism, and 32-way expert parallelism [15][16].
- The second strategy focused on releasing computing power at the single-node level, doubling the micro-batch size (MBS) and optimizing operator scheduling to fully utilize Ascend node capabilities [20][21].

Group 4: Reinforcement Learning Innovations
- Huawei introduced RL Fusion training-inference co-location technology, which supports flexible deployment modes and doubles cluster utilization in post-training [28][29].
- A semi-asynchronous mechanism, StaleSync, lets different tasks execute in parallel while maintaining model accuracy, yielding a 50% increase in overall training throughput [30].

Group 5: Performance Metrics and Future Prospects
- The Pangu Ultra MoE model, with 718 billion parameters, demonstrated high performance during training, achieving 41% model utilization and 35K tokens/s post-training throughput [35][36].
- The system is designed to support ultra-large-scale clusters and models, with future iterations expected to reach even higher utilization rates [35][36].
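The deployment above (16-way pipeline, 8-way tensor, 32-way expert parallelism) must satisfy basic divisibility constraints before it can run at all. A hypothetical sanity-check helper, using made-up example numbers rather than Pangu's real layer and expert counts; real frameworks also fold in data parallelism and may share the expert dimension with it:

```python
def check_parallel_layout(world_size, pp, tp, ep, n_layers, n_experts):
    """Sanity-check a pipeline/tensor/expert parallel layout.

    world_size: total accelerator count in the cluster
    pp, tp, ep: pipeline-, tensor-, and expert-parallel degrees
    """
    assert world_size % (pp * tp) == 0, "pp*tp must divide the cluster size"
    assert n_layers % pp == 0, "layers must split evenly across pipeline stages"
    assert n_experts % ep == 0, "experts must split evenly across expert groups"
    return {
        "layers_per_stage": n_layers // pp,    # layers hosted by each stage
        "experts_per_group": n_experts // ep,  # experts hosted by each group
        "dp_groups": world_size // (pp * tp),  # replicas available for data parallelism
    }
```

Checks like these are what "intelligent parallel strategy selection" automates: the search space of (pp, tp, ep) tuples is large, and only layouts that balance stages and expert groups evenly avoid the load-imbalance problem the report calls out.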
One tough college-math problem digested every 2 seconds! Huawei finally reveals the full pipeline of its near-trillion-parameter MoE Ascend training system
华尔街见闻· 2025-05-30 09:38
Core Viewpoint
- Huawei has achieved significant advances in training large models through its "Ascend + Pangu Ultra MoE" system, demonstrating a fully domestic, GPU-free training pipeline that improves computational efficiency and model performance [3][4][38].

Group 1: Technical Innovations
- Huawei's training system reached a model flops utilization (MFU) of 41% during the pre-training phase on the Ascend Atlas 800T A2 cluster [4][38].
- The Pangu Ultra MoE model has 718 billion parameters and a distinctive architecture of 61 layers, 58 of them MoE layers, designed for high performance and scalability [38][39].
- The system sustains a throughput of 35K tokens/s during the reinforcement learning (RL) post-training phase, showcasing its capability to process complex tasks rapidly [39].

Group 2: Challenges Addressed
- The report identifies six key challenges in current MoE pre-training and RL post-training, including difficult parallel-strategy configuration, communication bottlenecks, and uneven system load distribution [7][10][12][13].
- Huawei has developed a comprehensive end-to-end solution to these challenges, focusing on optimizing training-cluster utilization and improving communication efficiency [14][16][25].

Group 3: Specific Solutions
- The first strategy improves training-cluster utilization through intelligent parallel-strategy selection and global dynamic load balancing, significantly raising overall training efficiency [16][23].
- The second strategy releases computing power at the single-node level by optimizing training operators and memory management, doubling the micro-batch size [26][30].
- The third strategy introduces high-performance, scalable RL post-training technologies, allowing flexible deployment modes and doubling the utilization of RL post-training clusters [33][34].
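The semi-asynchronous post-training idea, where generation and training overlap rather than running in lockstep, can be illustrated with a bounded-staleness parameter store. This sketch is our own simplification, not the StaleSync mechanism from Huawei's report, which is considerably more elaborate:

```python
class BoundedStalenessStore:
    """Minimal parameter store sketching bounded-staleness reads: rollout
    workers may read weights up to `max_staleness` versions old, letting
    generation and training proceed in parallel instead of blocking on
    every synchronization point."""

    def __init__(self, weights, max_staleness=1):
        self.versions = [weights]          # version history, index = version id
        self.max_staleness = max_staleness

    def publish(self, new_weights):
        """Trainer pushes a new weight version after an optimizer step."""
        self.versions.append(new_weights)

    def read(self, requested_version=None):
        """Rollout worker reads weights; too-stale requests are bumped forward."""
        latest = len(self.versions) - 1
        v = latest if requested_version is None else requested_version
        if latest - v > self.max_staleness:
            v = latest - self.max_staleness  # force a refresh when too stale
        return v, self.versions[v]
```

Capping staleness is the accuracy-preserving half of the trade: unbounded asynchrony maximizes overlap but lets rollouts drift arbitrarily far from the training policy.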
Robot dogs can now be your badminton partner! Self-taught from scratch purely through reinforcement learning, with emergent human-like repositioning behavior | Science family journal
量子位· 2025-05-30 07:10
Heng Yu, reporting from Aofeisi. 量子位 | 公众号 QbitAI. Want to work out with a robot dog? Your badminton partner has arrived! With no human assistance, relying solely on reinforcement learning, the robot dog has learned to rally at badminton, like this: outdoors, indoors, no problem either way. Based on reinforcement learning, the researchers developed a whole-body visuomotor control policy for the robot dog, simultaneously controlling leg locomotion and arm swing motions (18 degrees of freedom in total). The resulting performance is impressive: the dog's peak racket-swing speed reaches 12 m/s. In cooperative rallies with human players, one exchange sustained 10 consecutive hits, and the robot even exhibited emergent human-like behaviors such as returning to center after a stroke. The study ran extensive experiments across varied environments, validating the quadruped robot's ability to predict shuttlecock trajectories, navigate the service area effectively, and strike most precisely in play with human partners, demonstrating the feasibility of legged mobile robots in complex and dynamic sports scenarios. The team behind the research is from ETH Zurich. The paper has just been published in Science Robotics, a journal in the Science family. ... then generates key commands to control the quadruped base.

Human-like behavior emerges in badminton "battles". What is the configuration of a robot dog that has learned badminton? Public specs: the body consists of a quadruped ANYmal-D base and a dynamic DynaArm arm. It is equipped with a global-shutter ZED X stereo camera for ...
Costs plunge 88%! Tongyi Lab and Peking University release ZeroSearch, activating LLMs' retrieval capability with no search at all
机器之心· 2025-05-29 04:53
Core Insights
- The article introduces the ZeroSearch framework, which enables large language models (LLMs) to activate their search capabilities without relying on real search engines, cutting training costs by 88% while outperforming methods that depend on actual search engines [1][21].

Methodology
- ZeroSearch employs a reinforcement learning (RL) framework that uses a simulation LLM as the search engine, eliminating real-time API interactions and thus lowering training costs [4][6].
- The framework uses a structured training template that guides the model through each interaction, improving the clarity and interpretability of the reasoning process [8].
- A loss-masking technique prevents the policy model from memorizing documents generated by the simulation LLM, ensuring that only tokens generated by the policy model enter the loss calculation [4][8].

Training Strategy
- Training begins at low difficulty, letting the model learn basic output formats and task logic before the challenge escalates rapidly to strengthen reasoning [22][36].
- A curriculum learning strategy progressively lowers the quality of generated documents to stimulate the model's reasoning ability effectively [13][36].

Experimental Results
- ZeroSearch delivers superior performance across datasets, achieving an average score of 40.93 on multi-hop question-answering tasks and surpassing all baseline methods [20][21].
- The framework generalizes robustly, with performance improving as model parameters increase, indicating strong scalability [23][27].
- Compared with real search engines, ZeroSearch shows significant potential to replace them in large-scale RL applications, demonstrating its effectiveness in enhancing search capabilities [21][24].
Conclusion - The ZeroSearch framework effectively activates the search capabilities of LLMs without the need for real search engines, demonstrating strong adaptability and scalability across different RL algorithms [36].
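The loss-masking step is the load-bearing detail of the method: gradients must flow only through tokens the policy itself generated, never through the documents injected by the simulation LLM. A minimal sketch, with role labels and function names of our own invention:

```python
def loss_mask(token_roles):
    """Build a per-token loss mask for ZeroSearch-style training: tokens the
    policy model generated (reasoning, queries, answers) receive gradient;
    tokens inserted by the simulation LLM (the retrieved 'documents') are
    masked out so the policy is never trained to memorize them.

    token_roles: list of 'policy' or 'sim' labels, one per token."""
    return [1.0 if role == "policy" else 0.0 for role in token_roles]

def masked_mean(per_token_losses, mask):
    """Average per-token losses over unmasked positions only."""
    total = sum(l * m for l, m in zip(per_token_losses, mask))
    denom = sum(mask)
    return total / denom if denom else 0.0
```

Without the mask, the simulated documents would act as free training targets, and the policy could drift toward reproducing the simulator's text instead of learning to reason over it.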
Claude 4 core team interview: improving agents' ability to work independently; strengthening models' long-horizon task capability is key
Founder Park· 2025-05-28 13:13
Core Insights
- The main change expected in 2025 is the effective application of reinforcement learning (RL) in language models, particularly through verifiable rewards, leading to expert-level performance in competitive programming and mathematics [4][6][7].

Group 1: Reinforcement Learning and Model Development
- Reinforcement learning has activated knowledge already present in models, allowing them to organize solutions rather than learn from scratch [4][11].
- The introduction of Opus 4 has significantly improved context management for multi-step actions and long-horizon tasks, enabling models to carry out meaningful reasoning and execution over extended periods without frequent user intervention [4][32].
- The current industry trend prioritizes computational power over data and human feedback, which may shift as models become more capable of learning in real-world environments [4][21].

Group 2: Future of AI Agents
- Automating intellectual tasks with AI agents could reshape the global economy and labor market, with predictions of "plug-and-play" white-collar AI employees emerging within the next two years [7][9].
- Interaction frequency between users and models is expected to shift from seconds and minutes to hours, letting users manage multiple models simultaneously, akin to a "fleet management" approach [34][36].
- Development of AI agents that complete tasks independently is expected to accelerate, with models handling several hours of work autonomously by the end of the year [36][37].

Group 3: Model Capabilities and Limitations
- Current models still lack self-awareness in the philosophical sense, though they exhibit a form of metacognition by expressing uncertainty about their answers [39][40].
- Models can simulate self-awareness but possess no continuous identity or memory unless explicitly given external memory systems [41][42].
- Understanding of model behavior and decision-making is still evolving, with ongoing research into interpretability mechanisms and the identification of features that drive model outputs [46][48].

Group 4: Future Developments and Expectations
- The frequency of model releases is expected to increase significantly, with advances in reinforcement learning driving rapid capability gains [36][38].
- Long-term learning mechanisms, and the ability of models to improve through practical experience, are a key focus of future research [30][29].
- The ultimate goal of interpretability is a clear understanding of how models make decisions, which is crucial for their reliability and safety across applications [46][47].
Three top AI technologists share a rare stage to discuss the AI industry's biggest "Rashomon"
36Kr· 2025-05-28 11:59
Core Insights
- The AI industry is in a heated debate over the effectiveness of pre-training versus first-principles approaches, with notable figures such as Ilya Sutskever (formerly of OpenAI) suggesting that pre-training has reached its limits [1][2]
- The shift from a consensus-driven approach to exploring non-consensus methods is evident, as companies and researchers seek innovative solutions in AI [6][7]

Group 1: Industry Trends
- The AI landscape is transitioning from a focus on pre-training to exploring alternative methodologies, with companies like Sand.AI and NLP LAB leading the charge in applying multi-modal architectures to language and video models [3][4]
- New models such as Dream 7B demonstrate the potential of applying diffusion models to language tasks, outperforming larger models like DeepSeek V3 [3][4]
- The consensus around pre-training is being challenged: some experts argue it is not yet over, since untapped data remains that could further improve model performance [38][39]

Group 2: Company Perspectives
- Alibaba's Qwen team, led by Lin Junyang, has faced criticism for being conservative, yet it maintains that its extensive experimentation has yielded valuable insights and ultimately reaffirmed the effectiveness of the Transformer architecture [5][15]
- Exploration of Mixture of Experts (MoE) models is ongoing, with the team recognizing the potential for scalability while also addressing the challenges of training stability [16][20]
- The industry is increasingly focused on optimizing model efficiency and effectiveness, with particular interest in balancing model size against performance [19][22]

Group 3: Technical Innovations
- The integration of different model architectures, such as using diffusion models for language generation, reflects a broader wave of innovation in AI [3][4]
- The challenges of training models on long sequences, and the need for effective optimization strategies, are critical focus areas for researchers [21][22]
- Future breakthroughs may come from leveraging increased computational power to revisit previously unviable techniques, suggesting a cycle of innovation driven by hardware advances [40][41]