Reinforcement Learning
Reinforcement Learning Is Far From Optimal: CMU Just Proposed Maximum Likelihood Reinforcement Learning
机器之心· 2026-02-05 07:52
机器之心 editorial team: In the era of large models, from code generation to mathematical reasoning to autonomously planning Agent systems, reinforcement learning has become the standard "last mile" of training. Intuitively, what developers really want is simple: make the model more likely to generate "correct trajectories". From a probabilistic standpoint, this is equivalent to maximizing the probability of correct outputs, i.e., the classic Maximum Likelihood objective. However, a new study from CMU, Tsinghua University, Zhejiang University, and other research institutions points out a rather subversive fact: the reinforcement learning widely used in practice is not actually performing maximum likelihood optimization. Rigorous theoretical analysis shows that reinforcement learning only optimizes a first-order approximation of the maximum likelihood objective, still quite far from the training objective we assumed was optimal. Based on this observation, the research team re-examined the objective function of reinforcement learning and proposed Maximum Likelihood Reinforcement Learning: it recasts correctness-based reinforcement learning as a maximum likelihood problem over latent-variable generation, and further introduces a family of compute-indexed objective functions that let the training objective progressively approach true maximum likelihood optimization. Paper title: Maximum Likelihood Reinforcement Learning. Paper link: https: ...
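The claim that RL "only optimizes a first-order approximation of the maximum likelihood objective" can be made concrete with a short derivation; the notation below (a binary correctness reward r and policy π_θ) is an illustrative reconstruction, not the paper's own formulation.

```latex
% Maximum-likelihood objective: log-probability of producing a correct trajectory
J_{\mathrm{ML}}(\theta) = \log p_\theta(\mathrm{correct})
                        = \log \mathbb{E}_{y \sim \pi_\theta}[\, r(y) \,],
\qquad r(y) \in \{0, 1\}

% Standard RL objective: expected reward
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}[\, r(y) \,]

% First-order Taylor expansion of \log u about u = 1 gives \log u \approx u - 1, so
J_{\mathrm{ML}}(\theta) \approx J_{\mathrm{RL}}(\theta) - 1
```

Under this reading, maximizing expected reward agrees with maximum likelihood only to first order, and the gap widens as the success probability moves away from 1, which is exactly the regime where hard problems live.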
CICC: Large Models Will Achieve More Breakthroughs in 2026, Another Step Toward the Long-Term Goal of AGI
Zhi Tong Cai Jing· 2026-02-05 01:39
According to the Zhitong Finance APP, CICC released a research report stating that in 2025 global large-model capabilities continued to advance, gradually conquering productivity scenarios and making clear progress in reasoning, coding, agentic, and multimodal abilities, though general model capabilities still fall short in stability and hallucination rate. Looking ahead to 2026, CICC expects large models to achieve more breakthroughs in reinforcement learning, model memory, and context engineering, moving from short-context generation to long chain-of-thought tasks and from text interaction to native multimodality, taking another step toward the long-term goal of AGI.

CICC's main views are as follows:

Reinforcement learning is rising in importance, becoming the key to unlocking advanced model capabilities
The introduction of reinforcement learning raises models' intelligence ceiling, letting them think and reason more logically and more in line with human preferences. Its essence is "self-generated data + multi-round iteration", and its key ingredients are large-scale compute plus high-quality data. Overseas vendors such as OpenAI and Gemini place great weight on reinforcement learning, and domestic players such as DeepSeek and Alibaba's Qwen are following suit; CICC expects the share of reinforcement learning in training to rise further at home and abroad in 2026.

Continuous learning, model memory, world models, and other new directions will see core breakthroughs
Continuous learning and model memory essentially address large models' "catastrophic forgetting" problem, giving models a selective memory mechanism. The core of Google's Titans, MIRAS, and Nested Learning algorithms and architectures is to let the model ...
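The "self-generated data + multi-round iteration" essence described above can be sketched as a minimal rejection-sampling loop. Everything concrete below is an illustrative assumption: the dictionary "policy", the noisy sampler, and the toy string-reversal task stand in for a real model, decoder, and reward.

```python
import random

def self_improvement_loop(policy, prompts, sample, reward, update, rounds=3):
    """Sketch of 'self-generated data + multi-round iteration': each round,
    the model samples its own candidates, a reward signal filters them,
    and the surviving (prompt, answer) pairs become the next training set."""
    for _ in range(rounds):
        batch = []
        for p in prompts:
            candidates = [sample(policy, p) for _ in range(8)]
            best = max(candidates, key=lambda y: reward(p, y))
            if reward(p, best) > 0:           # keep only rewarded trajectories
                batch.append((p, best))
        policy = update(policy, batch)        # "fine-tune" on self-generated data
    return policy

# Toy instantiation: the "policy" memorizes answers, sampling is noisy recall,
# and the task is simply to reverse the prompt string.
random.seed(0)
sample = lambda pol, p: pol.get(p, "") if random.random() < 0.7 else p[::-1]
reward = lambda p, y: 1.0 if y == p[::-1] else 0.0
def update(pol, batch):
    new = dict(pol)
    new.update(batch)
    return new

trained = self_improvement_loop({}, ["abc", "hello"], sample, reward, update)
```

The point of the sketch is the data flow, not the learning rule: compute enters through sampling, and data quality enters through the reward filter, matching the report's "large-scale compute + high-quality data" framing.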
CICC | AI Decade Outlook (26): 2026 Key Trends in Model Technology
中金点睛· 2026-02-04 23:52
Core Insights
- The article discusses the advancements in large model technology, highlighting improvements in reasoning, programming, agentic capabilities, and multimodal abilities, while also noting existing shortcomings in general reliability and memory capabilities [1][4].

Model Architecture and Optimization
- The Transformer architecture continues to dominate, with a consensus on the efficiency of the Mixture of Experts (MoE) model, which activates only a subset of parameters, significantly reducing computational costs [17][18].
- The industry is exploring various attention mechanisms to balance precision and efficiency, including Full-Attention, Linear-Attention, and Hybrid-Attention [20].

Model Capabilities
- Significant progress has been made in reasoning, programming, agentic tasks, and multimodal applications, with models achieving real productivity levels in various domains [3][4].
- The introduction of reinforcement learning is crucial for unlocking advanced model capabilities, allowing for more logical reasoning aligned with human preferences [2][23].

Competitive Landscape
- Major players like OpenAI, Gemini, and Anthropic are intensifying their competition, with OpenAI focusing on enhancing reasoning and multimodal integration, while Gemini has made significant strides in model capabilities and is leveraging high-quality data for improvements [11][42][43].
- Domestic models are catching up, maintaining a steady gap of about six months behind their international counterparts, with companies like Alibaba and ByteDance producing competitive models [12][14].

Future Directions
- The focus for 2026 includes further advancements in reinforcement learning, continuous learning, and world models, with expectations for models to tackle more complex tasks and achieve long-term goals like AGI [27][40].
- Continuous learning and model memory are seen as essential for achieving lifelong learning capabilities, with new algorithms like MIRAS and HOPE being pivotal in this evolution [28][32].
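The sparse activation that makes MoE cheap, as noted above, comes down to a gate that runs only the top-k experts. The sketch below is a generic illustration; the dimensions, softmax gating, and linear experts are assumptions for clarity, not any specific model's design.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Minimal Mixture-of-Experts layer: a gate scores every expert, but
    only the top-k experts actually run, so compute scales with k rather
    than with the total number of experts."""
    logits = x @ gate_w                          # (n_experts,) gating scores
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax renormalized over top-k
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just an independent linear map in this sketch.
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)  # only 2 of 8 experts run
```

With k=2 of 8 experts active, the layer holds 8 experts' worth of parameters but pays roughly a quarter of the dense compute per token, which is the cost saving the report refers to.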
A Tribute to Kimi K2: Full-Pipeline INT4 Quantization-Aware RL Training Based on slime
机器之心· 2026-02-03 10:35
Core Insights
- The SGLang RL team has successfully implemented the INT4 Quantization-Aware Training (QAT) process inspired by the Kimi K2 team, achieving stability and consistency comparable to BF16 full-precision training while enabling extreme compression of large models [2][3][4].

Technical Overview
- The project is a collaboration among multiple teams, including SGLang RL, InfiXAI, Ant Group, and others, with functionalities shared in the slime and Miles communities [4].
- A complete QAT INT4 closed-loop solution has been established, enhancing training stability and efficiency in reinforcement learning (RL) scenarios [6].
- Rollout efficiency has significantly improved by eliminating cross-machine communication bottlenecks, allowing 1TB models to fit within a single H200 (141GB) GPU's memory [6][10].

Training Process
- The training phase utilizes Fake Quantization to simulate quantization noise while maintaining high-precision BF16 weights, ensuring the model adapts to low-precision representations [8][9].
- The Straight-Through Estimator (STE) technique allows gradients to bypass the non-differentiable quantization operations, maintaining training continuity [9][11].
- The transition from BF16 weights to INT4 format is executed during the weight-conversion phase, facilitating efficient inference [10][25].

Performance Evaluation
- Experiments demonstrate that the QAT INT4 training approach maintains robust performance, with the rollout configuration showing consistent growth in raw rewards compared to BF16 and FP8 configurations [41][46].
- The INT4 QAT strategy effectively mitigates discrepancies between training and inference outputs, achieving a high degree of consistency [51][56].

Future Directions
- The project aims to explore further optimizations to enhance training efficiency and to investigate the application of FP4 precision in RL training and inference as NVIDIA's Blackwell architecture becomes more prevalent [58][62].
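The two training-phase mechanisms described above, Fake Quantization and the STE, can be sketched in a few lines. The symmetric per-group INT4 scheme and the group size of 32 below are illustrative assumptions, not the actual slime implementation.

```python
import numpy as np

def fake_quant_int4(w, group=32):
    """Fake Quantization: round weights onto a symmetric INT4 grid
    (16 levels in [-8, 7]) per group, then immediately dequantize, so
    training stays in high precision but experiences INT4 noise."""
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(g / scale), -8, 7)      # integer INT4 grid
    return (q * scale).reshape(w.shape)          # dequantized weights

def ste_grad(grad):
    """Straight-Through Estimator: round() has zero gradient almost
    everywhere, so the backward pass copies the gradient through as-is."""
    return grad

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
w_q = fake_quant_int4(w)
err = np.abs(w - w_q).max()   # quantization noise the model must adapt to
```

At deployment time the same grid values `q` and scales would be stored directly as INT4 plus per-group scales, which is the weight-conversion step the summary mentions.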
Lei Jun Announces That Multiple New Xiaomi Research Papers Have Been Accepted at ICLR 2026, a Top International Conference
Sou Hu Cai Jing· 2026-02-03 03:13
Core Insights
- Xiaomi's founder and CEO Lei Jun announced that multiple research achievements from the Xiaomi team have been selected for ICLR 2026, covering areas such as multimodal reasoning, reinforcement learning, GUI agents, end-to-end autonomous driving, and audio generation [1][3].

Group 1: Research Achievements
- The research paper titled "Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle" addresses inefficiencies in existing reinforcement learning training processes, particularly issues like Advantage Collapsing and Rollout Silencing, which hinder long-term optimization capabilities [4].
- Shuffle-R1 proposes a streamlined reinforcement learning framework that significantly enhances training efficiency through two core designs: Pairwise Trajectory Sampling and Advantage-based Batch Shuffle, leading to improved gradient signal quality and increased exposure of valuable trajectories [4].
- Experimental results indicate that Shuffle-R1 consistently outperforms various reinforcement learning baselines with minimal computational overhead [4].

Group 2: Mobile Agents and GUI
- The paper "MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning" introduces a framework to improve the reasoning and planning capabilities of Mobile GUI Agents, addressing challenges such as the scarcity of high-quality CoaT trajectories and the limitations of existing self-training methods [7][8].
- MobileIPL employs Thinking-level DPO and Instruction Evolution to enhance process supervision and expand task distribution, resulting in state-of-the-art performance on mainstream GUI-Agent benchmarks [8][10].

Group 3: Language Models
- "FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation" presents a modular reasoning framework for small language models (SLMs) that enhances their performance in complex tasks without additional training or parameter increments [12][13].
- FutureMind extracts advanced cognitive abilities from large language models (LLMs) through adaptive knowledge distillation, creating a dynamic reasoning pipeline that significantly improves reasoning efficiency and retrieval accuracy [12][13].

Group 4: Multimodal Reasoning
- The paper "ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding" proposes a framework that transfers mature textual reasoning capabilities to multimodal scenarios without the need for costly model fine-tuning [16][17].
- ThinkOmni includes components like LRM-as-a-Guide and Stepwise Contrastive Scaling, which balance perception and reasoning signals, demonstrating consistent performance improvements across multiple multimodal reasoning benchmarks [17].

Group 5: Audio Generation
- "Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation" introduces a two-stage audio generation framework that combines Flow Matching pre-training with lightweight GAN fine-tuning for efficient audio generation [23][24].
- The framework enhances audio modeling capabilities by addressing the unique properties of audio signals and demonstrates superior performance in generating high-fidelity audio with improved computational efficiency compared to existing methods [24].
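Shuffle-R1's two core designs are only named above, not specified. As a rough, hypothetical sketch of their flavor (the rollout structure, advantage fields, and magnitude-ordering heuristic below are all assumptions, not Shuffle-R1's actual algorithm), one might pair contrasting rollouts per prompt and then order batches by signal strength:

```python
import random

def pairwise_sample(rollouts, seed=0):
    """Keep, per prompt, one positive- and one negative-advantage rollout,
    so every retained pair carries a nonzero contrastive learning signal
    (prompts whose rollouts are all neutral are dropped entirely)."""
    rng = random.Random(seed)
    pairs = []
    for prompt, trajs in rollouts.items():
        pos = [t for t in trajs if t["advantage"] > 0]
        neg = [t for t in trajs if t["advantage"] < 0]
        if pos and neg:
            pairs.append((rng.choice(pos), rng.choice(neg)))
    return pairs

def advantage_shuffle(pairs):
    """Order pairs by total advantage magnitude so high-signal trajectories
    lead the batch instead of being drowned out by near-zero ones."""
    return sorted(pairs,
                  key=lambda pr: abs(pr[0]["advantage"]) + abs(pr[1]["advantage"]),
                  reverse=True)

rollouts = {
    "q1": [{"advantage": 0.9}, {"advantage": -0.4}, {"advantage": 0.1}],
    "q2": [{"advantage": 0.2}, {"advantage": -0.2}],
    "q3": [{"advantage": 0.0}],          # a "silent" rollout group: dropped
}
batch = advantage_shuffle(pairwise_sample(rollouts))
```

The dropped `q3` group illustrates why such filtering targets the Rollout Silencing and Advantage Collapsing failure modes named in the summary: all-neutral groups contribute no gradient and are excluded rather than padded into the batch.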
David Silver, DeepMind's Reinforcement Learning Chief, Leaves to Found a Startup: Creator of the Alpha Series and Hassabis's Right-Hand Man
36Ke· 2026-02-02 08:21
Core Insights
- David Silver, a prominent researcher in reinforcement learning, has left DeepMind after 15 years to establish his own AI company, Ineffable Intelligence [1][5].

Company Formation
- Ineffable Intelligence was quietly founded in November 2025, with Silver officially appointed as a director on January 16, 2026 [2].
- The company is headquartered in London and is actively recruiting AI research talent while seeking venture capital [3].

Contributions at DeepMind
- Silver was a key figure in the development of DeepMind's "Alpha series," leading or significantly contributing to major projects such as AlphaGo, AlphaZero, MuZero, and AlphaStar [7][9].
- His work on AlphaGo, which defeated world champion Lee Sedol in 2016, marked a significant milestone in AI history [9].
- Silver has received multiple accolades, including the ACM Prize in Computing in 2019 and the Royal Academy of Engineering Silver Medal in 2017 [10].

Academic and Research Impact
- Silver is one of the most published authors among DeepMind employees, with over 280,000 citations and an h-index of 104 according to Google Scholar [11].
- His research has focused on advancing AI capabilities beyond human knowledge, advocating for a new "Age of Experience" where AI learns from its own experiences [17][19].

Vision for AI
- Silver aims to tackle the challenge of creating superintelligent AI that can learn independently from first principles, moving away from reliance on human knowledge [17][19].
CPUs See New Opportunities in the AI Agent Era
Orient Securities· 2026-01-31 07:15
Investment Rating
- The report maintains a "Positive" investment rating for the computer industry, indicating an expectation of returns exceeding the market benchmark by more than 5% [3][9].

Core Insights
- The server CPU supply from Intel and AMD is constrained, leading to a projected price increase of 10%-15% due to surging demand from customers like CSPs. The production capacity for server CPUs is essentially sold out for the year 2026 [4].
- The price increase is driven by limited advanced-process capacity and unexpectedly high downstream demand, particularly as the general server market enters a significant upgrade cycle and AI demand continues to exceed expectations [4].
- The report suggests that the current price increase for server CPUs reflects a structural shift in demand rather than a short-term fluctuation, with expectations for continued growth in both quantity and performance requirements for CPUs [4].
- Domestic CPU manufacturers are expected to benefit from this supply-demand imbalance, with companies like Haiguang Information and Loongson expected to fill the demand gap as domestic cloud service providers accelerate evaluations of domestic alternatives [4].

Summary by Sections

Investment Recommendations and Targets
- Recommended stocks include Haiguang Information (688041, Buy), Zhongke Shuguang (603019, Buy), and others, as they are positioned to benefit from the supply constraints faced by Intel and AMD [2].

Industry Dynamics
- The report highlights a significant shift in the AI landscape, where the demand for high single-core performance and memory bandwidth is becoming critical due to the rise of AI agents and reinforcement learning applications [4].
- The infrastructure focus is expected to shift from "GPU compute power" to "CPU scheduling," indicating a long-term trend in the industry [4].
David Silver, Father of AlphaGo, Leaves to Found a Startup Targeting Superintelligence
机器之心· 2026-01-31 02:34
Core Viewpoint
- David Silver, a prominent AI researcher from Google DeepMind, has left the company to establish a new startup named Ineffable Intelligence, focusing on solving complex AI challenges and pursuing superintelligence [1][3][4].

Group 1: Company Formation and Background
- Ineffable Intelligence is being founded in London, with active recruitment for AI researchers and seeking venture capital [3].
- Silver was a key figure at Google DeepMind, contributing to significant achievements such as AlphaGo, AlphaStar, and AlphaZero, which demonstrated the capabilities of AI in complex games [9][12][14].
- The company was officially registered in November 2025, with Silver appointed as a director in January 2026 [4].

Group 2: Silver's Contributions and Vision
- Silver's work includes developing AI systems that surpassed human capabilities in games, showcasing the potential of AI to learn and adapt [12][14].
- He emphasizes the need for AI to explore and discover knowledge independently, moving beyond human limitations and biases [18][23].
- The vision for Ineffable Intelligence is to create a self-learning superintelligence that can autonomously uncover foundational knowledge [23].

Group 3: Industry Context and Trends
- Silver's departure follows a trend where notable AI researchers are leaving established labs to pursue startups focused on superintelligence, with significant funding being raised in the sector [15].
- Other notable figures, such as Ilya Sutskever and Yann LeCun, are also venturing into similar domains, indicating a growing interest in the pursuit of advanced AI capabilities [15][16].
David Silver, DeepMind's Reinforcement Learning Chief, Leaves to Found a Startup! Creator of the Alpha Series and Hassabis's Right-Hand Man
量子位· 2026-01-31 01:34
Core Viewpoint
- David Silver, a prominent figure in reinforcement learning and a key researcher at DeepMind for 15 years, has left the company to establish his own AI startup, Ineffable Intelligence, aiming to tackle the challenges of achieving superintelligence in AI [2][21].

Group 1: Departure from DeepMind
- David Silver has officially left DeepMind and has been appointed as the director of his new company, Ineffable Intelligence, which was quietly established in November 2025 [3][2].
- Silver had been on leave for several months prior to his departure from DeepMind [4].
- Google DeepMind confirmed Silver's departure and expressed gratitude for his contributions during his tenure [9].

Group 2: Achievements at DeepMind
- Silver was instrumental in the development of several landmark AI projects at DeepMind, including AlphaGo, which defeated world champion Lee Sedol in 2016, marking a significant milestone in AI history [14].
- He also led the development of AlphaZero, which achieved superhuman performance in Go, chess, and shogi without relying on human game data [14].
- Silver contributed to the creation of MuZero, which learns to play games without being informed of the rules, and AlphaStar, which defeated top players in StarCraft II [15][16].
- He has received numerous accolades, including the ACM Prize in Computing in 2019 and the Royal Academy of Engineering Silver Medal in 2017 [18].

Group 3: Vision for the Future
- Silver's motivation for founding Ineffable Intelligence is to return to the awe and wonder of solving the most challenging problems in AI, with a focus on creating a superintelligent AI that can learn endlessly [21].
- He advocates for a new "Age of Experience" in AI, where systems learn from experiences through reinforcement learning, moving beyond reliance on human knowledge [24].
- Silver believes that achieving true superintelligence requires AI to learn from first principles, independent of human intuition and knowledge [25].
Another Tsinghua Talent Joins Tencent Hunyuan, Set to Join the Multimodal Model Team to Explore Frontier Reinforcement Learning Algorithms
Feng Huang Wang· 2026-01-30 05:35
Core Insights
- The article discusses the recent hiring of Dr. Tianyu Pang, a prominent scholar in machine learning, by Tencent as the Principal Scientist for the Hunyuan large model team, focusing on multimodal reinforcement learning and generative models [1][2].

Group 1: Talent Acquisition
- Dr. Pang will officially join Tencent on February 4, with an emphasis on generative models in the initial phase of his work [1].
- His previous experience includes being a senior research scientist at Sea AI Lab in Singapore, and he has a strong academic background with multiple publications in top machine learning conferences [2].
- The hiring of Dr. Pang follows the recent recruitment of another young scientist, Yao Shunyu, indicating Tencent's intensified efforts to attract top AI talent [2].

Group 2: Organizational Changes
- Tencent's Hunyuan large model team has undergone significant restructuring, as noted by CEO Ma Huateng, to enhance talent acquisition and improve the research and development team [2].
- The establishment of new departments such as AI Infra and AI Data, along with the appointment of Yao Shunyu as Chief AI Scientist, signals a strategic acceleration in Tencent's AI initiatives [3].
- The Hunyuan team has also made advancements in user experience with the AI assistant "Yuanbao," which has rapidly grown to become one of the top AI applications in China [3].

Group 3: Product Development
- Tencent's Hunyuan team announced the open-sourcing of Hunyuan Image 3.0, which has achieved a top-tier position in the global LMArena image editing rankings, marking it as one of the strongest open-source image generation models [3].