Reinforcement Learning
The "Next-Token" Paradigm Is Changing! Reinforcement Learning Pre-Training Has Just Arrived
Jiqizhixin · 2025-06-11 03:54
Core Viewpoint - The article discusses the emerging importance of Reinforcement Learning (RL) in enhancing AI model capabilities, particularly through a new paradigm called Reinforcement Pre-Training (RPT), which redefines next-token prediction as a reasoning task [3][10][24].

Summary by Sections

Introduction
- Yann LeCun previously viewed reinforcement learning as a minor component of AI, but its significance in model enhancement is growing [3].

RPT Overview
- RPT transforms the next-token prediction task into a reasoning process, allowing models to receive verifiable rewards for correct predictions [6][25].
- The method leverages vast amounts of unannotated text data for general-purpose reinforcement learning, without requiring domain-specific labeled answers [9][26].

Advantages of RPT
- RPT offers inherent scalability and generality by training on large unannotated datasets [28].
- It minimizes the risk of reward hacking by using direct, rule-based reward signals [29].
- The internal reasoning process during pre-training allows for deeper understanding and generalization beyond mere token memorization [30].
- RPT enhances prediction accuracy by allocating more computational resources to each prediction step [31].

Experimental Results
- RPT outperforms baseline methods in next-token prediction accuracy across various difficulty levels [40][41].
- The performance of RPT-14B is comparable to that of larger models, indicating its effectiveness in capturing complex reasoning signals [43].
- RPT's accuracy improves reliably with increased training computation, demonstrating favorable scaling characteristics [45].
- Models pre-trained with RPT achieve higher performance ceilings when further trained with RLVR, showing that learned reasoning patterns transfer to downstream tasks [47].

Zero-Shot Performance
- RPT-14B consistently surpasses R1-Distill-Qwen-14B across all benchmark tests, even outperforming larger models in next-token prediction [49].

Reasoning Mode Analysis
- The reasoning process of RPT-14B differs qualitatively from that of R1-Distill-Qwen-14B, indicating deliberate reasoning rather than simple pattern matching [51].
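The rule-based RPT reward described above can be sketched in a few lines: the model emits a reasoning chain, then a candidate next token, and is rewarded only when that token matches the corpus's actual continuation. This is an illustrative sketch under my own naming, not the paper's implementation:

```python
def rpt_reward(predicted_token: str, ground_truth_token: str) -> float:
    """RPT-style verifiable reward: 1.0 when the token the model
    predicts (after emitting its reasoning chain) matches the corpus's
    actual next token, else 0.0. Because the signal comes directly from
    raw text, no labeled answers or learned reward model are needed,
    which limits opportunities for reward hacking."""
    return 1.0 if predicted_token == ground_truth_token else 0.0

# Every position in an unannotated corpus becomes an RL episode:
# context -> reasoning chain -> predicted token -> verifiable reward.
print(rpt_reward(" Paris", " Paris"))  # 1.0
print(rpt_reward(" Lyon", " Paris"))   # 0.0
```

This framing is what gives the method its scalability: the "labels" are just the text itself.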
Mistral's First Strong Reasoning Model: Embracing Open Source, with 10x Faster Inference
Jiqizhixin · 2025-06-11 03:54
Core Viewpoint - Mistral AI has launched a new series of large language models (LLMs) named Magistral, showcasing strong reasoning capabilities and the ability to tackle complex tasks [4].

Group 1: Model Overview
- The launch includes two versions: a proprietary model for enterprise clients called Magistral Medium and an open-source version with 24 billion parameters named Magistral Small [5].
- The open-source version is available under the Apache 2.0 license, allowing free use and commercialization [5].

Group 2: Performance Metrics
- In benchmark tests, Magistral Medium scored 73.6% on AIME2024, rising to 90% with majority voting over 64 samples [6].
- Magistral Small achieved 70.7% and 83.3% on the same test [6].
- The models also excelled in demanding benchmarks such as GPQA Diamond and LiveCodeBench [7].

Group 3: Technical Features
- Magistral Medium demonstrates programming capability, for example generating code to simulate gravity and friction [10].
- The model maintains high-fidelity reasoning across multiple languages, including English, French, Spanish, German, Italian, Arabic, Russian, and Chinese [11].
- With Flash Answers in Le Chat, Magistral Medium can achieve up to 10 times the token throughput of most competitors, enabling large-scale real-time reasoning and fast user feedback [14].

Group 4: Learning Methodology
- Mistral employs a proprietary, scalable reinforcement learning pipeline, relying on its own models and infrastructure rather than existing implementations [15].
- The model's design principle focuses on reasoning in the same language as the user, minimizing code-switching and enhancing performance on reasoning tasks [16][17].

Group 5: Market Positioning
- Magistral Medium is being integrated into major cloud platforms, starting with Amazon SageMaker, with Azure AI, IBM WatsonX, and Google Cloud Marketplace to follow [20].
- Pricing is set at $2 per million input tokens and $5 per million output tokens, significantly higher than the previous Mistral Medium 3 model ($0.4 and $2 respectively) [21].
- Despite the price increase, Magistral Medium's pricing remains competitive: cheaper than OpenAI's latest models and on par with Gemini 2.5 Pro [22].
Tencent Research Institute AI Express 20250611
Tencent Research Institute · 2025-06-10 14:58
Group 1: Apple Developments
- Apple has unified the design of its six major operating systems, introducing a new "Liquid Glass" element that significantly enhances visual effects [1]
- The company has opened access to on-device large language models for all apps, integrating AI functionalities such as visual search and real-time translation [1]
- Major updates to iPadOS and enhanced macOS-iPhone integration were announced, but the release of the new Siri has been delayed again [1]

Group 2: Developer Tools
- Apple announced Xcode 26, which integrates ChatGPT to assist developers with code writing, documentation generation, and error fixing [2]
- Developers can introduce AI models from other vendors into Xcode via API keys, fostering a diverse intelligent-programming ecosystem [2]
- The Foundation Models framework allows developers to call local AI models with just three lines of code [2]

Group 3: NoCode Tool by Meituan
- Meituan launched NoCode, an AI coding agent that enables users to create websites and applications without programming [3]
- NoCode combines product, design, and engineering functionalities, supporting scenarios such as website design and game development [3]
- The tool can infer implicit needs and supports collaborative work; it is now fully launched and available for free [3]

Group 4: Tencent's Yuanbao Upgrade
- The desktop version of Tencent's Yuanbao has upgraded its text-selection feature, adding continuous selection with automatic translation [4]
- A new window-pinning feature keeps the translation-results window fixed on screen, enhancing reading efficiency [4]
- The upgraded functionality is particularly useful for browsing foreign websites and reading English documents [4]

Group 5: Meta's Nuclear Power Agreement
- Meta signed a 20-year nuclear power purchase agreement with Constellation Energy for 1,121 megawatts from the Clinton Clean Energy Center in Illinois [5]
- This agreement surpasses Microsoft's previous 835-megawatt collaboration and is aimed at supporting Meta's growing energy needs for data centers and AI development [5]
- The partnership will retain over 1,100 jobs and increase power generation by 30 megawatts, with supply expected to start in 2027 to support Meta's planned 1.3-million-GPU scale [5]

Group 6: AI Chip Design by the Chinese Academy of Sciences
- The Chinese Academy of Sciences launched the "Enlightenment" system, achieving fully automated design of processor chips, with performance meeting or exceeding human expert levels [6]
- The system has successfully designed the RISC-V CPU "Enlightenment 2," matching the performance of the ARM Cortex-A53, and can automatically configure operating systems and high-performance libraries [6]
- The "Enlightenment" system employs a three-layer architecture and a "three-step" technical route, potentially transforming chip-design paradigms and significantly enhancing design efficiency [6]

Group 7: AI Voice Interaction Insights
- The founder of ElevenLabs suggests that incorporating "imperfections" into AI voices can enhance user interaction, as overly perfect voices may reduce engagement [8]
- Future voice agents are expected to possess contextual awareness, transitioning from passive customer service to proactive user-experience guidance [8]
- As AI voice technology evolves, a new trust mechanism will emerge, focused on verifying whether content is human-voiced rather than AI-generated [8]

Group 8: Richard Sutton's Vision for AI
- Richard Sutton, the father of reinforcement learning, believes AI is transitioning from the "era of human data" to the "era of experience," learning from real-time interaction with the environment [9]
- He advocates a decentralized, cooperative model of AI development, opposing centralized control rooted in fear [9]
- Sutton categorizes the evolution of the universe into four eras, asserting that humanity is transitioning from the third to the fourth, with the mission of designing systems that are themselves capable of design [9]

Group 9: Sergey Levine's Perspective on AI Learning
- Professor Sergey Levine of UC Berkeley posits that large language models may merely be observers in a "Plato's cave," learning about human thought indirectly through internet text [10]
- He questions why language models can learn rich knowledge from predicting the next token, while video models learn less despite containing more information about the physical world [10]
- This perspective suggests that current AI systems may only mimic human thought rather than truly understand the world, indicating a need for AI to learn from physical experience [10]
The Father of Reinforcement Learning: LLM Dominance Is Only Temporary; Scaling Computation Is the Real Answer
QbitAI · 2025-06-10 02:23
Core Viewpoint - The dominance of large language models (LLMs) is temporary; they will not remain at the forefront of technology over the next five to ten years [1][2].

Group 1: Current State of AI
- Richard Sutton, a Turing Award winner and the father of reinforcement learning, emphasizes that current AI models like ChatGPT rely on analyzing vast amounts of human-generated data [9].
- He argues that pursuing human-like thinking will only achieve "human-level" performance; in fields like mathematics and science, the knowledge within human data is nearing its limits, making further innovation through mere imitation difficult [10][11].

Group 2: Future of AI Learning
- Sutton believes AI must transition from relying on human data to acquiring "experience data" through first-person interaction with the world [13][14].
- He illustrates this with AlphaGo's unconventional move against Lee Sedol, showcasing AI's potential for innovative thinking through experiential learning [14].
- The future of AI belongs to an "era of experience" in which agents learn from interaction, which exceeds the capabilities of current LLMs [18].

Group 3: Reinforcement Learning and Computational Power
- Sutton states that the core path to the future of AI lies in reinforcement learning, which is centered on experiential learning [19].
- To fully leverage reinforcement learning, deep learning algorithms with continual-learning capabilities are essential [20].
- Large-scale computational power is crucial for expanding AI capabilities and meeting increasing performance demands [22][23].

Group 4: Decentralized Cooperation Among Agents
- Sutton discusses the potential for decentralized cooperation among agents with different goals, suggesting they can achieve mutual benefits through interaction [24].
- He critiques calls for centralized control of AI, attributing such views to fear of the unknown, and advocates embracing the diversity of individual goals to establish a cooperative order [26].

Group 5: The Design Era
- Sutton introduces the concept of a "design era" in which machines become increasingly life-like, while emphasizing the fundamental differences between life and technology [29].
- He posits that the goal of developing AI is the ultimate design: creating agents capable of self-design, with humans acting as catalysts and creators in this process [29].

Group 6: Community Reactions
- Sutton's statements have sparked intense discussion within the community, with supporters arguing that breakthroughs often arise from the unknown and that LLMs may be approaching their limits [30][31].
A Panoramic Look at How Reinforcement Learning Is Reshaping AI in 2025 | Jinqiu Select
Jinqiu Ji · 2025-06-09 15:22
Core Insights - The article discusses the transformative impact of reinforcement learning (RL) on the AI industry, highlighting its role in advancing AI capabilities toward artificial general intelligence (AGI) [3][4][9].

Group 1: Reinforcement Learning Advancements
- Reinforcement learning is reshaping the AI landscape by shifting hardware demand from centralized pre-training architectures to distributed, inference-intensive architectures [3].
- The emergence of recursive self-improvement allows models to participate in training the next generation of models: optimizing compilers, improving kernel engineering, and tuning hyperparameters [2][4].
- Performance metrics such as SWE-Bench indicate that models are becoming more efficient and cost-effective while improving in capability [5][6].

Group 2: Model Development and Future Directions
- OpenAI's upcoming o4 model will be built on the more efficient GPT-4.1, marking a strategic shift toward optimizing reasoning efficiency rather than merely pursuing raw intelligence [4][108].
- The o5 and subsequent plans aim to leverage sparse mixture-of-experts architectures and continued algorithmic breakthroughs to advance model capabilities [4].
- The article emphasizes high-quality data as a new competitive advantage in scaling RL, enabling companies to build unique moats without massive budgets for synthetic data [54][55].

Group 3: Challenges and Opportunities in RL
- Despite strong progress, scaling RL computation faces new bottlenecks across the infrastructure stack, necessitating significant investment [9][10].
- Defining reward functions in non-verifiable domains is complex, but successful applications have been demonstrated, particularly in areas such as writing and strategy formulation [24][28].
- Introducing evaluation rubrics and using LLMs as evaluators can enhance the effectiveness of RL on non-verifiable tasks [29][32].

Group 4: Infrastructure and Environment Design
- Robust environment design is critical for RL, as misconfigured environments can lead to misunderstood tasks and unintended behaviors [36][38].
- Environments must provide rapid feedback and accurately simulate real-world scenarios, both of which are crucial for effective RL training [39][62].
- Investment in environment computing is seen as a new frontier, with potential for highly realistic environments that significantly enhance RL performance [62][64].

Group 5: The Future of AI Models
- The article predicts that the integration of RL will lead to a new model-iteration paradigm, allowing continuous improvement after release [81][82].
- Recursive self-improvement is becoming a reality, with models participating in the training and coding of subsequent generations, enhancing overall efficiency [84][88].
- The article concludes with OpenAI's future strategy: developing models that balance strong foundational capabilities with practical RL applications [107][108].
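The idea of using LLMs as evaluators for non-verifiable tasks can be sketched as scoring a response against a written rubric and using the satisfied fraction as the reward. The names here are mine, and `toy_judge` is a deliberately simplistic stand-in for an actual judge model:

```python
from typing import Callable

def rubric_reward(response: str, rubric: list[str],
                  judge: Callable[[str, str], bool]) -> float:
    """Reward for non-verifiable tasks: ask a judge (ideally an LLM)
    whether the response satisfies each rubric criterion, and return
    the satisfied fraction as a scalar RL reward in [0, 1]."""
    satisfied = sum(judge(response, criterion) for criterion in rubric)
    return satisfied / len(rubric)

# Stand-in judge for demonstration only; a real system would prompt an
# LLM with the response and criterion and parse a yes/no verdict.
def toy_judge(response: str, criterion: str) -> bool:
    return criterion.lower() in response.lower()

rubric = ["cites a source", "states a limitation", "gives a recommendation"]
score = rubric_reward("The essay cites a source and gives a recommendation.",
                      rubric, toy_judge)
print(round(score, 2))  # 0.67
```

Decomposing the judgment into per-criterion checks, rather than asking for one holistic score, tends to make the reward signal more stable and harder to game.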
The Last Piece of the AGI Puzzle: What Is Reinforcement Learning, and What Is Its Moat?
Hua Er Jie Jian Wen · 2025-06-09 10:47
When DeepSeek-R1 achieves a comparable performance breakthrough at far lower cost, and Claude can work coherently for hours on complex tasks, AI development has entered the era of reasoning; the importance of reinforcement learning is self-evident, and it will reshape the AI industry's technology stack and even its business models.

On June 8, AI research firm SemiAnalysis published a long report, "Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data," dissecting how reinforcement learning works and the factors that influence it, and forecasting subsequent AI trends.

The report argues that reinforcement learning (RL) may be the last key paradigm before AGI, and that its inference-intensive nature poses compute challenges. In addition, high-quality data is RL's moat, and the loop of AI designing AI is accelerating technical iteration.

1. Reinforcement learning (RL) may be the last key paradigm before AGI: RL is the core technology driving the leap in large models' reasoning ability, excelling especially at chain-of-thought (CoT) generation and coherence over long-horizon tasks, and it is regarded as the final technical path before AGI.
2. Verifiable-reward scenarios commercialize first: tasks with well-defined reward functions, such as coding and math (e.g., 30%+ gains on SWE-Bench), have already landed, with models like OpenAI's o1 and DeepSeek-R1 validating their value. Non-verifiable domains such as healthcare and writing construct reward functions via an "LLM judge + human-written rubric" (e.g., the HealthBench medical evaluation); OpenAI, Alibaba's Q ...
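The verifiable-reward tasks the report highlights (math, coding) can be sketched as a rule-based answer checker. This is a minimal illustration of my own; production graders also normalize formats, units, and symbolic forms:

```python
def verifiable_math_reward(model_answer: str, reference: str) -> float:
    """Binary reward for a math task: parse both final answers as
    numbers and reward exact agreement. Domains where such automatic
    checks exist (math answers, code against unit tests) are where
    RL has been commercialized first."""
    try:
        return 1.0 if float(model_answer.strip()) == float(reference) else 0.0
    except ValueError:
        # Unparseable output earns no reward rather than crashing training.
        return 0.0

print(verifiable_math_reward(" 42 ", "42"))      # 1.0
print(verifiable_math_reward("41", "42"))        # 0.0
print(verifiable_math_reward("forty-two", "42")) # 0.0
```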
Claiming DeepSeek-R1 and Claude Thinking Cannot Reason at All: Has Apple's Controversial Paper Flopped?
Jiqizhixin · 2025-06-09 04:33
Embodied Intelligence Drives Progress Toward Artificial General Intelligence
Ren Min Ri Bao Hai Wai Ban · 2025-06-09 04:19
Group 1
- The core idea of embodied intelligence is that cognition is shaped by the agent's perception and actions: intelligence arises from the interaction between the agent's body and the surrounding environment, rather than solely from brain function [1][2]
- Embodied intelligence theory has profound implications across fields such as cognitive science, psychology, anthropology, and art, leading to the emergence of sub-disciplines like embodied cognition and embodied psychology [1][2]
- The transition from traditional disembodied intelligence to modern embodied intelligence marks a significant shift in artificial intelligence research, with the latter integrating physical interaction with the environment into learning and decision-making [2][3]

Group 2
- The history of artificial intelligence has evolved through three stages: the first generation focused on knowledge-based reasoning models, the second introduced data-driven models, and the third, marked by the emergence of large language models, represents a new phase of development [3][4]
- The introduction of large language models in 2020 enabled machines to interact freely with humans in open domains, a significant step toward general artificial intelligence [4][5]
- Despite advances in language generation, domain generality across tasks remains limited, particularly in complex areas like medical diagnosis, highlighting the need for embodied intelligence to bridge these gaps [5][6]

Group 3
- The concept of embodied intelligence was first proposed in robotics, emphasizing the importance of body-environment interaction in intelligent behavior [6][7]
- Embodied intelligence has driven advances in robotics technology, shifting from single-modal to multi-modal perception, which is crucial for applications like autonomous vehicles [8][9]
- Integrating the agent concept into embodied intelligence allows robots to combine thinking, perception, and action, facilitating tasks in both digital and physical worlds and enhancing the efficiency of robot development through simulation [9]
Among the Largest Private Financings in History: Meta (META.US) Reportedly Plans to Pour Billions into Scale AI, Escalating the AI Data Arms Race
Zhitong Finance · 2025-06-09 00:01
According to Zhitong Finance APP, Meta (META.US) is reportedly in talks to invest billions of dollars in Scale AI. The financing could value the company at over $10 billion, making it one of the largest private-company financing events in history. In 2024, Scale AI was already valued at about $14 billion in a round that included Meta's participation.

Scale CEO Alexandr Wang may not be as much of a household name as OpenAI's Sam Altman, but his company has become the clear leader in data, one of AI's three pillars alongside chips and talent. Through a vast outsourced workforce, the startup provides the data-labeling services needed to train AI models for tech companies such as Meta and OpenAI, and helps develop custom AI applications. According to people familiar with the matter, Scale is increasingly recruiting highly educated experts such as PhDs and nurses to work on complex model development.

Scale's trajectory has both been shaped by the OpenAI-driven AI boom and fed back into it. Early on, Scale focused on labeling images of cars, traffic lights, and road signs to help train models for building autonomous vehicles. It has since moved on to annotating and curating the massive text data needed to build the large language models behind chatbots like ChatGPT. These models learn by extracting patterns from the data and its labels.

Despite facing the psychological ... toward cheap overseas labor ...
Why Do Models Still Improve with the Wrong Rewards? New Research: Models Aren't Learning New Knowledge, but How to Think
Jiqizhixin · 2025-06-08 03:45
The main authors of this paper are Lv Ang and Xie Ruobing. Lv Ang is a PhD student at Renmin University of China, advised by Professor Yan Rui, researching language model architecture optimization; Xie Ruobing is a senior researcher at Tencent, working on large language models and recommender systems.

In a recent paper, researchers from Renmin University and Tencent show that language models are robust to reward noise in reinforcement learning: even flipping a substantial fraction of rewards (e.g., scoring correct answers 0 and incorrect answers 1) does not significantly hurt downstream performance.

The researchers explain that RL's gains on downstream tasks depend less on reward accuracy than on whether the model produces a high-quality thinking process. By rewarding only the frequency of key thinking words in the model's output, rather than answer correctness, language models can still reach very high peak performance on downstream tasks. This suggests that RL's downstream gains come largely from teaching the model to adopt appropriate reasoning paths toward correct answers, while the underlying problem-solving ability was already acquired during pre-training. Capability gains during pre-training therefore remain crucial.

The researchers also show how a minimalist reward based on thinking patterns can effectively calibrate reward models, improving language model performance on open-ended NLP tasks and enabling smaller models to successfully acquire thinking ability through RL.

Paper: https://huggingface.co/papers/2505.22653
Code: ...
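The answer-agnostic, thinking-pattern reward the researchers describe can be sketched by counting reasoning-marker words in the model's output. The marker list and cap below are illustrative choices of mine, not the paper's exact configuration:

```python
# Illustrative reasoning markers; the paper's actual keyword set differs.
THINKING_MARKERS = ("first", "therefore", "however", "verify", "wait")

def thinking_pattern_reward(output: str, cap: int = 5) -> float:
    """Reward the presence of a reasoning process rather than answer
    correctness: count key thinking words in the output, clip at `cap`,
    and normalize to [0, 1]. The finding is that this noisy,
    answer-agnostic signal can still elicit strong downstream
    performance, because the underlying problem-solving ability was
    already learned during pre-training."""
    text = output.lower()
    count = sum(text.count(marker) for marker in THINKING_MARKERS)
    return min(count, cap) / cap

reasoned = "First, expand the square. Therefore x = 3. Wait, verify: 3*3 = 9."
terse = "x = 3"
print(thinking_pattern_reward(reasoned))  # 0.8
print(thinking_pattern_reward(terse))     # 0.0
```

Under such a reward, a verbose reasoning trace scores highly even when its final answer is wrong, which is exactly the reward noise the paper shows models tolerate.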