Reinforcement Learning
The last piece of the AGI puzzle: what is reinforcement learning, and what is its moat?
Hua Er Jie Jian Wen· 2025-06-09 10:47
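When DeepSeek-R1 reaches a comparable performance breakthrough at lower cost, and Claude can work coherently for hours to finish complex tasks, AI development has entered the era of reasoning. The importance of reinforcement learning is self-evident: it will reshape the AI industry's technology stack and even its business models.

On June 8, AI research firm SemiAnalysis published a long report, "Reinforcement Learning: Environments, Reward Hacking, Agents, Scaling Data", which analyzes in depth how reinforcement learning works and what affects it, and forecasts where AI development goes next.

According to the report, reinforcement learning (RL) may be the last key paradigm before AGI, and its inference-intensive nature creates compute challenges. In addition, high-quality data is the moat for reinforcement learning, and the loop of AI designing AI accelerates technical iteration.

1. RL may be the last key paradigm before AGI: reinforcement learning is the core technique driving the jump in large-model reasoning ability. It stands out in chain-of-thought (CoT) generation and in coherence over long-horizon tasks, and is seen as the final technical path before AGI.
2. Verifiable-reward scenarios are commercialized first: tasks with well-defined reward functions, such as coding and math (for example, a 30%+ improvement on SWE-Bench), are already deployed, and models such as OpenAI's o1 and DeepSeek-R1 have validated the approach. Non-verifiable domains such as healthcare and writing build reward functions from an "LLM judge + human-written scoring rubric" (for example, the HealthBench medical evaluation). OpenAI, Alibaba Q ...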
It claimed DeepSeek-R1 and Claude Thinking cannot reason at all: has Apple's controversial paper backfired?
机器之心· 2025-06-09 04:33
Embodied intelligence drives the realization of artificial general intelligence
Ren Min Ri Bao Hai Wai Ban· 2025-06-09 04:19
Group 1
- The core idea of embodied intelligence emphasizes that cognition is influenced by the agent's perception and actions, suggesting that intelligence arises from the interaction between the agent's body and the surrounding environment, rather than solely from brain function [1][2]
- Embodied intelligence theory has profound implications across various fields such as cognitive science, psychology, anthropology, and art, leading to the emergence of sub-disciplines like embodied cognition and embodied psychology [1][2]
- The transition from traditional disembodied intelligence to modern embodied intelligence marks a significant shift in artificial intelligence research, where the latter integrates physical interaction with the environment for learning and decision-making [2][3]

Group 2
- The history of artificial intelligence has evolved through three stages: the first generation focused on knowledge-based reasoning models, the second generation introduced data-driven models, and the third generation, marked by the emergence of large language models, represents a new phase of development [3][4]
- The introduction of large language models in 2020 has enabled machines to achieve free interaction with humans in open domains, indicating a significant step towards general artificial intelligence [4][5]
- Despite advancements in language generation, there are still limitations in achieving domain generality across various tasks, particularly in complex areas like medical diagnosis, highlighting the need for embodied intelligence to bridge these gaps [5][6]

Group 3
- The concept of embodied intelligence was first proposed in the field of robotics, emphasizing the importance of the interaction between the body and the environment in intelligent behavior [6][7]
- Embodied intelligence has driven advancements in robotics technology, shifting from single-modal perception to multi-modal perception, which is crucial for applications like autonomous vehicles [8][9]
- The integration of the agent concept in embodied intelligence allows robots to combine thinking, perception, and action, facilitating tasks in both digital and physical worlds, and enhancing the efficiency of robotic development through simulation [9]
Among the largest private financings ever: Meta (META.US) reportedly plans to pour billions of dollars into Scale AI, stepping up the AI data arms race
智通财经网· 2025-06-09 00:01
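The Zhitong Finance app has learned that, according to reports, Meta (META.US) is in talks to invest billions of dollars in Scale AI. The financing could be valued at more than US$10 billion, which would make it one of the largest private-company funding events on record. In 2024, Scale AI was already valued at about US$14 billion in a round whose participants included Meta.

Scale CEO Alexandr Wang may not be a household name like OpenAI's Sam Altman, but his company has become the clear leader in data, one of the three pillars of AI (chips, talent, and data). Through a large outsourced workforce, the startup provides the data-labeling services that technology companies such as Meta and OpenAI need to train AI models, and helps them build custom AI applications. According to people familiar with the matter, Scale is increasingly recruiting highly educated experts such as PhDs and nurses to take part in developing complex models.

Scale's trajectory has both been shaped by the AI boom that OpenAI set off and fed back into it. Early on, Scale focused more on labeling images of cars, traffic lights, and road signs to help train the models used to build self-driving cars. Since then, it has begun helping to annotate and curate the massive text data needed to build the so-called large language models behind chatbots such as ChatGPT. These models learn by extracting patterns from the data and their respective labels.

Despite facing the psychological ... of low-cost overseas labor ...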
Why do models still score higher with the wrong rewards? New research: models are learning how to think, not new knowledge
机器之心· 2025-06-08 03:45
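The main authors of this work are Lv Ang (吕昂) and Xie Ruobing (谢若冰). Lv Ang is a PhD student at Renmin University of China working on language model architecture optimization, advised by Professor Yan Rui (严睿). Xie Ruobing is a senior researcher at Tencent working on large language models and recommender systems.

In a recent paper, researchers from Renmin University and Tencent show that language models are robust to reward noise in reinforcement learning: even when a substantial fraction of rewards is flipped (for example, a correct answer gets 0 and a wrong answer gets 1), downstream task performance is not significantly affected.

The researchers explain that the gain RL brings to downstream tasks depends not only on reward accuracy but, more importantly, on whether the model can produce a high-quality thinking process. When the reward is based only on how often key reasoning words appear in the model's output, rather than on answer correctness, language models can still reach very high peak performance on downstream tasks. This suggests that the gain comes mostly from teaching the model to follow appropriate reasoning paths toward the correct answer, while the underlying problem-solving ability was already acquired during pretraining. Improving capability in the pretraining stage therefore remains crucial.

The researchers also show how a minimalist reward based on thinking patterns can effectively calibrate reward models, improving language model performance on open-ended NLP tasks and allowing smaller models to acquire thinking ability through reinforcement learning.

Paper: https://huggingface.co/papers/2505.22653
Code: ...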
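The two reward variants described in this summary (randomly flipped correctness rewards, and a reward based only on how often key reasoning words appear) can be sketched as follows. This is a minimal illustration, not the authors' code: the phrase list, the per-hit value, the cap, and all function names are assumptions introduced here.

```python
import random

# Hypothetical list of "key reasoning phrases"; the summary does not give the paper's list.
REASONING_PHRASES = ("first", "then", "let me check", "therefore", "wait")


def correctness_reward(answer: str, reference: str) -> float:
    """Standard verifiable reward: 1 for an exact match, 0 otherwise."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Correctness reward that is flipped (0 <-> 1) with probability flip_prob."""
    reward = correctness_reward(answer, reference)
    if rng.random() < flip_prob:
        return 1.0 - reward
    return reward


def reasoning_pattern_reward(response: str, per_hit: float = 0.1,
                             cap: float = 1.0) -> float:
    """Reward that ignores correctness and counts reasoning phrases (capped)."""
    text = response.lower()
    hits = sum(text.count(phrase) for phrase in REASONING_PHRASES)
    return min(cap, per_hit * hits)
```

With `flip_prob=0.0` the noisy reward reduces to the ordinary correctness reward, so the same training loop could be used to sweep the noise level.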
Richard Sutton, the father of reinforcement learning: human data is running out, and AI is entering the "Era of Experience"
AI科技大本营· 2025-06-06 10:18
Core Viewpoint
- The article emphasizes that true intelligence in AI should stem from experience rather than pre-set human data and knowledge, marking a shift towards an "Era of Experience" in AI development [5][16].

Summary by Sections

Introduction to the Era of Experience
- The current era in AI is characterized by a transition from reliance on human-generated data to a focus on experiential learning, where AI systems learn through interaction with the world [9][16].

Key Insights from Richard Sutton's Speech
- Richard Sutton argues that genuine AI must have a dynamic data source that evolves with its capabilities, as static datasets will become inadequate [6][9].
- He highlights that the essence of intelligence lies in the ability to predict and control sensory inputs, which is fundamental to AI and intelligence [13].

The Learning Process
- The learning process in both humans and animals is based on interaction with the environment, where actions determine the information received, leading to a deeper understanding [10][11].
- Sutton illustrates that AI should emulate this learning process by engaging with the world to generate new data and enhance its capabilities [10][12].

Transition from Human Data to Experience
- The article outlines a timeline of AI evolution, indicating that the current "Human Data Era" is nearing its end, paving the way for the "Experience Era" where AI learns through real-world interactions [14][16].
- Sutton emphasizes that the future of AI lies in its ability to continuously learn from experiences, which is essential for unlocking the full potential of the "Experience Era" [17].

Decentralized Cooperation
- The concept of "decentralized cooperation" is introduced as a framework for understanding social organization, where multiple agents pursue their own goals while collaborating for mutual benefit [24][25].
- Sutton argues that human prosperity and the future of AI should be built on this foundation of decentralized cooperation rather than centralized control [27][28].

Conclusion
- The article concludes by encouraging a shift in perspective towards viewing interactions between humans and AI through the lens of decentralized cooperation versus centralized control, which could provide valuable insights into future developments in AI [28].
R1-style training no longer looks only at whether the result is right: CUHK releases the SophiaVL-R1 model
机器之心· 2025-06-06 09:36
Core Insights
- The article discusses the evolution of reasoning models, particularly focusing on the introduction of the SophiaVL-R1 model, which incorporates a "thinking reward" mechanism to enhance reasoning quality and generalization capabilities [3][5][13].

Group 1: Model Development
- The SophiaVL-R1 model represents a significant advancement over previous models by not only rewarding correct answers but also evaluating the reasoning process behind those answers [3][7].
- This model has demonstrated superior performance in various mathematical and multimodal benchmark tests, outperforming larger models such as LLaVA-OneVision-72B, which has ten times the parameters [5][20].

Group 2: Thinking Reward Mechanism
- The introduction of the "thinking reward" mechanism allows for a more comprehensive assessment of the reasoning process, ensuring that models learn effective reasoning strategies rather than relying on shortcuts [7][13].
- A specially designed dataset was created to score the reasoning processes, which includes diverse thinking patterns and errors, leading to the development of a "thinking scoring model" [10][11].

Group 3: Trust-GRPO Algorithm
- To address the issue of reward hacking, the SophiaVL-R1 model employs the Trust-GRPO training algorithm, which assesses the credibility of thinking rewards based on comparative analysis of correct and incorrect answers [17][18].
- This algorithm enhances the stability and reliability of the training process by adjusting the credibility scores of rewards when discrepancies are detected [18].

Group 4: Performance Metrics
- In various evaluation benchmarks, SophiaVL-R1-7B has shown remarkable reasoning and generalization abilities, achieving scores that directly compete with or exceed those of significantly larger models [20][21].
- The model's performance in specific benchmarks includes a score of 61.3 in MMMU and 2403.8 in MME, showcasing its effectiveness [21][23].

Group 5: Experimental Validation
- Ablation studies indicate that all components of the SophiaVL-R1 model contribute effectively to its overall performance, with evidence showing faster and better training outcomes [22][23].
Alibaba's agent surpasses GPT-4o at multi-turn reasoning: open-source models can do Deep Research too
量子位· 2025-06-06 04:01
Group 1
- The core viewpoint of the article is the introduction of WebDancer, an advanced autonomous information retrieval agent developed by Tongyi Lab, which addresses the growing demand for multi-step information retrieval capabilities in an era of information overload [1][2][3].

Group 2
- Background: The traditional search engines are insufficient for users' needs for deep, multi-step information retrieval across various fields such as medical research, technological innovation, and business decision-making [3].
- Challenges: Building autonomous agents faces significant challenges, particularly in obtaining high-quality training data necessary for complex multi-step reasoning [4].

Group 3
- Innovative Data Synthesis: WebDancer proposes two innovative data synthesis methods, ReAct framework and E2HQA, to address data scarcity [5][6].
- ReAct Framework: This framework involves a cycle of Thought-Action-Observation, enabling the agent to generate thoughts, take structured actions, and receive feedback iteratively [5].

Group 4
- Training Strategies: WebDancer employs a two-phase training strategy, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), to enhance the agent's adaptability and decision-making capabilities in dynamic environments [12][13].
- Data Quality Assurance: A multi-stage data filtering strategy is implemented to ensure high-quality training data, enhancing the agent's learning efficiency [9][10].

Group 5
- Experimental Results: WebDancer has demonstrated outstanding performance in various information retrieval benchmark tests, particularly excelling in the GAIA and WebWalkerQA datasets [17][18][19].
- Performance Metrics: The best-performing models achieved a Pass@3 score of 61.1% on the GAIA benchmark and 54.6% on the WebWalkerQA benchmark, showcasing their robust capabilities [20].

Group 6
- Future Prospects: WebDancer aims to integrate more complex tools and expand its capabilities to handle open-domain long-text writing tasks, enhancing the agent's reasoning and generative abilities [29][30].
- Emphasis on Agentic Models: The focus is on developing foundational models that inherently support reasoning, decision-making, and multi-step tool invocation, reflecting a philosophy of simplicity and universality in engineering [30][31].
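The Thought-Action-Observation cycle mentioned under the ReAct framework can be sketched as a small loop. This is a generic illustration and not WebDancer's implementation: the `answer` terminal action, the `search` tool, and the scripted toy policy are all hypothetical stand-ins for a real language model and real web tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A policy maps (question, trajectory so far) to (thought, action, argument).
Policy = Callable[[str, List[dict]], Tuple[str, str, str]]


def react_loop(policy: Policy, tools: Dict[str, Callable[[str], str]],
               question: str, max_steps: int = 5
               ) -> Tuple[Optional[str], List[dict]]:
    """Run Thought -> Action -> Observation until the policy answers or steps run out."""
    trajectory: List[dict] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trajectory)  # Thought and Action
        if action == "answer":  # hypothetical terminal action
            return arg, trajectory
        observation = tools[action](arg)  # Observation returned by the tool
        trajectory.append({"thought": thought, "action": action,
                           "arg": arg, "observation": observation})
    return None, trajectory  # step budget exhausted without an answer


def toy_search(query: str) -> str:
    """Stand-in for a real search tool."""
    return "Paris is the capital of France."


def toy_policy(question: str, trajectory: List[dict]) -> Tuple[str, str, str]:
    """Scripted stand-in for an LLM: search once, then answer."""
    if not trajectory:
        return ("I should look this up.", "search", question)
    return ("The observation contains the answer.", "answer", "Paris")


answer, steps = react_loop(toy_policy, {"search": toy_search},
                           "What is the capital of France?")
```

The recorded trajectory is the kind of multi-step trace that could be filtered and reused as training data for the SFT and RL stages described above.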
赛道Hyper | ByteDance's VMR²L system achieves second-level inference in production engineering
Hua Er Jie Jian Wen· 2025-06-06 03:22
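By Zhou Yuan / Hua Er Jie Jian Wen

VMR²L is a virtual machine rescheduling system. Its full name is Versatile Multi-agent Reinforcement Learning with Real-time Reasoning, that is, a general-purpose multi-agent reinforcement learning system with real-time reasoning capability.

In addition, it uses a two-stage agent architecture that filters out illegal actions through explicit constraints, so industrial scheduling rules such as resource capacity and affinity limits are satisfied by construction, with a generalization error below 5% across different load scenarios.

Test data show that in a typical cloud computing cluster, VMR²L can raise resource utilization by 18%-22% and cut migration time from minutes to seconds, offering a feasible approach to real-time resource scheduling in high-density data centers.

On June 5, the ByteDance technical team's WeChat official account reported that the ByteBrain team at ByteDance, together with the University of California, Merced (UC Merced) and the University of California, Berkeley (UC Berkeley), has proposed VMR²L, a VMR system based on deep reinforcement learning. While keeping near-optimal performance, it compresses inference time to 1.1 seconds, bringing system performance and industrial deployability together.

Using deep reinforcement learning, VMR²L compresses the inference time for virtual machine resource scheduling to 1.1 seconds while keeping resource optimization results close to those of the traditional mixed-integer programming (MIP) approach, for cloud computing, data ...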
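Filtering illegal actions through explicit constraints is commonly done by masking the policy's logits before sampling, so that forbidden choices get exactly zero probability. The sketch below shows that general technique for the host-selection step only. It is an assumption about how such a filter could look, not ByteDance's code: the CPU-only capacity check, the `forbidden` set standing in for affinity rules, and both function names are hypothetical.

```python
import math
from typing import List, Set


def legal_hosts(vm_cpu: int, free_cpu: List[int], forbidden: Set[int]) -> List[bool]:
    """Mark hosts that have enough free CPU and are not excluded by an affinity rule."""
    return [free >= vm_cpu and idx not in forbidden
            for idx, free in enumerate(free_cpu)]


def masked_softmax(logits: List[float], legal: List[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(legal):
        raise ValueError("no legal action available")
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal)]
    top = max(masked)  # finite, because at least one action is legal
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

For example, a VM needing 4 CPUs with `free_cpu=[2, 8, 16]` and host 2 forbidden leaves only host 1 legal, so the masked policy places all probability on it regardless of the raw logits.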
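121K hard math problems sharply boost model performance, spanning the difficulty of top competitions such as FIMO and Putnam; from Tencent and Shanghai Jiao Tong University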
量子位· 2025-06-06 00:58
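Contributed by the DeepTheorem team
量子位 | WeChat official account QbitAI

121K IMO-level "special training problems" teach AI to derive mathematical proofs the way humans do. After this training, the model's theorem-proving performance rises sharply, and a 7B model matches or surpasses existing open-source models and commercial models such as Claude3.7.

The "special training problems" are DeepTheorem, the first natural-language-based framework and dataset for mathematical theorem proving, released jointly by Tencent AI Lab and a team from Shanghai Jiao Tong University.

The team notes that theorem proving is an important part of the mathematical frontier. However, current large language models (LLMs) doing mathematical reasoning, especially when trained with reinforcement learning (RL), usually need answers that can be verified automatically, so they cannot prove theorems in natural language the way mathematicians do.

Figure (b) shows the performance of the RL-trained DeepTheorem-7B model, which matches or surpasses existing open-source and commercial models (Gemini2.0-flash, Qwen2.5-72B-Instruct, Claude3.7, and others) and trails only the strong reasoning models o1, o3, and Gemini2.5-pro.

DeepTheorem-121K

1. Scale and difficulty: built for the "extreme challenge"

The defining features of the DeepTheorem training set are its large scale and high difficulty. It contains 121K ...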