Reinforcement Learning (RL)
Recommender systems enter the "dual-engine" era! A deep dive into the first survey of LLM-RL synergistic recommendation
机器之心· 2026-03-03 02:55
Reinforcement learning (RL) models recommendation as a sequential decision process, enabling optimization of long-term benefit and non-differentiable metrics, and is one of the mainstream modeling paradigms in recommender systems. Traditional RL recommenders, however, are bottlenecked by hard state modeling, huge action spaces, complex reward design, sparse and delayed feedback, and unrealistic simulated environments. The recent rise of large language models (LLMs) brings new opportunities: with their commonsense knowledge, reasoning ability, and semantic fluency, LLMs can both make agents understand users better and serve as high-fidelity environment simulators. Combining LLMs with RL opens a new paradigm of smarter, more robust, and more trustworthy LLM-RL synergistic recommender systems. Targeting this emerging direction, a research team has jointly released the first systematic survey focused on LLM-RL synergistic recommendation. The paper proposes five mainstream synergy paradigms, summarizes the evaluation framework, and analyzes current key challenges and future directions, offering researchers and engineers a one-stop reference covering method paradigms, evaluation, the state of research, and open directions for innovation.
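To make the survey's sequential-decision framing concrete, here is a minimal toy sketch of recommendation as an MDP in Python. The `RecEnv` class, its history-based state, and its proximity-based engagement reward are illustrative assumptions, not the survey's actual formulation:

```python
from dataclasses import dataclass, field

@dataclass
class RecEnv:
    """Toy recommendation MDP: state = recent interaction history,
    action = a recommended item id, reward = simulated engagement."""
    catalog_size: int = 1000
    history_len: int = 10
    history: list = field(default_factory=list)

    def reset(self):
        self.history = []
        return tuple(self.history)

    def step(self, item_id: int):
        # Hypothetical engagement model: the user "likes" items whose ids
        # are close to the last interaction (a stand-in for real feedback).
        liked = bool(self.history) and abs(item_id - self.history[-1]) < 50
        reward = 1.0 if liked else 0.0
        self.history = (self.history + [item_id])[-self.history_len:]
        done = len(self.history) >= self.history_len
        return tuple(self.history), reward, done
```

An RL policy trained against `step` faces exactly the sparse, delayed feedback the survey describes, which is where an LLM can plug in as a richer user simulator or reward model.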
First evidence that RL can teach 3D models to reason: generation quality leaps on complex text descriptions
36Kr· 2026-02-27 02:33
RL has already posted impressive results in image generation; what about 3D generation? After GRPO drove qualitative leaps in LLM math and code reasoning, a research team has delivered the first answer: the first work to systematically introduce reinforcement learning into text-to-3D autoregressive generation, now accepted to CVPR 2026. Rather than simply transplanting 2D experience, the study runs a complete, systematic exploration of 3D generation's unique challenges, covering reward design, algorithm choice, evaluation benchmarks, and training paradigms.

The core tension: a 3D object has no "canonical view". A human can judge an image at a glance, but a 3D object must be evaluated from multiple viewpoints simultaneously for geometric consistency, texture quality, and semantic alignment; get any one dimension wrong and training collapses. Deeper still, during autoregressive decoding every token carries an implicit commitment to the overall structure. This long-range dependency makes reward sparsity even more severe in 3D than in 2D: the model can hardly tell mid-sequence where things went wrong.

The team breaks the problem into four systematically studied dimensions: how to design the reward model (which reward signals are most effective for 3D generation?), and how to choose the RL algorithm (which GRPO variants suit 3D's sequential characteristics?). Why is 3D so much harder than 2D? RL works reliably for text and image generation, but porting it directly to 3D fails. The most surprising finding: a general-purpose model (Qwen2.5-VL) evaluates 3D consistency better than dedicated mod ...
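The article does not spell out the paper's reward function, but the multi-view aggregation idea it describes can be sketched as follows. `geo`, `tex`, and `sem` are hypothetical scorer callables (the semantic one could be a VLM such as Qwen2.5-VL acting as a judge), and the mean/min blend is one plausible way to make a single bad viewpoint fatal, not the paper's design:

```python
from typing import Callable, Sequence
import numpy as np

def multiview_reward(views: Sequence, prompt: str,
                     geo: Callable, tex: Callable, sem: Callable,
                     weights: tuple = (0.4, 0.3, 0.3)) -> float:
    """Score a generated 3D object from several rendered viewpoints.

    views: images rendered at evenly spaced camera angles around the object.
    geo / tex / sem: scorers for geometric consistency, texture quality,
    and text-shape semantic alignment (all injected, all assumptions).
    """
    w_geo, w_tex, w_sem = weights
    scores = [w_geo * geo(v) + w_tex * tex(v) + w_sem * sem(v, prompt)
              for v in views]
    # Blending the mean with the min penalizes any single bad viewpoint,
    # since one broken view should sink the whole reward.
    return 0.5 * float(np.mean(scores)) + 0.5 * float(np.min(scores))
```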
ICLR 2026 Workshop opens its second call for papers: focusing on learning, alignment, and evolution for lifelong agents
机器之心· 2026-02-05 07:52
Core Insights
- Artificial Intelligence is at a new turning point: AI Agents built on Large Language Models (LLMs), Reinforcement Learning (RL), and Embodied AI are rapidly emerging, showing multi-dimensional capabilities such as planning, reasoning, tool use, and autonomous decision-making [2]
- The current mainstream paradigm faces critical bottlenecks, necessitating a shift toward Lifelong Agents that continuously learn, stay aligned over the long term, evolve autonomously, perceive resources, and can be sustainably deployed [2]

Workshop Overview
- The Lifelong Agent Workshop, initiated by institutions including UIUC, Edinburgh, Oxford, and Princeton for the ICLR 2026 conference, aims to create a cross-disciplinary forum that systematically advances the Lifelong Agent research paradigm [3]
- The workshop will address key issues around Lifelong Agents, spanning language intelligence, reinforcement learning, embodied systems, multi-agent collaboration, and AI for science, and seeks to define the next technological milestone for Agent development [3]

Challenges in Lifelong Learning
- Catastrophic forgetting remains a significant challenge when models face dynamic, out-of-distribution (OOD) tasks, and alignment consistency degrades as user goals, environmental feedback, and contextual constraints evolve over time [4]
- Real-world operational constraints (compute, token, energy, and interaction costs) hinder the sustainability of these systems [4]

Workshop Details
- The workshop is scheduled for April 26-27, 2026, in Rio de Janeiro, with a hybrid format for participation [8]
- Expected attendance is 200-400 in-person participants and 500-600 online attendees [8]

Submission Topics
- The workshop encourages cross-disciplinary research on long-horizon operational Agents, particularly Lifelong Learning, Lifelong Alignment, Self-Evolving Agents, and Embodied & Real-World Lifelong Agents [7]
- Specific topics include memory-augmented RL, continual exploration, user goal change modeling, and multi-agent lifelong collaboration ecosystems [9][10]

Future Directions
- Lifelong Agents represent an upgrade of the intelligence paradigm, aiming at stable, autonomous, and sustainably growing systems that contribute to scientific discovery and cross-modal interaction [11]
- The workshop seeks to push Lifelong Agents toward the field's next significant advance, addressing resource-constrained learning and reasoning [12]
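As a concrete reference point for the forgetting challenge above, here is a minimal sketch of rehearsal, one classical mitigation: keep a buffer of old-task examples and mix them into each new-task batch so updates keep revisiting past tasks. This is a generic baseline, not a method from any workshop submission:

```python
import random
from collections import deque

class ReplayRehearsal:
    """Minimal rehearsal buffer against catastrophic forgetting."""

    def __init__(self, capacity: int = 10_000, old_fraction: float = 0.3):
        self.buffer = deque(maxlen=capacity)  # evicts oldest when full
        self.old_fraction = old_fraction

    def store(self, example) -> None:
        self.buffer.append(example)

    def mix_batch(self, new_batch: list) -> list:
        # Blend a fraction of stored old-task examples into the new batch.
        k = min(int(len(new_batch) * self.old_fraction), len(self.buffer))
        return list(new_batch) + random.sample(list(self.buffer), k)
```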
After Clawdbot: what do we still lack before Agents can be deployed at scale?
Founder Park· 2026-02-03 12:31
This article originally appeared on Monolith砺思资本 (author: MONOLITH).

OpenClaw (formerly Clawdbot) has gone viral. For individual geeks, OpenClaw is fun; but in enterprise and commercial settings its problems surface immediately: it is expensive (burns tokens), hard to control (fuzzy security boundaries), raises privacy concerns, and is difficult to use collaboratively. It is fair to say that today's Agents are still mostly impressive demos, not products that can scale.

Monolith砺思资本 hosted an "After the Model" technical salon on exactly this question: what hard problems remain before Agents can be deployed at scale? A point raised repeatedly at the event: an Agent needs to be a system that works continuously, not a one-off task that happens to complete. That means "model intelligence" alone is far from enough; to cross the engineering chasm, teams must grind through a few hard metrics: stability, high throughput, cost control, and precise state management. Below are the core insights from the event, for practitioners' reference.
What stunning results emerge when world models, VLA, and reinforcement learning are combined?
具身智能之心· 2026-01-15 00:32
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) models for general robotic manipulation, noting their reliance on expert demonstration data, which limits their ability to learn from failures and self-correct [2]
- It introduces WMPO, a world-model-based policy optimization method that improves sample efficiency and overall performance in reinforcement learning (RL) without requiring real-world interaction [3]

Group 1
- VLA models show strong potential on robotic tasks but struggle to self-improve because of their dependence on expert data [2]
- Reinforcement learning can address this limitation by enabling self-improvement through autonomous interaction with physical environments, though it suffers high sample complexity when applied to real robots [2]
- WMPO works on pixel-level prediction tasks, aligning "imagined" trajectories with VLA features pre-trained on large-scale web images, yielding performance superior to conventional offline methods [3]

Group 2
- WMPO demonstrates significant advantages, including improved sample efficiency, better overall performance, emergent self-correcting behaviors, and robust generalization and lifelong-learning capability [3]
- The article links to the WMPO paper and project homepage for further exploration [4]
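The summary does not reproduce WMPO's objective, so the sketch below only shows the generic pattern it alludes to: roll the policy out inside a learned world model and update from imagined returns, with no real-robot interaction. The `world_model` and `policy` interfaces and the REINFORCE-style loss are assumptions for illustration:

```python
import torch

def imagined_policy_update(policy, world_model, optimizer,
                           s0: torch.Tensor, horizon: int = 16,
                           gamma: float = 0.99) -> float:
    """One REINFORCE-style update on a trajectory imagined by a frozen
    world model; the environment is never touched."""
    s, log_probs, rewards = s0, [], []
    for _ in range(horizon):
        dist = policy(s)                      # action distribution for state s
        a = dist.sample()
        log_probs.append(dist.log_prob(a).sum(-1))
        with torch.no_grad():                 # the world model stays frozen
            s, r = world_model(s, a)          # imagined next state and reward
        rewards.append(r)

    # Discounted return-to-go at every step of the imagined rollout.
    returns, g = [], torch.zeros_like(rewards[0])
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)

    loss = -torch.stack([lp * g for lp, g in zip(log_probs, returns)]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```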
Huawei launches SWE-Lego, a software-engineering code agent that unlocks peak performance from SFT training
机器之心· 2026-01-13 04:08
"软工任务要改多文件、多轮工具调用,模型怎么学透?高质量训练数据稀缺,又怕轨迹含噪声作弊?复杂 RL 训练成本高,中小团队望而却步?" 华为研究团队推出 SWE-Lego , 仅基于监督微调(SFT)的软件工程代码智能体,无需复杂 RL 流程,在 SWE-bench Verified 基准中斩获同等规模开源模型 SOTA,甚至超越部分更大规模闭源模型!项目已开源,代码、模型和 全部数据一键获取 ! SWE-Lego 具有 三大创新,包括数据、训练和测试时扩展。 1. 混合数据集构建: 3. 测试时扩展策略(TTS): 引言 在软件工程领域,Code Agent 需要处理复杂的任务:修复 bug、重构代码、理解大型代码库。这些任务要求 Code Agent 具备 长序列推理、多文件操作和工具使用 等能力。现有的训练方法通常需要复杂的训练范式,比如强化学习(RL)或者 RL 和 SFT 的迭代组合。 这些方法虽然有效,但计算成本高,训练过程复杂。能否用更简单的方法达到同样的效果? 华为的研究团队提出了 SWE-Lego,一个仅基于监督微调(SFT)的软工代码模型的解决方案 。在 SWE-bench Verifie ...
How much further can whole-body motion control that makes robots "dance" better still evolve?
具身智能之心· 2026-01-04 00:32
Core Insights
- The article covers advances in reinforcement learning (RL) and its integration with various models in embodied intelligence and robotics, highlighting the importance of pretraining data quality and the innovative approaches being developed to enhance RL training paradigms [3][4][5]

Group 1: Reinforcement Learning Innovations
- The discussion emphasizes how RL training paradigms have standardized around imitation learning followed by reinforcement learning in simulated environments (a pipeline sketched after this list) [3][4]
- The Simple Policy Optimization (SPO) algorithm is highlighted in the context of the Pi0.6 model, where it serves as a baseline for RL tasks [3][4]
- The data used to pretrain models differs significantly across domains such as language modeling and autonomous driving, affecting model quality and applicability [4][5]

Group 2: Data Utilization and Challenges
- Using real-world driving data for pretraining is hard: only about 1% of collected data is suitable for model training due to various imperfections [4][5]
- RL has the potential to evaluate and exploit suboptimal data; even flawed data can contribute to learning, much as humans learn from mistakes [5][6]
- Effective data collection and utilization strategies are needed in embodied intelligence, given the high volume of data discarded during training [5][6]

Group 3: Framework Development
- The Rlinf framework is introduced to support RL for vision-language-action (VLA) models, addressing the gap left by existing frameworks that do not cater to RL's specific needs in VLA contexts [8][10]
- The framework supports diverse RL methodologies, including on-policy and off-policy learning, and accommodates a range of hardware requirements [10][11]
- Its development is a significant investment, reflecting the growing demand for robust RL tooling in embodied intelligence [10][11]

Group 4: Sim-to-Real Transfer and Practical Applications
- Sim-to-real transfer remains challenging, particularly for locomotion and manipulation tasks, where the gap between simulated and real-world performance is still substantial [19][29]
- 3D generative models are being explored to improve simulation realism and thereby the effectiveness of RL training [24][25]
- Advanced perception technologies such as dual-camera systems are a promising way to narrow the sim-to-real gap and improve real-world performance [22][29]
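The standard paradigm mentioned in Group 1 (imitation learning, then RL in simulation) looks roughly like this at pseudocode level; `policy`, `demos`, and `env` are generic assumptions, and REINFORCE stands in for PPO or SPO:

```python
import torch

def il_then_rl(policy, demos, env, bc_epochs: int = 10,
               rl_episodes: int = 1000, lr: float = 3e-4):
    """Two-stage training: behavior cloning on expert demos, then
    on-policy RL fine-tuning on simulated rollouts."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    # Stage 1: imitation -- maximize likelihood of expert actions.
    for _ in range(bc_epochs):
        for states, actions in demos:
            loss = -policy(states).log_prob(actions).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: RL fine-tuning in simulation.
    for _ in range(rl_episodes):
        s, done, log_ps, total_r = env.reset(), False, [], 0.0
        while not done:
            dist = policy(s)
            a = dist.sample()
            log_ps.append(dist.log_prob(a).sum())
            s, r, done = env.step(a)
            total_r += float(r)
        # Undiscounted episodic return weights every action's log-prob.
        loss = -(total_r * torch.stack(log_ps)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return policy
```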
The large-model "scaling law" paradox: the stronger the RL (reinforcement learning), the further away AGI (general intelligence)?
硬AI· 2025-12-24 08:10
Core Argument
- The over-reliance on Reinforcement Learning (RL) in AI development may be steering the industry away from Artificial General Intelligence (AGI), as current models cannot learn autonomously from experience the way humans do [3][4]

Group 1: Skills Preconditioning Paradox
- Current AI models depend on "pre-baked" skills, such as using Excel or browsing the web, which exposes their lack of general learning capability and indicates that AGI is not imminent [5]
- Embedding specific skills into models contradicts the essence of human-like learning, which does not require extensive pre-training for every task [4][17]

Group 2: Insights from Robotics
- The challenges in robotics stem from algorithmic issues rather than hardware limitations; if AI had human-like learning capabilities, robots would already be widely adopted without repetitive per-task training [6][13]

Group 3: Economic Implications of AI
- The argument that "technology diffusion takes time" is dismissed as a self-comforting excuse; if models truly possessed human-like intelligence, businesses would adopt them rapidly, given the lower risk and zero training requirement [7][19]
- The disparity between the trillions of dollars of value created by knowledge workers worldwide and the significantly lower revenue generated by AI models indicates these models have not reached the threshold for replacing human workers [8][22]

Group 4: The Importance of Continual Learning
- The key bottleneck for achieving AGI is continual learning, not merely stacking RL compute; true AGI may take another 10 to 20 years to realize [9][25]
- Solving continual learning is expected to be gradual, similar to the evolution of in-context learning capabilities, rather than yielding immediate breakthroughs [29][30]
This year's VLA+RL work is queuing up for acceptance......
具身智能之心· 2025-12-24 00:25
Core Insights
- The article emphasizes the role of Reinforcement Learning (RL) in enhancing the generalization of Vision-Language-Action (VLA) models, with some experiments showing performance gains of up to 42.6% on out-of-distribution tasks [2]

Group 1: VLA and RL Integration
- VLA models currently rely on RL to overcome their limits in real-world out-of-distribution scenarios, where imitation learning alone proves insufficient [2]
- Recent advances in VLA+RL frameworks have produced significant breakthroughs, with several notable papers published this year [2]
- Tooling for VLA+RL frameworks is maturing, with resources such as Rlinf offering a growing number of supported methods [2]

Group 2: Notable Research Papers
- A summary of representative VLA+RL papers from the past two years highlights their contributions to the field [5]
- Key papers include "NORA-1.5", a VLA model trained with world-model and action-based preference rewards, and "Balancing Signal and Variance", on adaptive offline RL post-training for VLA flow models [5][10]
- Other significant works include "ReinboT", which enhances robot vision-language manipulation through RL, and "WMPO", which optimizes policies based on world models for VLA [8][10]

Group 3: Future Research Directions
- Future research should track advances in VLA and RL; the article invites collaboration and consultation from those interested in exploring these areas [3]
This year most likely produced n VLA+RL papers, right?!
自动驾驶之心· 2025-12-23 03:43
Core Insights
- The article emphasizes the role of Reinforcement Learning (RL) in enhancing the generalization of Vision-Language-Action (VLA) models, with some experiments showing performance gains of up to 42.6% on out-of-distribution tasks [2]

Group 1: VLA and RL Integration
- VLA models currently rely on RL to overcome their limits in real-world out-of-distribution scenarios, where imitation learning alone proves insufficient [2]
- Recent advances in VLA+RL frameworks have produced significant breakthroughs, with several notable papers published this year [2]
- Tools supporting VLA+RL frameworks, such as Rlinf, are becoming increasingly comprehensive, offering a variety of methods for researchers [2]

Group 2: Notable Research Papers
- A summary of representative VLA+RL papers from the past two years indicates a growing body of work in this area [5]
- Specific papers mentioned include "NORA-1.5", "Balancing Signal and Variance", and "CO-RFT", which focus on different aspects of VLA-RL integration [5][10]
- The article encourages further research in these areas and offers assistance for those looking to explore VLA, real2sim2real, and RL [3]