Multi-Agent Reinforcement Learning
SPIRAL: Zero-Sum Game Self-Play Becomes a "Free Lunch" for Language Model Reasoning Training
机器之心· 2025-07-30 05:13
Core Insights
- The research introduces SPIRAL, a framework that uses self-play in zero-sum games to enhance reasoning capabilities in language models without relying on human supervision [3][33].
- Competitive self-play yields significant gains in reasoning skill, evidenced by an 8.7% increase in mathematical reasoning ability and an 18.1-percentage-point improvement on the Minerva Math benchmark [7][30].

Group 1: Research Background
- The collaboration involves institutions including the National University of Singapore and A*STAR, focusing on scalable autonomous agents capable of intelligent decision-making in unknown environments [1].
- The success of models such as OpenAI's o1 and DeepSeek-R1 highlights the potential of reinforcement learning to enhance reasoning capabilities in language models [2].

Group 2: SPIRAL Framework
- SPIRAL employs self-play in zero-sum games to autonomously discover and reinforce generalizable reasoning patterns, eliminating the need for manually designed reward functions and expert supervision [3][6].
- The framework uses a distributed online multi-agent reinforcement learning system to fine-tune large language models across various two-player zero-sum games [24].

Group 3: Game-Based Training
- The research identifies three games with distinct cognitive demands, TicTacToe, Kuhn Poker, and Simple Negotiation, as effective training environments for enhancing reasoning skills [12][11].
- The self-play mechanism provides adaptive difficulty: as the model improves, so does its opponent, ensuring continuous evolution of the model's capabilities [11].

Group 4: Transfer of Skills
- Reasoning patterns developed in games transfer to mathematical problem-solving, with specific skills such as expected-value calculation and case analysis showing significant migration rates [18][19].
- Multi-game training produces synergistic effects, outperforming single-game specialists on unfamiliar games [21].

Group 5: Technical Innovations
- Role-Aware Advantage Estimation (RAE) prevents "thinking collapse," ensuring stable gradient updates and consistent reasoning generation throughout training [26][28].
- SPIRAL remains effective even for strong models, with notable performance improvements on established benchmarks [30].

Group 6: Practical Implications
- SPIRAL offers researchers and engineers a way to enhance model reasoning capabilities without extensive high-quality reasoning data [35].
- The findings suggest that pre-trained models already contain diverse reasoning patterns, and reinforcement learning can help identify and strengthen those that are truly generalizable [35].

Group 7: Limitations and Future Directions
- Despite its successes, SPIRAL still requires carefully designed game environments and substantial computational resources [38].
- Future research may explore hybrid game types and meta-game learning to cultivate more comprehensive reasoning abilities [37].
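The role-conditioned baseline idea behind RAE can be sketched as follows. This is a minimal illustration, assuming a separate reward baseline per (game, role) pair updated by an exponential moving average; the class and parameter names are hypothetical, not the paper's actual implementation.

```python
# Sketch: role-aware advantage estimation for zero-sum self-play.
# Keeping a separate baseline per (game, role) removes the variance
# that asymmetric roles (e.g. first vs. second player) would otherwise
# inject into the policy gradient.
from collections import defaultdict

class RoleAwareAdvantage:
    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # keyed by (game, role)

    def update(self, game, role, reward):
        key = (game, role)
        b = self.baseline[key]
        # EMA update of the per-role baseline (assumed update rule)
        self.baseline[key] = self.decay * b + (1 - self.decay) * reward
        return reward - b  # advantage w.r.t. the pre-update baseline

est = RoleAwareAdvantage()
# If player 0 keeps winning TicTacToe, its baseline rises and the
# advantage shrinks, which damps the gradient instead of letting it grow.
advantages = [est.update("tictactoe", 0, 1.0) for _ in range(3)]
```

The point of the per-role key is that a win as the first mover and a win as the second mover are scored against different expectations, so neither role's gradient is systematically biased.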
Meta-Think ≠ Memorizing Patterns: Multi-Agent Reinforcement Learning Unlocks Generalizable Meta-Thinking in Large Models
机器之心· 2025-07-03 03:26
The first author of this paper is Wan Ziyu, a fourth-year PhD student in computer science at Shanghai Jiao Tong University whose research focuses on reinforcement learning and complex reasoning with foundation models. The corresponding authors are Wen Ying, associate professor at the School of Artificial Intelligence, Shanghai Jiao Tong University, and Hu Shuyue of the Shanghai AI Laboratory. Other team members include co-first author Li Yunxiang and Professor Mark Schmidt of the University of British Columbia; Song Yan, Yang Linyi, and Professor Wang Jun of University College London; and Wen Xiaoyu, Wang Hanjing, and Professor Zhang Weinan of Shanghai Jiao Tong University.

Introduction

Recent explorations of test-time scaling laws for large-model reasoning keep producing new paradigms, including ① structured search (e.g., MCTS), ② Process Reward Model (PRM) + PPO, and ③ verifiable reward + GRPO (DeepSeek-R1). Yet the mechanism behind when a large model produces an "aha moment" remains unclear. Several recent studies argue that reasoning patterns play an important role in reasoning ability. In a similar spirit, this work holds that the strength of a large model's complex reasoning essentially comes down to the strength of its meta-thinking. "Meta-thinking" means monitoring, evaluating, and controlling one's own reasoning process to achieve more adaptive and effective problem solving; it is …
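The verifiable-reward + GRPO paradigm listed above centers on group-relative advantages: sample several responses per prompt, score each with a checkable reward, and normalize within the group, so no learned value model is needed. A minimal sketch under those assumptions (the helper name is illustrative; real implementations typically add a small epsilon to the denominator):

```python
# Sketch: group-relative advantage computation in the style of GRPO.
# Each reward comes from a verifiable check (e.g. 1.0 if the final
# answer matches the reference, else 0.0); normalizing within the
# sampled group replaces a learned critic.
import statistics

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)   # population std over the group
    if std == 0:
        # All responses scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to one math problem, two of them correct
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

Correct responses get positive advantage and incorrect ones negative, purely from within-group comparison, which is what lets a binary verifiable reward drive PPO-style updates without a process reward model.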
JD Group Algorithm Director Han Ai Will Present a Reinforcement-Learning-Based Joint Evolution Algorithm for Heterogeneous Multi-Agent Systems at AICon Beijing
AI前线· 2025-06-20 02:47
On June 27-28, AICon, the Global Artificial Intelligence Development and Application Conference, opens in Beijing. The conference brings together cutting-edge AI technology and deployment practice, gathering 50+ senior experts from major companies such as Tencent, Alibaba, Baidu, and ByteDance, as well as AI companies such as Zhipu, SiliconFlow, HiDream, and SoundAI, for in-depth discussions of AI agents, multimodal applications, inference performance optimization, and concrete applications of AI in software development, data analysis, business operations, and other scenarios.

Han Ai, Algorithm Director at JD Group, has confirmed attendance and will deliver a talk titled "JDAgents-R1: A Joint Evolution Algorithm for Heterogeneous Multi-Agent Systems Based on Reinforcement Learning." Multi-agent reinforcement learning (MARL) has become an important paradigm for handling increasingly complex tasks, yet joint evolution among heterogeneous agents still faces challenges such as low cooperation efficiency and unstable training. To address this, JD proposes JDAgents-R1, a joint-evolution algorithm framework for MARL that, for the first time, applies Group Relative Policy Optimization (GRPO) to the joint training of heterogeneous agents. By iteratively optimizing the agents' large language models (LLMs) and an adaptive memory mechanism, JDAgents-R1 achieves a dynamic balance between decision-making and memory capabilities while effectively reducing redundant reasoning and accelerating training convergence. In general scenarios as well as merchant …
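The iterative scheme described, alternating optimization of each agent's policy and its adaptive memory, might look like the following control-flow skeleton. Since JDAgents-R1's implementation is not public, every name here is hypothetical and the stub classes exist only to make the loop runnable.

```python
# Hypothetical sketch of a joint-evolution training loop: in each round,
# every heterogeneous agent in turn collects a shared rollout, updates
# its own policy, then refreshes its adaptive memory.

def joint_evolution(agents, env, rounds=2):
    schedule = []
    for r in range(rounds):
        for agent in agents:
            trajectory = env.rollout(agents)   # all agents act together
            agent.update_policy(trajectory)    # only this agent learns now
            agent.update_memory(trajectory)    # refresh its memory store
            schedule.append((r, agent.name))
    return schedule

class StubAgent:
    def __init__(self, name):
        self.name = name
        self.memory = []
    def update_policy(self, trajectory):
        pass                                   # placeholder for an RL update
    def update_memory(self, trajectory):
        self.memory.append(trajectory)

class StubEnv:
    def rollout(self, agents):
        return [a.name for a in agents]        # placeholder trajectory

agents = [StubAgent("planner"), StubAgent("executor")]
order = joint_evolution(agents, StubEnv(), rounds=1)
```

Alternating per-agent updates against a shared rollout is one common way to stabilize heterogeneous joint training; whether JDAgents-R1 schedules updates this way is an assumption.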
China's AI Schools: Wang Jun and His Students
投资界· 2025-03-04 07:41
The following article is from Leiphone (雷峰网); author: Lai Wenxin; editor: Chen Caixian.

"Half of China's reinforcement learning research."

As a research branch with decades of history in AI, reinforcement learning remains evergreen.

From Recommender Systems to Reinforcement Learning

One summer afternoon in 2006, Wang Jun boarded a train from the small Dutch city of Delft to the capital, Amsterdam, where he would catch a flight to Seattle to attend the 29th ACM SIGIR conference.

Information retrieval was then at its zenith, and with search also the core business of the three giants Microsoft, Yahoo, and Google, ACM SIGIR drew the top talent of academia and industry each year for the information retrieval community's "annual gathering."

In the conference hall at the University of Washington, Wang Jun received the Best Doctoral Consortium award to a round of applause, taking the field's highest doctoral honor a year before finishing his PhD.

The spirited young researcher could not have imagined at that moment that 15 years later he would be nominated for a Test of Time award: by 2021, Wang Jun had already spent years in reinforcement learning (RL) and, as one of its initiators, had founded the Chinese RL community RL China, training a cohort of outstanding young researchers and becoming a "grand master" of the field.

Wang Jun …
The UCL Reinforcement Learning School: Wang Jun and His Students
雷峰网· 2025-02-27 10:15
Core Viewpoint
- The article discusses the evolution and significance of reinforcement learning (RL) in China, highlighting key figures and their contributions to the field, particularly focusing on Wang Jun and his influence on the development of RL research and education in China [2][46].

Group 1: Historical Context and Development
- Wang Jun's journey in AI began with information retrieval and recommendation systems, where he achieved significant academic recognition [4][8].
- His transition to reinforcement learning was influenced by his experiences in advertising, where he recognized the parallels between decision-making in advertising and RL principles [12][14].
- The establishment of the RL China community marked a pivotal moment in promoting RL research and education in China, addressing the lack of resources and formal education in the field [49][50].

Group 2: Contributions and Innovations
- Wang Jun and his students have made substantial contributions to RL, including the development of SeqGAN and IRGAN, which integrate RL with generative adversarial networks for improved performance in various applications [23][24].
- The introduction of multi-agent systems has been a significant research focus, with applications in complex environments such as advertising and gaming [27][28].
- The establishment of MediaGamma allowed for practical applications of RL in real-time advertising, showcasing the commercial viability of RL algorithms [17][18].

Group 3: Educational Initiatives and Community Building
- The formation of RL China has facilitated knowledge sharing and collaboration among researchers and students, significantly enhancing the learning environment for RL in China [49][52].
- The publication of "Hands-On Reinforcement Learning" has provided accessible educational resources, bridging the gap between theory and practice for students [53].
- Wang Jun's mentorship has fostered a new generation of RL researchers, emphasizing the importance of exploration and innovation in academic pursuits [26][43].

Group 4: Future Directions and Challenges
- The integration of RL with large models and embodied intelligence represents a promising frontier for future research, aiming to address the challenges of generalization across different tasks and environments [56][62].
- The ongoing exploration of RL applications in real-world scenarios, such as robotics and automated decision-making, highlights the potential for RL to impact various industries significantly [61][62].
- Despite setbacks in some projects, the commitment to advancing RL research and its applications remains strong among Wang Jun and his students [56][62].