Workflow
Reinforcement Learning
The open-source RL framework Verlog is here, purpose-built for LLM agents: 400 turns is no problem
机器之心· 2025-10-08 04:13
Core Insights
- The article discusses the challenges intelligent agents face in maintaining clear reasoning and robust decision-making over long-horizon tasks, particularly when a task extends to hundreds of steps [2][3]
- It introduces Verlog, a multi-turn reinforcement learning framework designed to handle long-horizon tasks effectively, overcoming the limitations of traditional frameworks [3][20]

Group 1: Framework Overview
- Verlog is built on the foundations of VeRL and BALROG, incorporating specialized optimization techniques to ensure stable and efficient training across tasks that can extend beyond 400 steps [3][20]
- The framework has been validated in complex environments such as BabyAI, BabaIsAI, and Crafter, demonstrating strong performance on tasks with varying episode lengths [3][19]

Group 2: Methodology
- The base model for Verlog is the Qwen-2.5 Instruct variant, which allows seamless integration with BALROG and lets the benchmark's testing prompts be reused with minimal modification [6][7]
- A memory mechanism retains only the latest n + 1 rounds of interaction, optimizing performance for the 3B-parameter Qwen model [9][10]

Group 3: Algorithmic Innovations
- The Dual Discounting GAE algorithm is introduced to decouple tokens from environment steps, encouraging agents to complete tasks in fewer steps (see the sketch after this summary) [11][20]
- The recursive calculation of GAE enhances training stability, enabling effective learning even under sparse rewards [12][14]

Group 4: Experimental Results
- Verlog was tested on three challenging benchmarks: Crafter, BabyAI, and BabaIsAI, showcasing its ability to adapt to long-duration tasks with sparse rewards [16][19]
- Training the Qwen2.5-7B-Instruct model in the Crafter environment used 8 H100 GPUs for approximately 36 hours, while the Qwen2.5-3B-Instruct model for BabyAI and BabaIsAI was trained on 4 A40 GPUs for about 24 hours [19]

Group 5: Future Directions
- Verlog aims to serve as a flexible research platform to advance long-horizon LLM-agent reinforcement learning [20][21]
- The framework addresses key engineering challenges such as managing long interaction histories, ensuring training stability under sparse rewards, and handling variable trajectory lengths [20][23]
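To make the dual-discounting idea above concrete, below is a minimal sketch of a GAE recursion that discounts across environment-step boundaries separately from token positions within a turn. It is an illustrative reconstruction under assumed inputs (per-token rewards and values, a `step_boundary` mask), not Verlog's actual implementation; the single `lam` and the default constants are also assumptions.

```python
import numpy as np

def dual_discount_gae(rewards, values, step_boundary,
                      gamma_step=0.99, gamma_token=1.0, lam=0.95):
    """Illustrative dual-discounting GAE (not Verlog's actual code).

    Applies gamma_step only when crossing an environment-step boundary
    and gamma_token between tokens within a turn, so long responses
    inside a single step are not discounted like long task horizons.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        gamma = gamma_step if step_boundary[t] else gamma_token
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

# Toy usage: six tokens spanning two environment steps.
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 2.0])
values = np.zeros(6)
step_ends = np.array([False, False, True, False, False, True])
print(dual_discount_gae(rewards, values, step_ends))
```

Setting `gamma_token = 1.0` leaves tokens inside a step undiscounted, so only the number of environment steps, not response length, bears the horizon penalty.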
We are looking for partners in the embodied intelligence field...
具身智能之心· 2025-10-08 02:49
Core Viewpoint
- The company is seeking collaboration with global practitioners in the embodied intelligence field to enhance capabilities in areas such as technical services, training, course development, and research guidance [1]

Group 1: Collaboration Opportunities
- There is increasing demand from partners and small companies for the company to empower them through solutions, data collection, technology upgrades, and corporate training [1]
- The company invites outstanding partners to join in driving significant industry progress [1]

Group 2: Compensation and Resources
- The company will offer high compensation and abundant industry resources to collaborators [2]

Group 3: Focus Areas
- Key focus areas for collaboration include, but are not limited to: VLA, VLN, Diffusion Policy, Reinforcement Learning, VLA+RL, remote operation, motion capture, sim2real, multimodal large models, simulation, motion control, end-to-end systems, and 3D perception [3]

Group 4: Job Description
- The positions primarily target embodied course development, solution research and development, hardware development, and training collaboration, serving both B-end (enterprises, universities, research institutes) and C-end (students, job seekers) audiences [4]

Group 5: Contact Information
- Interested parties can add WeChat oooops-life for further inquiries [5]
A "blind" robot stuns in its 30-second parkour debut, performed entirely without sight!
具身智能之心· 2025-10-07 03:03
Core Insights
- The article discusses advances in humanoid robotics, focusing on Amazon's FAR (Frontier AI for Robotics) team and their new technology, OmniRetarget, which enables robots to perform complex tasks without visual sensors [9][49]

Group 1: OmniRetarget Technology
- OmniRetarget allows reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments, achieving zero-shot transfer from simulation to humanoid robots [12][29]
- The technology uses an interaction mesh to model spatial and contact relationships between the robot, objects, and terrain, improving data efficiency and reducing data-collection costs (see the sketch after this summary) [15][25]
- OmniRetarget outperforms other motion-retargeting methods in key areas such as hard constraints, object interaction, terrain interaction, and data augmentation [16][40]

Group 2: Experimental Results
- The research team demonstrated the broad capabilities of OmniRetarget, including natural object manipulation and terrain interaction, achieving a success rate of 79.1% on augmented datasets [39][42]
- In comparative tests, OmniRetarget showed superior performance on kinematic quality metrics, such as penetration and contact preservation, outperforming baseline methods [41][42]
- Its high-quality retargeted motions directly improve downstream reinforcement learning policy success rates by over 10% compared with baseline methods [42]

Group 3: Team and Background
- Amazon's recently established FAR team is led by prominent scholars from the robotics field, including researchers from the renowned company Covariant [43][44]
- The team aims to revolutionize automation in humanoid robotics, marking Amazon's first significant foray into this area [49][50]
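As a rough illustration of the interaction-mesh idea mentioned above, the sketch below connects source keypoints with a volumetric Delaunay mesh and scores a retargeted configuration by how much each vertex's Laplacian coordinate changes. This follows the classic interaction-mesh objective in generic form under assumed inputs (`src_pts`, `tgt_pts` as N x 3 arrays); it is not OmniRetarget's actual formulation, which also enforces hard constraints.

```python
import numpy as np
from scipy.spatial import Delaunay

def interaction_mesh_energy(src_pts, tgt_pts):
    """Laplacian-deformation energy over a volumetric mesh (illustrative).

    Connects keypoints (e.g., robot joints plus samples on the object
    and terrain) with a Delaunay tetrahedralization, then penalizes the
    change in each vertex's Laplacian coordinate between the source
    motion and the retargeted motion, which preserves relative spatial
    and contact relationships.
    """
    src_pts = np.asarray(src_pts, dtype=float)
    tgt_pts = np.asarray(tgt_pts, dtype=float)
    tets = Delaunay(src_pts).simplices           # (m, 4) tetrahedra
    neighbors = [set() for _ in range(len(src_pts))]
    for tet in tets:
        for i in tet:
            neighbors[i].update(int(j) for j in tet if j != i)
    energy = 0.0
    for i, nbrs in enumerate(neighbors):
        if not nbrs:                             # isolated vertex, skip
            continue
        idx = sorted(nbrs)
        lap_src = src_pts[i] - src_pts[idx].mean(axis=0)
        lap_tgt = tgt_pts[i] - tgt_pts[idx].mean(axis=0)
        energy += float(np.sum((lap_tgt - lap_src) ** 2))
    return energy
```

Minimizing this kind of energy during retargeting keeps the robot's limbs in the same relative arrangement around the object and terrain as in the human demonstration.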
Amazon's "blind" robot stuns in a 30-second parkour debut! Led by Chinese scholars
量子位· 2025-10-06 05:42
henry | 量子位 QbitAI

Have you ever seen a "blind" robot demo like this?

With no vision at all (no cameras, no lidar, no perception unit of any kind), it picks up a chair weighing 9 jin (about 4.5 kg), climbs onto a 1-meter-high table, and then somersaults off.

It isn't just showing off, either; when it comes to real work, carrying boxes is no problem. It can also leap straight onto a table, and climbing slopes on all fours works just as well.

These smooth combo moves come from OmniRetarget, the first humanoid (legged) robotics result released by Amazon's robotics team FAR (Frontier AI for Robotics)!

OmniRetarget enables reinforcement learning policies to learn long-horizon loco-manipulation skills in complex environments and achieves zero-shot transfer from simulation to humanoid robots.

Netizens commented: it can parkour and do real work; isn't that ten times better than Tesla's Optimus?

In addition, preserving task-relevant interactions enables efficient data augmentation, generalizing from a single demonstration to different robot embodiments, terrains, and object configurations, which reduces the cost of collecting data for each variant.

In comparisons with other motion-retargeting methods, OmniRetarget showed across-the-board advantages in every key aspect: hard constraints, object interaction, terrain interaction, and data augmentation.

| Methods | Hard Ki ...
What are the applications of reinforcement learning for robotic arms, quadrupeds, and humanoids?
具身智能之心· 2025-10-05 16:03
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) in developing embodied intelligent robots, highlighting its applications in complex tasks and the challenges newcomers face in the field [3][4][10]

Group 1: Reinforcement Learning Applications
- Reinforcement learning is crucial for gait control in humanoid and quadruped robots, enabling tasks such as climbing stairs, running, and dancing [3][9]
- The VLA+RL approach for robotic arms is gaining popularity in academia, improving the efficiency and smoothness of robot operation [4][9]

Group 2: Challenges in Learning and Research
- The complexity and breadth of reinforcement learning make it difficult for beginners to enter the field, often leading to frustration and abandoned studies [6][10]
- Lacking a comprehensive learning system can mean repeated mistakes and missed opportunities for aspiring researchers [7][10]

Group 3: Educational Offerings
- To address these challenges, the company has launched a 1v6 small-class paper-guidance program in reinforcement learning, aimed at graduate students and others who need paper guidance [7][8]
- The course comprises 14 weeks of concentrated online guidance followed by 8 weeks of maintenance support, covering paper idea confirmation, project implementation, experimental guidance, and writing refinement [10][12]

Group 4: Course Structure and Content
- Topics include paper direction and submission analysis, reinforcement learning basics, simulation environments, and writing guidance [10][18]
- Students can work on specific ideas involving quadruped robots, humanoid robots, and robotic arms, following a structured path toward a paper suitable for submission to top conferences [19][30]

Group 5: Expected Outcomes
- Participants are expected to produce a paper draft that meets the requirements of specific conferences or journals, with support through the writing and submission process [29][34]
- The course emphasizes a complete research cycle: methodology, engineering, evaluation, writing, submission, and maintenance [36]
From "knowing the problem" to "knowing the person": UserRL teaches agents to put users first
机器之心· 2025-10-05 06:42
"He who knows others is wise; he who knows himself is enlightened." (Tao Te Ching)

The ancients saw it long ago: true human wisdom lies not only in deriving formulas and mastering skills, but in understanding others and reading what is in their hearts. Today's large language models already complete tasks in code, mathematics, and tool use admirably, yet they still lack the "knowing people" ability that would make them genuine partners to their users. The root cause is that real-world interaction is far more complex than problem solving:

This is the next defining challenge for agents: moving from "solving problems" to "understanding users". Truly answering it requires new dynamic evaluation frameworks and training mechanisms, ones that not only measure how a model performs in interaction but also drive it to learn, in a world of uncertain and multi-objective users, to ask questions skillfully, make judgments with balance, and give answers grounded in evidence. To this end, a research team from UIUC and Salesforce proposed a systematic solution:

The two components complement each other, turning "user-centricity" from an idea into reproducible pipelines, interfaces, and evaluation metrics.

UserBench paper: https://arxiv.org/pdf/2507.22034
UserBench code repository: https://github.com/SalesforceAIResearch/UserBench

In real interaction, user goals are often not fully formed at the outset (underspecification), but instead ...
With just one demonstration, a robot can grasp anything like a human hand? DemoGrasp raises the ceiling for dexterous grasping
具身智能之心· 2025-10-04 13:35
Getting a robot to grasp objects dexterously with multiple fingers sounds simple, yet it has been a stubborn problem that has plagued robot manipulation for years. Imagine everything from picking up a phone and holding a cup to pinching a paper-thin sticky note or a button less than 3 cm across: motions humans take for granted are, for a robot, a high-difficulty challenge at every step.

To teach robots grasping skills, traditional reinforcement learning methods typically rely on repeated trial and error in a high-degree-of-freedom (DoF) action space. They require complex reward functions and training curricula, and the resulting policies often "learn to grasp a cup and forget how to grasp a card"; generalization is extremely poor. Worse still, a "grasping expert" trained in simulation tends to falter in real scenes: stripped of privileged information such as exact physical parameters and object contact points, relying only on RGB or depth camera input, and facing changes in lighting and background, its success rate falls off a cliff.

Small, thin objects are the nightmare of traditional methods: coins slip through the fingers, cards offer no purchase point, and grasping them without collision is like asking the machine ...
Peking University alumnus and Chinese scholar Chi Jin's new role: tenured associate professor at Princeton University
机器之心· 2025-10-04 05:30
Core Insights
- Chi Jin, a Chinese scholar, has been promoted to tenured associate professor at Princeton University, effective January 16, 2026, marking a significant milestone in his academic career and recognizing his foundational contributions to machine learning theory [1][4]

Group 1: Academic Contributions
- Jin joined Princeton's Department of Electrical Engineering and Computer Science in 2019 and has rapidly gained influence in the AI field over his six-year tenure [3]
- His work addresses fundamental challenges in deep learning, particularly why simple optimization methods like stochastic gradient descent (SGD) are effective in non-convex settings [8][12]
- His research has established a theoretical foundation for two core issues: efficiently training large, complex models, and ensuring these models are reliable and beneficial in human interactions [11]

Group 2: Non-Convex Optimization
- A main challenge in deep learning is non-convex optimization, where loss functions have multiple local minima and saddle points, complicating the optimization process [12]
- Jin has shown across multiple papers that even simple gradient methods can escape saddle points when a small amount of noise is present, allowing continued progress toward better solutions (see the sketch after this summary) [12][17]
- These findings provide a theoretical basis for the practical success of deep learning, alleviating concerns about the robustness of optimization in large-scale model training [18]

Group 3: Reinforcement Learning
- Jin's research has significantly advanced reinforcement learning (RL), particularly in establishing sample efficiency, which is crucial for applications with high interaction costs [19]
- He has provided rigorous regret bounds for foundational RL algorithms, proving that model-free algorithms such as Q-learning can remain sample-efficient even in complex settings [22]
- This theoretical groundwork not only answers academic questions but also guides the development of more robust RL algorithms for deployment in high-risk applications [23]

Group 4: Academic Background
- Jin holds a bachelor's degree in physics from Peking University and a Ph.D. in electrical engineering and computer science from the University of California, Berkeley, where he was advised by the renowned professor Michael I. Jordan [25]
- This background gave him the strong mathematical and analytical foundation essential to his theoretical research in AI and machine learning [25]

Group 5: Recognition and Impact
- Jin, along with other scholars, received the 2024 Sloan Award, highlighting his contributions to the field [6]
- His papers have garnered significant citations, with 13,588 total citations on Google Scholar, indicating the impact of his research in the academic community [27]
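To illustrate the escape mechanism in generic form, the sketch below implements perturbed gradient descent: when the gradient is nearly zero, indicating a candidate saddle point, a small random perturbation is injected before descent resumes. All names and hyperparameters (`noise_radius`, `perturb_every`, the fixed step budget) are illustrative choices, not taken from Jin's papers.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=0.01, noise_radius=0.1,
                               grad_tol=1e-3, perturb_every=50,
                               steps=10_000, seed=0):
    """Generic perturbed gradient descent (illustrative sketch).

    When the gradient is nearly zero (a candidate saddle point), add a
    small random perturbation before continuing descent; this line of
    work shows such noise suffices to escape strict saddle points
    efficiently.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -perturb_every
    for t in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < grad_tol and t - last_perturb >= perturb_every:
            x = x + rng.uniform(-1.0, 1.0, size=x.shape) * noise_radius
            last_perturb = t
        else:
            x = x - eta * g
    return x

# Example: f(x) = x0^2 - x1^2 + x1^4 / 4 has a strict saddle at the origin
# and minima at x1 = +/- sqrt(2); plain GD started at the origin never moves.
grad_f = lambda x: np.array([2.0 * x[0], -2.0 * x[1] + x[1] ** 3])
print(perturbed_gradient_descent(grad_f, x0=[0.0, 0.0]))
```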
Recent work that the head of Li Auto's foundation-model team is very pleased with: RuscaRL
理想TOP2· 2025-10-03 09:55
Core Viewpoint
- The article discusses the importance of reinforcement learning (RL) in enhancing the intelligence of large models, emphasizing the need for effective interaction between models and their environments to obtain high-quality feedback [1][2]

Summary by Sections

Section 1: Importance of Reinforcement Learning
- RL is crucial to the advancement of large-model intelligence, with a focus on enabling models to interact with broader environments to achieve capability generalization [1][8]
- Techniques such as RLHF (reinforcement learning from human feedback), RLAIF (reinforcement learning from AI feedback), and RLVR (reinforcement learning with verifiable rewards) are key areas of exploration [1][8]

Section 2: RuscaRL Framework
- The RuscaRL framework is introduced to address the exploration bottleneck in RL, drawing on scaffolding theory from educational psychology to enhance the reasoning capabilities of large language models (LLMs) [12][13]
- The framework uses explicit scaffolding and verifiable rewards to guide model training and improve response quality [13][15]

Section 3: Mechanisms of RuscaRL
- **Explicit Scaffolding**: provides structured guidance through rubrics, helping the model generate diverse, high-quality responses while gradually withdrawing external support as its capabilities improve [14]
- **Verifiable Rewards**: designs rewards based on rubrics, yielding stable and reliable feedback during training, which increases exploration diversity and keeps knowledge consistent across tasks (see the sketch after this summary) [15][16]

Section 4: Future Implications
- Both MindGPT and MindVLA, which target the digital and physical worlds respectively, could benefit from the advances made through RuscaRL, pointing to a promising future for self-evolving models [9][10]
- Current challenges in RL are not only algorithmic but also systemic, involving the integration of algorithms and infrastructure, which calls for innovative approaches to capability building [9]
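Below is a minimal sketch of how rubric-based verifiable rewards and decaying scaffolding could be wired together, assuming a rubric is a list of boolean checks over the response text. The linear decay schedule and every name here (`rubric_reward`, `build_prompt`, `decay_steps`) are illustrative assumptions, not RuscaRL's actual implementation.

```python
import random

def rubric_reward(response, rubric):
    """Verifiable reward: fraction of rubric checks the response passes."""
    return sum(bool(check(response)) for check in rubric) / len(rubric)

def build_prompt(question, rubric_text, step, decay_steps=1000):
    """Explicit scaffolding: show the rubric in the prompt with a
    probability that decays over training, so external guidance is
    gradually withdrawn as the model improves."""
    p_show = max(0.0, 1.0 - step / decay_steps)
    if random.random() < p_show:
        return f"{question}\n\nGrading rubric:\n{rubric_text}"
    return question

# Toy usage: two rubric checks applied to a generated answer.
rubric = [lambda r: "evidence" in r, lambda r: len(r.split()) >= 10]
answer = "The claim is supported by evidence from three independent trials today."
print(rubric_reward(answer, rubric))  # 1.0
```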
Dreams have it all? Google's new world model trains purely on "imagination" and learned to mine diamonds in Minecraft
机器之心· 2025-10-02 01:30
To solve complex tasks in embodied environments, an agent needs a deep understanding of the world and the ability to choose successful actions. World models offer a promising route to this goal: they learn to predict the future outcomes of potential actions from the perspective of the agent (such as a robot or a video game player).

In this way, world models give agents a deep understanding of the world and the ability to select actions by planning or running reinforcement learning in imagination. Furthermore, a world model can in principle be learned from a fixed dataset, letting an agent be trained purely in imagination without online interaction. Optimizing behavior offline is valuable for many practical applications, such as robots in the physical world, where online interaction with an undertrained agent is often unsafe.

World-model agents such as Dreamer 3 are among the best-performing and most robust reinforcement learning algorithms to date in games and robotics. While these models are fast and accurate in their specific narrow environments, their architectures lack the capacity to fit complex real-world distributions. Controllable video models such as Genie 3 have been trained on diverse real videos and games, achieving varied scene generation and simple interaction. These models are built on scalable architectures such as diffusion transformers. However, they still struggle to learn the precise physics of object interaction and game mechanics, which limits their usefulness for training successful agents. Moreover, they ...
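To make "training purely in imagination" concrete, here is a minimal Dreamer-style rollout sketch. The `world_model.step` and `policy` interfaces are assumptions for illustration; an agent like Dreamer 3 uses a more involved recurrent latent architecture and trains its actor and critic on the imagined trajectories.

```python
import torch

def imagine_rollout(world_model, policy, start_state, horizon=15):
    """Roll out a policy inside a learned world model (illustrative).

    Assumed interfaces: world_model.step(state, action) returns
    (next_state, reward) in latent space, and policy(state) returns a
    torch distribution. Actor and critic losses are then computed on
    these imagined trajectories rather than on real environment steps.
    """
    states, actions, rewards = [start_state], [], []
    state = start_state
    for _ in range(horizon):
        dist = policy(state)
        # Reparameterized sampling keeps the rollout differentiable
        # where the distribution supports it.
        action = dist.rsample() if dist.has_rsample else dist.sample()
        state, reward = world_model.step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
    return states, actions, rewards
```

Because every transition comes from the learned model rather than the environment, the same fixed dataset that trained the world model can, in principle, support unlimited imagined experience for policy learning.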