PPO Algorithm
Some Views and Reflections on Agentic RL Training and Inference Frameworks
自动驾驶之心· 2025-12-16 00:03
Core Viewpoint
- The article surveys the current landscape of Reinforcement Learning (RL) training frameworks, highlighting the diversity of open-source options and their specific strengths and weaknesses, with particular focus on the challenges of adapting these frameworks to multi-modal models interacting with real-world environments [2][3]

Summary by Sections

Overview of RL Frameworks
- The open-source community offers a wide variety of RL training frameworks, including established ones such as OpenRLHF, trl, unsloth, and verl, as well as newer entries such as slime, AReaL, RLinf, RL2, and ROLL [2]

Framework Selection Criteria
- The author emphasizes the need for a community-active framework that requires minimal code modification for environment adaptation, ultimately selecting AReaL for its flexibility in handling multi-turn interactions [3]

GPU Management in RL Training
- The article discusses the GPU orchestration challenges in RL training, noting that traditional frameworks often follow a synchronous training model, which can lead to inefficiencies and wasted resources [5][12]

Data Flow and Structure
- Data flow is crucial in RL training frameworks; verl uses a dedicated data format called DataProto for efficient data transfer, although this can become a burden in agentic RL scenarios [10][11]

Asynchronous vs. Synchronous Training
- Asynchronous RL training frameworks are highlighted for their efficiency, but they also introduce challenges such as data staleness (off-policy drift) and higher GPU resource consumption compared to synchronous designs [11][12]

Control Flow in RL Training
- Control flow remains primarily on the training side; the training process closely resembles standard LLM training and differs mainly in the loss function used [15]

Weight Transfer Between Engines
- The article details the complexities of transferring model weights from the training engine to the inference engine, particularly when the two engines use different model partitioning schemes [16][19]

Gaps in RL Training
- Two significant gaps are identified: the need for on-policy data in RL training, and the discrepancies in token distributions between rollout and prefill, which complicate the calculation of importance sampling (see the PPO-style sketch after this summary) [20][23]

Environment Adaptation and Reward Management
- Environment adaptation and reward calculation are central to agentic RL training; different frameworks handle these aspects differently, with AReaL and slime offering more flexible solutions [24][26]

Asynchronous Training Solutions
- AReaL's asynchronous training approach is presented as a mature solution, using a producer-consumer model to manage data flow efficiently (see the producer-consumer sketch after this summary) [29][30]

Partial Rollout Management
- Partial rollout is introduced as a way to handle in-flight tasks during model weight updates, allowing training to proceed without interrupting ongoing inference [37][38]

Insights on RL Algorithms
- The article closes with reflections on RL algorithms, discussing the difficulty of reward structuring and the potential benefits of staged training approaches [39][40]

Code Complexity and Usability
- The author notes that the code in frameworks like AReaL and verl, while well-engineered, is complex and may pose a steep learning curve for new users [43][44]
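To make the importance-sampling gap concrete, below is a minimal, hedged sketch of a PPO-style token-level clipped objective in which the importance ratio is computed between log-probabilities recorded by the rollout (inference) engine and those recomputed by the training engine's prefill pass. The function name, tensor shapes, and the clip_eps value are illustrative assumptions, not the API of verl or AReaL.

```python
# Hypothetical sketch: token-level PPO clipped loss with an importance ratio
# between rollout-time log-probs (from the inference engine) and training-time
# log-probs (recomputed by the training engine's prefill). Names and shapes are
# illustrative assumptions, not any specific framework's API.
import torch

def ppo_token_loss(train_logprobs: torch.Tensor,   # [B, T] recomputed by the trainer
                   rollout_logprobs: torch.Tensor, # [B, T] logged by the rollout engine
                   advantages: torch.Tensor,       # [B, T] per-token advantage estimates
                   mask: torch.Tensor,             # [B, T] 1 for response tokens, 0 otherwise
                   clip_eps: float = 0.2) -> torch.Tensor:
    # Importance ratio pi_theta / pi_rollout; any rollout/prefill mismatch
    # (different kernels, partitioning, precision) shows up directly here.
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    # Average the loss over valid response tokens only.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In a fully synchronous on-policy setup the ratio stays close to 1 and clipping rarely triggers; with asynchronous or stale rollouts, or when the inference engine's token distribution drifts from the trainer's prefill, the ratio departs from 1, which is exactly the gap the article describes.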
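The producer-consumer pattern described for AReaL-style asynchronous training can be sketched generically as follows. This is a hedged illustration under assumed names (RolloutItem, MAX_STALENESS, generate, train_step), not AReaL's actual interfaces: rollout workers produce trajectories tagged with the policy version that generated them, and the trainer consumes batches while discarding overly stale samples.

```python
# Hypothetical sketch of an asynchronous producer-consumer rollout loop with a
# staleness bound, in the spirit of the pattern described for AReaL. All names
# (RolloutItem, MAX_STALENESS, generate, train_step) are illustrative.
import queue
from dataclasses import dataclass

@dataclass
class RolloutItem:
    trajectory: list      # tokens / (state, action, reward) records
    policy_version: int   # version of the weights that produced it

buffer: "queue.Queue[RolloutItem]" = queue.Queue(maxsize=1024)
current_version = 0
MAX_STALENESS = 2         # accept data at most 2 policy versions old

def producer(generate):
    # Rollout/inference engine: keeps generating regardless of trainer progress.
    while True:
        traj = generate()                       # placeholder for env + model rollout
        buffer.put(RolloutItem(traj, current_version))

def consumer(train_step, batch_size=32):
    global current_version
    while True:
        batch = []
        while len(batch) < batch_size:
            item = buffer.get()
            if current_version - item.policy_version <= MAX_STALENESS:
                batch.append(item)              # fresh enough: keep
            # else: drop (or importance-correct) overly stale samples
        train_step(batch)                       # update weights on this batch
        current_version += 1                    # then push new weights to rollout
```

Producer and consumer would run in separate threads or processes. Partial rollout fits the same pattern: when current_version advances mid-generation, an in-flight trajectory can be paused, its prefix kept, and generation resumed under the new weights rather than being discarded.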
Latest from Tsinghua University! πRL: A General Scheme for Letting Robots "Learn While Doing" via Online Reinforcement Learning
具身智能之心· 2025-11-03 00:03
Core Insights
- The article discusses the breakthrough in adapting Reinforcement Learning (RL) to flow-based Vision-Language-Action (VLA) models, overcoming the limitations of traditional supervised fine-tuning (SFT) and of existing RL approaches [1][3][30]

Group 1: Challenges in Current VLA Model Training
- Current VLA model training faces a dilemma: SFT relies on large volumes of expert trajectories, which are costly to collect and generalize weakly, while existing RL methods cannot accommodate the core characteristics of flow-based models [3][4]
- The core issue is the fundamental barrier to RL adaptation for flow-based VLA models, primarily the difficulty of computing action log-likelihoods through the denoising process [4][5]

Group 2: Innovative Solutions Proposed
- A new framework combining "Flow-Noise and Flow-SDE dual algorithms + parallel simulation training" is proposed to address the RL adaptation challenges of flow-based VLA models [1][5]
- The Flow-Noise algorithm introduces a learnable noise network to optimize the denoising process, while Flow-SDE converts deterministic ODE denoising into a stochastic SDE to balance exploration and efficiency (see the sketch after this summary) [7][9]

Group 3: Performance Improvements
- The proposed methods show significant performance improvements on multi-task benchmarks, achieving near-perfect scores and breaking through the SFT bottleneck [15][16]
- On the LIBERO benchmark, the Flow-Noise and Flow-SDE models achieve average scores of 97.6% and 96.1% respectively, significantly outperforming traditional SFT methods [16][18]

Group 4: Large-Scale Adaptation and Training
- The framework supports large-scale multi-task optimization, demonstrated by handling 4,352 task combinations in the ManiSkill benchmark while maintaining its performance advantage [20][22]
- Training with 320 parallel environments significantly reduces data transmission delays and improves optimization efficiency [17][22]

Group 5: Future Directions
- Future research will focus on optimizing noise injection strategies, improving out-of-distribution (OOD) generalization, and validating the framework's adaptability on real-world robots [29][30]
- Integrating additional multi-modal observations, such as tactile and force feedback, is also suggested to enhance robustness in complex scenarios [29][30]
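To illustrate why moving from deterministic ODE denoising to an SDE can make RL tractable, here is a minimal, hypothetical sketch (not the πRL implementation): a single Euler step of a flow ODE is deterministic and yields no per-step likelihood, whereas an Euler-Maruyama SDE step produces a Gaussian transition whose log-probability is available in closed form. The velocity function, step count, and noise scale sigma are illustrative assumptions.

```python
# Hypothetical sketch (not the piRL code): contrasting a deterministic ODE
# denoising step with a stochastic SDE step whose per-step transition is
# Gaussian, so its log-likelihood is tractable for policy-gradient updates.
import numpy as np

def velocity(a: np.ndarray, t: float) -> np.ndarray:
    # Placeholder for the learned flow/velocity network v_theta(a, t).
    return -a  # assumption: a toy field standing in for the model output

def ode_step(a: np.ndarray, t: float, dt: float) -> np.ndarray:
    # Deterministic Euler step: a_{t+dt} given a_t is a point mass,
    # so there is no per-step log-likelihood to feed into an RL loss.
    return a + velocity(a, t) * dt

def sde_step(a: np.ndarray, t: float, dt: float, sigma: float, rng):
    # Euler-Maruyama step: a_{t+dt} | a_t is Gaussian with mean a + v*dt
    # and std sigma*sqrt(dt), so log p has a closed form.
    mean = a + velocity(a, t) * dt
    std = sigma * np.sqrt(dt)
    a_next = mean + std * rng.standard_normal(a.shape)
    logp = -0.5 * np.sum(((a_next - mean) / std) ** 2
                         + np.log(2 * np.pi * std ** 2))
    return a_next, logp

rng = np.random.default_rng(0)
a = rng.standard_normal(7)           # toy action chunk
total_logp = 0.0
for k in range(10):                  # 10 denoising steps, dt = 0.1
    a, logp = sde_step(a, k * 0.1, 0.1, sigma=0.1, rng=rng)
    total_logp += logp               # summed per-step log-probs give the
                                     # action log-likelihood an RL update needs
```

Summing the per-step Gaussian log-probabilities over the denoising chain is what lets a flow-based policy be treated as a stochastic policy for online RL, which is the adaptation barrier the article describes.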