
The Evolution of RL Infra Architecture, Seen Through Today's Mainstream RL Libraries
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint
- Reinforcement Learning (RL) is shifting from a supporting technology to a core driver of model capability, with the focus moving to multi-step, interactive agent training on the path toward Artificial General Intelligence (AGI) [2][6].

Group 1: Modern RL Infrastructure Architecture
- Modern RL infrastructure is built around two core components: a Generator, which interacts with the environment to produce trajectories and compute rewards, and a Trainer, which updates model parameters from that trajectory data [6][4].
- The generator-trainer split, combined with a distributed coordination layer such as Ray, has become the "gold standard" for RL systems (a minimal sketch of this loop follows the summary) [6][4].

Group 2: Primary Development
- Primary development frameworks are the foundations on which RL training pipelines are built, providing core algorithm implementations and integration with the underlying training/inference engines [8][7].
- TRL (Transformer Reinforcement Learning), launched by Hugging Face, is a user-friendly RL framework with broad algorithm support [9][10].
- OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient and scalable RLHF and agentic RL framework [11][14].
- veRL, developed by ByteDance's Seed team, is one of the most feature-complete frameworks, with extensive algorithm support [16][19].
- AReaL (Asynchronous Reinforcement Learning) targets large-scale, high-throughput RL training with a fully asynchronous architecture [20][21].
- NeMo-RL, launched by NVIDIA, plugs into the broader NeMo ecosystem and focuses on production-grade RL [24][28].
- ROLL, an open-source framework from Alibaba, emphasizes asynchronous and agentic capabilities for large-scale LLM RL [30][33].
- slime, developed by Tsinghua and Zhipu, is a lightweight framework focused on seamless integration of SGLang with Megatron [34][36].

Group 3: Secondary Development
- Secondary development frameworks build on the primary frameworks and target specific downstream scenarios such as multimodal, multi-agent, and GUI automation [44][3].
- Agentic RL frameworks such as verl-agent optimize asynchronous rollout and training, addressing the core challenge of multi-round interaction with external environments [46][47].
- Multimodal RL frameworks such as VLM-R1 and EasyR1 focus on training vision-language reasoning models, tackling data-processing and loss-function design challenges [53][54].
- Multi-agent RL frameworks such as MARTI integrate multi-agent reasoning with reinforcement learning for complex collaborative tasks [59][60].

Group 4: Summary and Trends
- RL infrastructure is evolving from a "workshop" model to a "standardized pipeline," with framework designs becoming increasingly modular [65].
- Asynchronous architectures are becoming essential to cope with the computational asymmetry between rollout and training [66].
- High-performance inference engines such as vLLM and SGLang significantly accelerate the rollout phase [66].
- The evolution from RLHF to agentic RL reflects the growing complexity of the tasks new frameworks must support [66].
- The choice of distributed training backend, such as Megatron-LM or DeepSpeed, is critical for large-scale model training [66].
- Scene-driven secondary development frameworks are addressing the unique challenges of vertical domains [66].
- The importance of an orchestrator for managing distributed components in RL systems is becoming widely recognized [66].
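To make the generator-trainer split concrete, here is a minimal sketch of the loop described above, with Ray actors as the coordination layer. It is an illustration only: the class names, the stubbed rollout/training bodies, and the version-based weight sync are assumptions standing in for a real inference engine (vLLM/SGLang), reward function, and distributed trainer.

```python
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class Generator:
    """Rolls out the current policy and scores trajectories (illustrative stub)."""

    def __init__(self):
        self.weights_version = 0

    def update_weights(self, version: int) -> None:
        # Stand-in for loading new policy weights from the trainer
        # (a real system would move tensors over NCCL or an object store).
        self.weights_version = version

    def rollout(self, prompts):
        # Stand-in for calling an inference engine plus a reward function.
        return [{"prompt": p, "response": "...", "reward": 1.0,
                 "policy_version": self.weights_version} for p in prompts]


@ray.remote
class Trainer:
    """Consumes trajectories and produces a new policy version (illustrative stub)."""

    def __init__(self):
        self.version = 0

    def train_step(self, trajectories) -> int:
        # Stand-in for a PPO/GRPO update with FSDP or Megatron.
        self.version += 1
        return self.version


generator, trainer = Generator.remote(), Trainer.remote()

for step in range(3):
    trajs = ray.get(generator.rollout.remote([f"prompt-{step}"]))   # generate + reward
    new_version = ray.get(trainer.train_step.remote(trajs))         # update parameters
    ray.get(generator.update_weights.remote(new_version))           # sync weights back
    print(f"step {step}: policy now at version {new_version}")
```

Real frameworks differ mainly in how the three arrows of this loop (rollout, update, weight sync) are overlapped or made asynchronous, which is exactly the design axis the article uses to compare them.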
The Evolution and Development Trends of Reinforcement Learning Frameworks
自动驾驶之心· 2025-08-18 23:32
Group 1
- The article discusses the shift from Supervised Fine-Tuning (SFT) to Reinforcement Learning (RL) in model training paradigms, highlighting that RL is becoming increasingly critical for pushing model capability [3][4][8].
- RL algorithms continue to evolve, with newer methods such as GRPO, RLOO, and DAPO focusing on improving stability and sample efficiency [4].
- The RL training loop consists of three main modules: Rollout (policy generation), Reward Evaluation, and Policy Update, each playing a vital role in the training framework [5][6][7].

Group 2
- Designing an RL training framework requires coordinating the Rollout and training modules, which becomes harder as model scale grows and distributed multi-GPU training becomes unavoidable [12][13].
- The diversity of underlying training and inference frameworks complicates parameter synchronization and inference scheduling (a sketch of a bucketed weight sync appears after this summary) [14].
- Performance optimization strategies include data parallelism, tensor parallelism, and pipeline parallelism, each with distinct advantages and limitations [22][24].

Group 3
- The article stresses the importance of efficient data transfer and parameter synchronization between the training framework and the inference engine, and the need for flexible communication strategies [32][39].
- The SLIME and ROLL frameworks are introduced as examples of how data transfer and parameter synchronization can be managed effectively [42][46].
- The integration of Ray for distributed computing is discussed, in particular its role in managing resource allocation and communication in complex RL workloads [48][53].

Group 4
- The article closes with a comparison of RL frameworks such as SLIME, ROLL, and veRL, each catering to different needs and offering distinct features for specific applications [61].
- Because the technology evolves rapidly, framework designs should stay simple and highly maintainable in order to adapt to new trends [58].
- Open-source frameworks are emphasized as a major driver of RL progress, particularly given China's strong position in both engineering capability and practical understanding of RL systems [60].
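As a rough illustration of the parameter-synchronization problem raised in Groups 2 and 3, the sketch below copies updated weights from a training model into an inference copy in fixed-size buckets. The function names, the bucket size, and the in-process `copy_()` are assumptions: in frameworks such as SLIME or ROLL the two sides live in different processes and engines, so each bucket would be shipped via an NCCL broadcast or a shared object store instead.

```python
import torch
from torch import nn


def _flush(bucket, infer_params):
    """Stand-in for shipping one bucket of tensors (e.g. an NCCL broadcast)."""
    with torch.no_grad():
        for name, tensor in bucket:
            infer_params[name].copy_(tensor)


def sync_params(train_model: nn.Module, infer_model: nn.Module, bucket_mb: int = 64):
    """Copy updated weights from the training model into the inference copy,
    grouped into fixed-size buckets so each transfer stays bounded."""
    limit = bucket_mb * 1024 * 1024
    infer_params = dict(infer_model.named_parameters())
    bucket, bucket_bytes = [], 0
    for name, p in train_model.named_parameters():
        bucket.append((name, p.detach()))
        bucket_bytes += p.numel() * p.element_size()
        if bucket_bytes >= limit:
            _flush(bucket, infer_params)
            bucket, bucket_bytes = [], 0
    if bucket:
        _flush(bucket, infer_params)


# Toy usage: two in-process copies of the same small model.
train_m, infer_m = nn.Linear(8, 8), nn.Linear(8, 8)
sync_params(train_m, infer_m)
assert torch.equal(train_m.weight, infer_m.weight)
```

Bucketing is the main knob here: large buckets amortize communication overhead, while small buckets let the sync overlap with ongoing rollout, which is one of the trade-offs the article's comparison of frameworks revolves around.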
Task-Level Rewards Boost App Agents' Reasoning: Taotian Proposes Mobile-R1, a 3B Model That Can Surpass 32B Models
量子位· 2025-07-20 02:49
Core Insights
- The article discusses the limitations of existing Mobile/App agents, which rely primarily on action-level rewards and therefore adapt poorly to dynamic environments [1][2].
- A new interactive reinforcement learning framework, Mobile-R1, is proposed; it incorporates task-level rewards to improve agent adaptability and exploration [5][30].
- Mobile-R1 is trained in three stages: format fine-tuning, action-level training, and task-level training, which together improve the model's performance [6][31].

Summary by Sections
Existing Limitations
- Current Mobile/App agents struggle with real-time adaptability because they depend on action-level rewards, making it difficult to handle changing mobile environments [1][2].
- An example illustrates how existing models fail on complex multi-step tasks [3].

Proposed Solution
- A collaboration between Taotian Group's algorithm team and the Future Life Lab introduces a multi-round, task-oriented learning approach that combines online learning with trajectory correction [4].
- Mobile-R1 is designed around task-level rewards, which are more effective at guiding agents through complex tasks [5].

Training Methodology
- The training process is divided into three stages (a sketch of the group-relative reward computation follows this summary):
  1. **Format fine-tuning**: initial supervised fine-tuning on high-quality trajectory data [16].
  2. **Action-level training**: group relative policy optimization (GRPO) evaluates action correctness with action-level rewards [17].
  3. **Task-level training**: multi-step, task-level training improves the model's generalization and exploration [18][20].

Experimental Results
- Mobile-R1 achieved superior performance across benchmarks, reaching a task success rate of 49.40%, significantly higher than the best baseline model [26].
- The results indicate that the three-stage training pipeline improves robustness and adaptability, especially in dynamic environments [29][30].
- The article concludes that Mobile-R1's combination of interactive reinforcement learning and task-level rewards significantly strengthens vision-language-model-based mobile agents [30][32].
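Since the summary does not spell out Mobile-R1's exact reward formulation, the sketch below shows only the generic GRPO-style group-relative advantage it refers to, applied to a hypothetical 50/50 blend of action-level and task-level rewards; the weighting and the toy numbers are assumptions for illustration, not the paper's method.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize each rollout's reward against the mean/std
    of its group (several rollouts sampled for the same task)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)


# Hypothetical numbers: 2 tasks, 4 rollouts per task.
# Action-level signal: per-step correctness/format score averaged over the rollout.
action_level = torch.tensor([[0.8, 0.2, 0.6, 0.4],
                             [0.5, 0.9, 0.1, 0.7]])
# Task-level signal: 1.0 only if the whole multi-step task succeeded.
task_level = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                           [0.0, 1.0, 0.0, 1.0]])

# One assumed way to blend the two signals before computing advantages;
# Mobile-R1's actual weighting is not specified in the summary.
combined = 0.5 * action_level + 0.5 * task_level
print(group_relative_advantages(combined))
```

The point the article makes is visible even in this toy: adding the task-level term rewards rollouts that finish the whole multi-step task, not just rollouts whose individual actions look correct in isolation.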