Some observations and thoughts on Agentic RL training-and-inference frameworks
自动驾驶之心· 2025-12-16 00:03
Core Viewpoint - The article surveys the current landscape of reinforcement learning (RL) training frameworks, highlighting the diversity and the specific strengths and weaknesses of the open-source options, with a particular focus on the challenges of adapting these frameworks to multi-modal models interacting with real-world environments [2][3].
Summary by Sections
Overview of RL Frameworks - The open-source community offers a wide variety of RL training frameworks, including established ones like OpenRLHF, trl, unsloth, and verl, as well as newer entries like slime, AReaL, RLinf, RL2, and ROLL [2].
Framework Selection Criteria - The author looks for a community-active framework that requires minimal code modification to adapt to a custom environment, ultimately selecting AReaL for its flexibility in handling multi-turn interactions [3].
GPU Management in RL Training - The article discusses the GPU orchestration challenges in RL training, noting that traditional frameworks follow a synchronous training model, which can lead to inefficiencies and wasted resources [5][12].
Data Flow and Structure - Data flow is central to RL training frameworks; verl uses a dedicated data format called DataProto for efficient data transfer, although this can become a burden in agentic RL scenarios [10][11].
Asynchronous vs. Synchronous Training - Asynchronous RL training frameworks are highlighted for their efficiency, but they also introduce challenges such as data staleness (off-policy drift) and higher GPU resource consumption compared to synchronous designs [11][12].
Control Flow in RL Training - The control flow remains primarily on the training side; the training process is similar to standard LLM training and differs mainly in the loss function used [15].
Weight Transfer Between Engines - The article details the complexities of transferring model weights from the training engine to the inference engine, particularly when the two engines use different model partitioning schemes [16][19].
Gaps in RL Training - Two significant gaps are identified: the need for on-policy data in RL training, and the discrepancies in token distributions between rollout and prefill, which complicate the calculation of importance sampling [20][23].
Environment Adaptation and Reward Management - Environment adaptation and reward calculation are central to agentic RL training; different frameworks handle these aspects differently, with AReaL and slime offering the more flexible solutions [24][26].
Asynchronous Training Solutions - AReaL's asynchronous training approach is presented as a mature solution, using a producer-consumer model to manage data flow efficiently (a minimal sketch of this pattern follows this summary) [29][30].
Partial Rollout Management - Partial rollout is introduced as a way to manage in-flight tasks during model weight updates, allowing training to proceed without interrupting ongoing generation [37][38].
Insights on RL Algorithms - The article concludes with reflections on RL algorithms, discussing the challenges of structuring rewards and the potential benefits of staged training [39][40].
Code Complexity and Usability - The author notes the code complexity of frameworks like AReaL and verl, suggesting that while they are well engineered, they pose a steep learning curve for new users [43][44].
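The producer-consumer pattern mentioned above is worth making concrete. The sketch below is a toy, hedged illustration of that pattern, not AReaL's actual API: rollout workers keep producing trajectories tagged with the policy version they saw, while the trainer consumes a batch, drops samples that are too stale, and bumps the version after each update. All names, thresholds, and the queue-based wiring are illustrative assumptions.

```python
# Minimal producer-consumer sketch of asynchronous RL rollout/training (illustrative only).
import queue
import threading
import random
import time

MAX_STALENESS = 2          # drop trajectories older than this many policy versions (assumed)
trajectory_queue = queue.Queue(maxsize=64)
policy_version = 0         # bumped by the trainer after each weight update
stop_event = threading.Event()

def rollout_worker(worker_id: int) -> None:
    """Producer: simulate multi-turn rollouts tagged with the policy version used."""
    while not stop_event.is_set():
        version_used = policy_version           # snapshot before generation starts
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for multi-turn generation
        trajectory_queue.put({
            "worker": worker_id,
            "version": version_used,
            "reward": random.random(),
        })

def trainer(num_updates: int, batch_size: int) -> None:
    """Consumer: build batches from fresh-enough trajectories, then update weights."""
    global policy_version
    for _ in range(num_updates):
        batch = []
        while len(batch) < batch_size:
            traj = trajectory_queue.get()
            if policy_version - traj["version"] <= MAX_STALENESS:
                batch.append(traj)              # fresh enough: keep for this update
            # else: drop the overly off-policy trajectory
        policy_version += 1                      # stand-in for an actual weight update
        print(f"update {policy_version}: trained on {len(batch)} trajectories")

workers = [threading.Thread(target=rollout_worker, args=(i,), daemon=True) for i in range(4)]
for w in workers:
    w.start()
trainer(num_updates=3, batch_size=8)
stop_event.set()
```

The staleness check is also where partial rollout would plug in: an in-flight trajectory started under an old version can be finished and either kept or discarded depending on how far the policy has moved.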
X @The Economist
The Economist· 2025-12-02 14:05
Product Offering - The company is offering 24 pots of slime or packets of pork scratchings [1]
Z Tech | The LMSYS team releases Miles, a large-scale MoE reinforcement learning framework: without accumulating small steps, there is no reaching a thousand miles
Z Potentials· 2025-11-20 04:12
Core Insights - The article introduces Miles, a new reinforcement learning framework designed for enterprise-level large-scale MoE training and production workloads, developed by the LMSYS team as a fork of the lightweight framework slime [1][4].
Group 1: Framework Features
- Miles inherits the lightweight, modular design principles of slime, making it a preferred tool for model scientists exploring algorithms [3].
- It implements Infrastructure-level True On-Policy to eliminate discrepancies between training and inference, achieving bit-wise consistency [5].
- The framework introduces speculative training through MTP Online Training, yielding over 25% rollout acceleration [3][9].
Group 2: Memory Optimization
- Miles incorporates advanced memory management techniques to maximize GPU utilization without triggering out-of-memory (OOM) errors [8].
- It features online SFT for draft models, which prevents a decline in acceptance length during training [9].
- The framework includes mechanisms to avoid benign OOM errors and implements memory-margin strategies to address NCCL-related OOM issues [10].
Group 3: Technical Upgrades
- Miles supports full-stack optimization for SGLang and Megatron, keeping up with the rapid iteration of training and inference frameworks [6].
- The modular design lets researchers modify components such as algorithms, data, sampling, and evaluation with minimal code changes [6].
- It provides a user-friendly interface that lets model scientists adjust importance sampling or loss dynamics without touching lower-level code (a minimal sketch of such a pluggable loss hook follows this summary) [6].
Group 4: Future Development
- The LMSYS team plans to enhance the FSDP backend for better stability in large-scale distributed training [14].
- Future work includes independent rollout deployment, additional debugging tools, and formal mathematical verification for SFT/RL scripts [14].
- The roadmap also aims to support next-generation hardware such as GB300 and to expand multi-modal training capabilities [18].
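To make the "adjust importance sampling or loss dynamics without lower-level code" point concrete, here is a hypothetical sketch of what such a pluggable loss hook could look like. Miles' real interface is not shown in the summary, so the function names and signatures below are assumptions; the default hook is a standard token-level clipped importance-sampling (PPO-style) objective that a researcher could swap out without touching the training engine.

```python
# Hypothetical pluggable loss hook (names and signatures are illustrative assumptions).
import numpy as np

def clipped_is_loss(logp_new: np.ndarray,
                    logp_old: np.ndarray,
                    advantages: np.ndarray,
                    clip_eps: float = 0.2) -> float:
    """PPO-style surrogate: ratio = exp(logp_new - logp_old), clipped around 1."""
    ratio = np.exp(logp_new - logp_old)                 # importance sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

def train_step(batch: dict, loss_fn=clipped_is_loss) -> float:
    """The engine only needs a scalar loss back from whatever hook is plugged in."""
    return loss_fn(batch["logp_new"], batch["logp_old"], batch["advantages"])

# Toy batch: six tokens' worth of log-probs and advantages.
rng = np.random.default_rng(0)
batch = {
    "logp_old": rng.normal(-2.0, 0.1, size=6),
    "logp_new": rng.normal(-2.0, 0.1, size=6),
    "advantages": rng.normal(0.0, 1.0, size=6),
}
print("loss:", train_step(batch))
```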
Top large-model minds gather at a hardcore open-source meetup: the SGLang community holds its first Meetup in China
机器之心· 2025-10-28 06:29
Core Insights - The PyTorch Conference 2025 showcased a vibrant community and significant developments in deep learning, highlighting SGLang's contributions and potential in the industry [1][3][4].
SGLang Overview - SGLang, an open-source high-performance inference engine for large language models and vision-language models, originated from RadixAttention and is incubated by the non-profit organization LMSYS. It offers low-latency, high-throughput inference across environments ranging from a single GPU to large distributed clusters [7][8].
Community Engagement - The first Meetup in Beijing, co-hosted by SGLang, Meituan, and Amazon Web Services, attracted numerous contributors, developers, and scholars, indicating a strong community presence and development potential [4][8].
Technical Developments - The Meetup featured technical talks on SGLang's architecture, including advances in KV cache, Piecewise CUDA Graph, and speculative decoding, aimed at improving efficiency and compatibility (a toy sketch of the speculative-decoding draft-verify loop follows this summary) [21][22]. SGLang's quantization strategies were also discussed, with a focus on broadening the range of applications and optimizing model performance [34][35].
Application and Practice - Several industry applications of SGLang were presented, including its integration with Baidu's ERNIE 4.5 model for large-scale deployment and optimization in search scenarios [41][42]. The use of SGLang in WeChat search was also highlighted, emphasizing the need for high throughput and low latency in the user experience [44].
Future Directions - The SGLang roadmap includes further integration with various hardware and software stacks to improve stability and compatibility across platforms [22][35]. The Specforge framework, developed by the SGLang team to accelerate large language model inference, has been adopted by major companies such as Meituan and NVIDIA [57][58].
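For readers unfamiliar with the speculative decoding mentioned above, the toy below shows only the structure of the draft-verify loop: a cheap draft model proposes several tokens, the expensive target model verifies them, and the longest accepted prefix plus one corrective token is kept. Both "models" here are stand-in functions; this is not SGLang's or Specforge's implementation.

```python
# Toy draft-verify loop for (greedy) speculative decoding; models are fake stand-ins.
import random

VOCAB = list(range(100))

def draft_model(prefix, k=4):
    """Cheap model: propose k tokens quickly (here: pseudo-random, deterministic per prefix)."""
    rng = random.Random(hash(tuple(prefix)) % (2**31))
    return [rng.choice(VOCAB) for _ in range(k)]

def target_model_next(prefix):
    """Expensive model: one greedy token per call (here: a fixed pseudo-random choice)."""
    rng = random.Random((hash(tuple(prefix)) + 7) % (2**31))
    return rng.choice(VOCAB)

def speculative_step(prefix, k=4):
    """Draft k tokens, verify them in order, keep the accepted prefix plus one fix-up token."""
    draft = draft_model(prefix, k)
    accepted = []
    for tok in draft:
        if target_model_next(prefix + accepted) == tok:
            accepted.append(tok)          # draft token matches the target's greedy choice
        else:
            break
    accepted.append(target_model_next(prefix + accepted))  # always make progress
    return prefix + accepted

seq = [1, 2, 3]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)
```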
The first open-source, 100% reproducible and stable RL training framework is here! Two runs produce exactly identical results
量子位· 2025-09-27 01:30
Core Insights - The article covers the SGLang and slime teams' achievement of a fully reproducible, stable reinforcement learning (RL) training framework based on the Qwen3-8B model, addressing the problem of non-deterministic outputs in large language model (LLM) inference [1][2][6].
Group 1: Deterministic Inference
- The SGLang and slime teams have built a deterministic inference solution that integrates batch-invariant operators, CUDA Graph, radix cache, and chunked prefill, preserving key features while maintaining high performance [5][8].
- Batch-invariant operators address the core source of output non-determinism in LLM inference, which arises from varying batch sizes under dynamic batching (a toy illustration of this effect follows this summary) [7][8].
- In testing, the average performance drop of SGLang's solution is 34.35%, significantly better than the 61.5% decline reported by Thinking Machines Lab [5][12].
Group 2: Performance Metrics
- The article presents performance metrics for different inference modes, showing that deterministic modes yield consistent outputs across varying batch sizes, with the number of unique outputs significantly reduced [10][11].
- For end-to-end latency, deterministic inference shows a performance drop of 25% to 45%, with specific backend configurations showing improvements [12][13].
Group 3: Future Developments
- Future work will focus on optimizing batch-invariant operators to improve performance, particularly for RL inference, and on extending support to mixture-of-experts (MoE) models [16][18].
- The team also aims to improve radix cache functionality and explore tensor parallelism to further strengthen deterministic inference [18].
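Why does batch size affect outputs at all? Floating-point addition is not associative, so a kernel that tiles its reductions differently for different batch shapes can return slightly different bits for the same row. The NumPy cartoon below illustrates the effect and what "batch-invariant" means; it is only a hedged analogy, not SGLang's CUDA kernels.

```python
# Toy illustration: reduction order depends on tiling, so different "batch shapes" can
# change the last bits of a float32 sum; a batch-invariant operator fixes the tiling.
import numpy as np

rng = np.random.default_rng(42)
row = rng.standard_normal(4096).astype(np.float32)

def chunked_sum(x: np.ndarray, chunk: int) -> np.float32:
    """Reduce in fixed-size chunks, mimicking a kernel's tiling choice."""
    partials = [x[i:i + chunk].sum(dtype=np.float32) for i in range(0, len(x), chunk)]
    total = np.float32(0.0)
    for p in partials:
        total = np.float32(total + p)
    return total

# Batch-size-dependent tiling: different chunk sizes may differ in the last few bits.
print(chunked_sum(row, 128), chunked_sum(row, 512))

def batch_invariant_sum(x: np.ndarray) -> np.float32:
    """'Batch-invariant' version: always reduce with the same chunk size."""
    return chunked_sum(x, chunk=256)

print(batch_invariant_sum(row) == batch_invariant_sum(row))  # bit-wise identical every call
```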
Discussing the evolution of RL infra architecture through today's mainstream RL libraries
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint - Reinforcement learning (RL) is transitioning from a supporting technology to a core driver of model capabilities, focusing on multi-step, interactive agent training on the path toward Artificial General Intelligence (AGI) [2][6].
Group 1: Modern RL Infrastructure Architecture
- The core components of modern RL infrastructure are a Generator, which interacts with the environment to produce trajectories and compute rewards, and a Trainer, which updates model parameters from the trajectory data [6][4].
- The generator-trainer architecture, combined with a distributed coordination layer such as Ray, forms the "gold standard" for RL systems (a minimal sketch of this split follows this summary) [6][4].
Group 2: Primary Development
- Primary-development frameworks are the foundations for building RL training pipelines, providing core algorithm implementations and integration with the underlying training and inference engines [8][7].
- TRL (Transformer Reinforcement Learning) is a user-friendly RL framework from Hugging Face with support for a range of algorithms [9][10].
- OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient and scalable RLHF and agentic RL framework [11][14].
- veRL, developed by ByteDance's Seed team, is one of the most comprehensive frameworks, with extensive algorithm support [16][19].
- AReaL (Asynchronous Reinforcement Learning) is designed for large-scale, high-throughput RL training with a fully asynchronous architecture [20][21].
- NeMo-RL, launched by NVIDIA and integrated into the broader NeMo ecosystem, targets production-level RL [24][28].
- ROLL, an Alibaba open-source framework, emphasizes asynchronous and agentic capabilities for large-scale LLM RL [30][33].
- slime, developed by Tsinghua and Zhipu, is a lightweight framework focused on seamless integration of SGLang with Megatron [34][36].
Group 3: Secondary Development
- Secondary-development frameworks are built on top of the primary frameworks and target specific downstream scenarios such as multi-modal, multi-agent, and GUI automation [44][3].
- Agentic RL frameworks such as verl-agent optimize asynchronous rollout and training, addressing the core challenge of multi-round interaction with external environments [46][47].
- Multi-modal RL frameworks such as VLM-R1 and EasyR1 focus on training vision-language reasoning models, tackling data processing and loss-function design challenges [53][54].
- Multi-agent RL frameworks such as MARTI integrate multi-agent reasoning and reinforcement learning for complex collaborative tasks [59][60].
Group 4: Summary and Trends
- RL infrastructure is evolving from a "workshop" model to a "standardized pipeline", with framework designs becoming increasingly modular [65].
- Asynchronous architectures are becoming essential to address the computational asymmetry between rollout and training [66].
- High-performance inference engines such as vLLM and SGLang significantly accelerate the rollout stage [66].
- The evolution from RLHF to agentic RL reflects the growing complexity of tasks the new frameworks must support [66].
- The choice of distributed training framework, such as Megatron-LM or DeepSpeed, is critical for large-scale model training [66].
- Scenario-driven secondary-development frameworks address the unique challenges of vertical domains [66].
- The importance of orchestrators for managing the distributed components of RL systems is becoming widely recognized [66].
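The generator-trainer split with Ray as the coordination layer can be sketched in a few lines. The class and method names below are illustrative, not taken from any of the frameworks above; a real system would plug an inference engine into the Generator, a training engine into the Trainer, and ship actual weights between them instead of a version counter.

```python
# Minimal generator-trainer skeleton coordinated by Ray actors (illustrative only).
import random
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class Generator:
    """Interacts with the environment and produces (trajectory, reward) pairs."""
    def __init__(self):
        self.weights_version = 0

    def sync_weights(self, version: int) -> None:
        self.weights_version = version          # stand-in for receiving new weights

    def rollout(self, prompt: str) -> dict:
        response = f"{prompt} -> response@v{self.weights_version}"
        return {"traj": response, "reward": random.random(), "version": self.weights_version}

@ray.remote
class Trainer:
    """Consumes trajectories and updates model parameters."""
    def __init__(self):
        self.version = 0

    def update(self, batch: list) -> int:
        mean_reward = sum(t["reward"] for t in batch) / len(batch)
        self.version += 1                       # stand-in for an optimizer step
        print(f"step {self.version}: mean reward {mean_reward:.3f}")
        return self.version

generators = [Generator.remote() for _ in range(2)]
trainer = Trainer.remote()

for step in range(3):
    batch = ray.get([g.rollout.remote(f"prompt-{step}") for g in generators])
    new_version = ray.get(trainer.update.remote(batch))
    ray.get([g.sync_weights.remote(new_version) for g in generators])
```

The same skeleton covers both synchronous and asynchronous designs: run rollout and update in lockstep as here, or let the generators produce continuously into a queue as in the producer-consumer sketch earlier in this digest.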
Zhipu finally releases the GLM-4.5 technical report: details revealed from pre-training to post-training
机器之心· 2025-08-11 07:12
Core Viewpoint - The article highlights the release of GLM-4.5 and GLM-4.5-Air, which unify reasoning, coding, and agentic capabilities in a single model and achieve the highest ranking among domestic and open-source models across 12 global benchmarks [2][11][19].
Group 1: Model Performance and Reception
- GLM-4.5 placed third in the global rankings across 12 recognized benchmarks, outperforming all domestic and open-source models [2][19].
- The model's announcement drew significant attention, with over 1.2 million views on social media, and it topped the Hugging Face trending list for seven consecutive days [2][3].
- The GLM-4.5 technical report was voted the #1 Paper of the day by Hugging Face users [13].
Group 2: Technical Innovations
- GLM-4.5 uses a Mixture-of-Experts (MoE) architecture, improving computational efficiency during training and inference (a minimal sketch of top-k expert routing follows this summary) [21][24].
- The model follows a distinctive training recipe, including pre-training on 15 trillion tokens and mid-training on 7 trillion tokens, with the maximum sequence length extended from 4K to 128K [25][27].
- The slime framework is introduced to support efficient reinforcement learning training, addressing common bottlenecks in agentic tasks [31][34].
Group 3: Key Capabilities
- GLM-4.5 integrates three core capabilities: agentic ability for real-world interaction, complex reasoning for multi-step problem-solving, and advanced coding skills for software engineering tasks [22][19].
- The model's agentic performance was evaluated against competitors, showing superior results on benchmarks such as TAU-bench and BFCL V3 [44].
- In reasoning tasks, GLM-4.5 outperformed OpenAI's models on several benchmarks, including AIME 24 and SciCode [47][50].
Group 4: Code Task Performance
- GLM-4.5 excelled on code-related benchmarks, outperforming GPT-4.1 and Claude Sonnet 4 on SWE-bench Verified and Terminal-Bench [52][53].
- Its overall coding performance positions it as a strong competitor to Claude Sonnet 4 [53].
Group 5: Future Implications
- The release of the technical report sheds light on the development direction of domestic open-source large models and serves as a key reference for future research [56][57].
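For context on the MoE architecture mentioned above, the sketch below shows the generic top-k expert routing that makes MoE layers compute-efficient: each token is sent to only k of the available experts and their outputs are mixed by gate weight. The dimensions, k, and gating details here are illustrative assumptions and are not taken from the GLM-4.5 report.

```python
# Generic top-k MoE routing sketch (dimensions and gating are illustrative, not GLM-4.5's).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.standard_normal((d_model, n_experts)).astype(np.float32)
experts = [rng.standard_normal((d_model, d_model)).astype(np.float32) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = x @ router_w                                   # (tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]       # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top_idx[t]
        gate = np.exp(logits[t, sel] - logits[t, sel].max())
        gate = gate / gate.sum()                            # softmax over selected experts only
        for weight, e in zip(gate, sel):
            out[t] += weight * (x[t] @ experts[e])          # only k experts run per token
    return out

tokens = rng.standard_normal((4, d_model)).astype(np.float32)
print(moe_layer(tokens).shape)  # (4, 16): same shape, but only 2 of 8 experts used per token
```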