Workflow
slime
icon
Search documents
聊聊关于 Agentic RL 训推框架的一点看法和思考
自动驾驶之心· 2025-12-16 00:03
作者 | 浮生梦晓@知乎 转自 | 对比现有的 RL 训练框架! 我一直想找一个社区活跃度比较高,对于环境适配代码相对修改较少的框架,这里直接说,最后选择了 AReaL。 (我的具体业务环境不展开说了,简单来说是需要每个训练样本都有不同的环境状态,除了模型的输出内容去环境里 执行动作以外,还需要框架会话与环境多次交互,这一点就卡死了大部分 RL 框架的 agent loop 控制流,当然除非 做侵入式代码修改,但框架更新后 rebase 又很麻烦) 原文链接: https://zhuanlan.zhihu.com/p/1979237927641949997 点击下方 卡片 ,关注" 大模型之心Tech "公众号 戳我-> 领取大模型巨卷干货 本文只做学术分享,已获转载授权 ,欢迎添加小助理微信AIDriver004做进一步咨询 前 段 时 间 调 研 了 一 些 RL 训 练 框 架 , 目 前 开 源 社 区 的 RL 训 练 框 架 可 以 说 百 花 齐 放 , 老 牌 的 有 openlhf 、 trl 、 unsloth、verl。还有今年新开源的 slime、AReaL、Rlinf、RL2、ROL ...
X @The Economist
The Economist· 2025-12-02 14:05
Product Offering - The company is offering 24 pots of slime or packets of pork scratchings [1]
Z Tech | LMSYS 团队发布大规模  MoE 强化学习框架 Miles,不积跬步无以至千里
Z Potentials· 2025-11-20 04:12
Core Insights - The article introduces Miles, a new reinforcement learning framework designed for enterprise-level large-scale MoE training and production workloads, developed by the LMSYS team as a fork of the lightweight framework slime [1][4]. Group 1: Framework Features - Miles inherits the lightweight and modular design principles of slime, making it a preferred tool for model scientists exploring algorithms [3]. - It implements Infrastructure-level True On-Policy to eliminate discrepancies between training and inference, achieving bit-wise consistency [5]. - The framework introduces speculative training through MTP Online Training, resulting in over 25% rollout acceleration [3][9]. Group 2: Memory Optimization - Miles incorporates advanced memory management techniques to maximize GPU performance without triggering out-of-memory (OOM) errors [8]. - It features online SFT for Draft Models, which enhances performance by preventing a decline in acceptance length during training [9]. - The framework includes mechanisms to avoid benign OOM errors and implements memory margin strategies to address NCCL-related OOM issues [10]. Group 3: Technical Upgrades - Miles supports full-stack optimization for SGLang and Megatron, ensuring compatibility with rapid iterations in training and inference frameworks [6]. - The modular design allows researchers to easily modify components like algorithms, data, sampling, and evaluation with minimal code changes [6]. - It provides a user-friendly interface for model scientists, allowing them to adjust important sampling or loss dynamics without delving into lower-level code [6]. Group 4: Future Development - The LMSYS team plans to enhance the FSDP backend for improved stability in large-scale distributed training [14]. - Future developments include independent rollout deployment, additional debugging tools, and formal mathematical verification for SFT/RL scripts [14]. - The roadmap also aims to support next-generation hardware like GB300 and expand capabilities for multi-modal training [18].
大模型优秀大脑齐聚硬核开源聚会,SGLang社区举办国内首次Meetup
机器之心· 2025-10-28 06:29
Core Insights - The Pytorch Conference 2025 showcased the vibrant community and significant developments in deep learning, particularly highlighting SGLang's contributions and potential in the industry [1][3][4]. SGLang Overview - SGLang, an open-source high-performance inference engine for large language models and visual language models, originated from RadixAttention and is incubated by the non-profit organization LMSYS. It offers low latency and high throughput inference across various environments, from single GPUs to large distributed clusters [7][8]. Community Engagement - The first Meetup event in Beijing, co-hosted by SGLang, Meituan, and Amazon Web Services, attracted numerous contributors, developers, and scholars, indicating a strong community presence and development potential [4][8]. Technical Developments - The Meetup featured technical discussions on SGLang's architecture, including advancements in KV Cache, Piecewise CUDA Graph, and Spec Decoding, aimed at improving efficiency and compatibility [21][22]. - SGLang's quantization strategies were also discussed, focusing on expanding application range and optimizing model performance [34][35]. Application and Practice - Various industry applications of SGLang were presented, including its integration with Baidu's Ernie 4.5 model for large-scale deployment and optimization in search scenarios [41][42]. - The application of SGLang in WeChat's search function was highlighted, emphasizing the need for high throughput and low latency in user experience [44]. Future Directions - The roadmap for SGLang includes further integration with various hardware and software solutions, aiming to enhance stability and compatibility across different platforms [22][35]. - The Specforge framework, developed by the SGLang team, aims to accelerate large language model inference and has been adopted by major companies like Meituan and NVIDIA [57][58].
首个开源实现100%可复现的稳定RL训练框架来了!2次结果完全重合
量子位· 2025-09-27 01:30
Core Insights - The article discusses the achievement of SGLang and slime teams in creating a fully reproducible and stable reinforcement learning (RL) training framework based on the Qwen3-8B model, addressing the issue of non-deterministic outputs in large language model (LLM) inference [1][2][6]. Group 1: Deterministic Inference - SGLang and slime teams have developed a deterministic inference solution that integrates batch invariant operators, CUDA Graph, radix cache, and chunked prefill, ensuring high performance while maintaining compatibility with key features [5][8]. - The implementation of batch invariant operators addresses the core issue of output uncertainty in LLM inference, which arises from varying batch sizes during dynamic batching [7][8]. - Testing has shown that the average performance drop for SGLang's solution is 34.35%, significantly better than the 61.5% decline reported by Thinking Machines Lab [5][12]. Group 2: Performance Metrics - The article presents performance metrics for different inference modes, showing that deterministic modes yield consistent outputs across various batch sizes, with unique output counts significantly reduced [10][11]. - In terms of end-to-end latency, deterministic inference shows a performance drop of 25% to 45%, with specific backend performance metrics indicating improvements in certain configurations [12][13]. Group 3: Future Developments - Future efforts will focus on optimizing batch invariant operators to enhance performance, particularly for RL inference, and expanding support to mixture of experts (MoE) models [16][18]. - The team aims to improve radix cache functionality and explore tensor parallelism to further enhance the capabilities of deterministic inference [18].
从现有主流 RL 库来聊聊RL Infra架构演进
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint - Reinforcement Learning (RL) is transitioning from a supportive technology to a core driver of model capabilities, focusing on multi-step, interactive agent training to achieve General Artificial Intelligence (AGI) [2][6]. Group 1: Modern RL Infrastructure Architecture - The core components of modern RL infrastructure include a Generator, which interacts with the environment to generate trajectories and calculate rewards, and a Trainer, which updates model parameters based on trajectory data [6][4]. - The generator-trainer architecture, combined with distributed coordination layers like Ray, forms the "gold standard" for RL systems [6][4]. Group 2: Primary Development - Primary Development frameworks serve as foundational frameworks for building RL training pipelines, providing core algorithm implementations and integration with underlying training/inference engines [8][7]. - TRL (Transformer Reinforcement Learning) is a user-friendly RL framework launched by Hugging Face, offering various algorithm supports [9][10]. - OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient and scalable RLHF and Agentic RL framework [11][14]. - veRL, developed by Byte's Seed team, is one of the most comprehensive frameworks with extensive algorithm support [16][19]. - AReaL (Asynchronous Reinforcement Learning) is designed for large-scale, high-throughput RL training with a fully asynchronous architecture [20][21]. - NeMo-RL, launched by NVIDIA, integrates into its extensive NeMo ecosystem, focusing on production-level RL frameworks [24][28]. - ROLL, an Alibaba open-source framework, emphasizes asynchronous and Agentic capabilities for large-scale LLM RL [30][33]. - slime, developed by Tsinghua and Zhipu, is a lightweight framework focusing on seamless integration of SGLang with Megatron [34][36]. Group 3: Secondary Development - Secondary Development frameworks are built on primary frameworks, targeting specific downstream application scenarios like multi-modal, multi-agent, and GUI automation [44][3]. - Agentic RL frameworks, such as verl-agent, optimize for asynchronous rollout and training, addressing the core challenges of multi-round interactions with external environments [46][47]. - Multimodal RL frameworks, like VLM-R1 and EasyR1, focus on training visual-language reasoning models, addressing data processing and loss function design challenges [53][54]. - Multi-Agent RL frameworks, such as MARTI, integrate multi-agent reasoning and reinforcement learning for complex collaborative tasks [59][60]. Group 4: Summary and Trends - The RL infrastructure is evolving from a "workshop" model to a "standardized pipeline," with increasing modularity in framework design [65]. - Asynchronous architectures are becoming essential to address the computational asymmetry between rollout and training [66]. - The emergence of high-performance inference engines like vLLM and SGLang significantly accelerates the rollout process [66]. - The evolution from RLHF to Agentic RL reflects the growing complexity of tasks supported by new frameworks [66]. - Distributed training framework choices, such as Megatron-LM and DeepSpeed, are critical for large-scale model training [66]. - Scene-driven secondary development frameworks are addressing unique challenges in vertical domains [66]. - The importance of orchestrators for managing distributed components in RL systems is becoming widely recognized [66].
智谱终于发布GLM-4.5技术报告,从预训练到后训练,细节大公开
机器之心· 2025-08-11 07:12
Core Viewpoint - The article highlights the release of GLM-4.5 and GLM-4.5-Air, which integrate reasoning, coding, and agentic capabilities into a single model, achieving the highest ranking among domestic and open-source models in 12 global benchmarks [2][11][19]. Group 1: Model Performance and Reception - GLM-4.5 achieved third place in global rankings across 12 recognized benchmarks, outperforming all domestic and open-source models [2][19]. - The model's announcement generated significant attention, with over 1.2 million views on social media and topping the Hugging Face trends for seven consecutive days [2][3]. - The technical report for GLM-4.5 was voted as the "1 Paper of the day" by Hugging Face users [13]. Group 2: Technical Innovations - GLM-4.5 employs a MoE (Mixture of Experts) architecture, enhancing computational efficiency during training and inference [21][24]. - The model features a unique training process, including pre-training on 15 trillion tokens and mid-training on 7 trillion tokens, with a maximum sequence length expanded from 4K to 128K [25][27]. - The introduction of the slime framework supports efficient reinforcement learning training, addressing common bottlenecks in agentic tasks [31][34]. Group 3: Key Capabilities - GLM-4.5 integrates three core capabilities: agentic ability for real-world interaction, complex reasoning for multi-step problem-solving, and advanced coding skills for software engineering tasks [22][19]. - The model's performance in agentic tasks was evaluated against competitors, showing superior results in benchmarks like TAU-bench and BFCL V3 [44]. - In reasoning tasks, GLM-4.5 outperformed OpenAI's models in several benchmarks, including AIME 24 and SciCode [47][50]. Group 4: Code Task Performance - GLM-4.5 excelled in code-related benchmarks, outperforming GPT-4.1 and Claude Sonnet 4 in SWE-bench Verified and Terminal-Bench [52][53]. - The model's overall performance in coding tasks positions it as a strong competitor to Claude Sonnet 4 [53]. Group 5: Future Implications - The release of the technical report provides insights into the development direction for domestic open-source large models, serving as a key reference for future research [56][57].