slime
Top LLM Minds Gather at a Hardcore Open-Source Get-Together: SGLang Community Holds Its First Meetup in China
机器之心· 2025-10-28 06:29
Core Insights
- The PyTorch Conference 2025 showcased a vibrant community and significant developments in deep learning, with SGLang's contributions and industry potential drawing particular attention [1][3][4].
SGLang Overview
- SGLang is an open-source, high-performance inference engine for large language models and vision-language models. It originated from RadixAttention and is incubated by the non-profit organization LMSYS. It offers low-latency, high-throughput inference across environments ranging from single GPUs to large distributed clusters [7][8].
Community Engagement
- The first Meetup, held in Beijing and co-hosted by SGLang, Meituan, and Amazon Web Services, attracted numerous contributors, developers, and scholars, indicating a strong community presence and development potential [4][8].
Technical Developments
- The Meetup featured technical discussions of SGLang's architecture, including advances in KV Cache, Piecewise CUDA Graph, and speculative decoding aimed at improving efficiency and compatibility [21][22].
- SGLang's quantization strategies were also discussed, with a focus on widening the range of applications and optimizing model performance [34][35].
Application and Practice
- Several industry applications of SGLang were presented, including its integration with Baidu's Ernie 4.5 model for large-scale deployment and optimization in search scenarios [41][42].
- The use of SGLang in WeChat's search function was highlighted, emphasizing that the user experience there depends on high throughput and low latency [44].
Future Directions
- The SGLang roadmap includes further integration with a range of hardware and software solutions, aiming to improve stability and compatibility across platforms [22][35].
- The SpecForge framework, developed by the SGLang team, aims to accelerate large language model inference and has been adopted by major companies such as Meituan and NVIDIA [57][58].
The First Open-Source, 100% Reproducible, Stable RL Training Framework Is Here: Two Runs Match Exactly
量子位· 2025-09-27 01:30
Core Insights
- The article describes how the SGLang and slime teams built a fully reproducible, stable reinforcement learning (RL) training framework based on the Qwen3-8B model, addressing the problem of non-deterministic outputs in large language model (LLM) inference [1][2][6].
Group 1: Deterministic Inference
- The SGLang and slime teams developed a deterministic inference solution that integrates batch-invariant operators, CUDA Graph, radix cache, and chunked prefill, keeping performance high while remaining compatible with these key features [5][8].
- Batch-invariant operators address the core source of output non-determinism in LLM inference: under dynamic batching, the batch size varies from one run to the next, and the results vary with it [7][8].
- In testing, SGLang's solution showed an average performance drop of 34.35%, significantly smaller than the 61.5% decline reported by Thinking Machines Lab [5][12].
Group 2: Performance Metrics
- The article presents metrics for different inference modes, showing that the deterministic modes yield consistent outputs across batch sizes, with the number of unique outputs significantly reduced [10][11].
- In end-to-end latency, deterministic inference shows a performance drop of 25% to 45%, with backend-specific metrics indicating improvements in certain configurations [12][13].
Group 3: Future Developments
- Future work will focus on optimizing the batch-invariant operators to improve performance, particularly for RL inference, and on extending support to mixture-of-experts (MoE) models [16][18].
- The team also aims to improve radix cache functionality and explore tensor parallelism to extend the capabilities of deterministic inference [18].
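The batch-size dependence described under Group 1 comes down to floating-point addition not being associative: if a kernel changes how it splits a reduction depending on the batch, the same request can produce different bits. The following is a minimal pure-Python sketch of that idea, not the SGLang implementation. The function names and the chunk-size heuristic in `naive_reduce` are hypothetical, chosen only to make the effect visible.

```python
def fold(values):
    """Strict left-to-right float accumulation (no compensated summation)."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then reduce the partial sums in order."""
    partials = [fold(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return fold(partials)


def naive_reduce(values, batch_size):
    """Hypothetical kernel whose split strategy depends on the batch size."""
    chunk_size = 2 if batch_size == 1 else len(values)
    return chunked_sum(values, chunk_size)


def batch_invariant_reduce(values, batch_size, chunk_size=2):
    """Same reduction order for every batch size; batch_size is ignored on purpose."""
    return chunked_sum(values, chunk_size)


# Values picked so that rounding makes the summation order matter.
values = [1e16, 1.0, -1e16, 1.0]

print(naive_reduce(values, 1), naive_reduce(values, 8))                      # 0.0 1.0
print(batch_invariant_reduce(values, 1), batch_invariant_reduce(values, 8))  # 0.0 0.0
```

The batch-invariant version gives up the freedom to pick the fastest split for each batch shape, which is one plausible reason a deterministic mode costs throughput.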
The Evolution of RL Infra Architecture, as Seen Through Today's Mainstream RL Libraries
自动驾驶之心· 2025-09-25 23:33
Core Viewpoint
- Reinforcement Learning (RL) is shifting from a supporting technology to a core driver of model capabilities, with a focus on multi-step, interactive agent training as a path toward Artificial General Intelligence (AGI) [2][6].
Group 1: Modern RL Infrastructure Architecture
- Modern RL infrastructure has two core components: a Generator, which interacts with the environment to produce trajectories and compute rewards, and a Trainer, which updates model parameters from that trajectory data [6][4].
- The generator-trainer architecture, combined with a distributed coordination layer such as Ray, forms the "gold standard" for RL systems [6][4].
Group 2: Primary Development
- Primary Development frameworks are the foundation for building RL training pipelines, providing core algorithm implementations and integration with the underlying training and inference engines [8][7].
- TRL (Transformer Reinforcement Learning) is a user-friendly RL framework from Hugging Face that supports a range of algorithms [9][10].
- OpenRLHF, developed by a collaborative team including ByteDance and NetEase, aims to provide an efficient and scalable framework for RLHF and Agentic RL [11][14].
- veRL, developed by ByteDance's Seed team, is one of the most comprehensive frameworks, with extensive algorithm support [16][19].
- AReaL (Asynchronous Reinforcement Learning) is designed for large-scale, high-throughput RL training with a fully asynchronous architecture [20][21].
- NeMo-RL, launched by NVIDIA, is part of its extensive NeMo ecosystem and targets production-grade RL [24][28].
- ROLL, an open-source framework from Alibaba, emphasizes asynchronous and Agentic capabilities for large-scale LLM RL [30][33].
- slime, developed by Tsinghua and Zhipu, is a lightweight framework focused on seamless integration of SGLang with Megatron [34][36].
Group 3: Secondary Development
- Secondary Development frameworks are built on top of primary frameworks and target specific downstream scenarios such as multi-modal, multi-agent, and GUI automation [44][3].
- Agentic RL frameworks, such as verl-agent, optimize for asynchronous rollout and training, addressing the core challenges of multi-round interaction with external environments [46][47].
- Multimodal RL frameworks, such as VLM-R1 and EasyR1, focus on training vision-language reasoning models and address challenges in data processing and loss function design [53][54].
- Multi-Agent RL frameworks, such as MARTI, combine multi-agent reasoning with reinforcement learning for complex collaborative tasks [59][60].
Group 4: Summary and Trends
- RL infrastructure is evolving from a "workshop" model to a "standardized pipeline," with framework designs becoming increasingly modular [65].
- Asynchronous architectures are becoming essential for handling the computational asymmetry between rollout and training [66].
- High-performance inference engines such as vLLM and SGLang significantly accelerate the rollout process [66].
- The evolution from RLHF to Agentic RL reflects the growing complexity of the tasks that new frameworks support [66].
- The choice of distributed training framework, such as Megatron-LM or DeepSpeed, is critical for large-scale model training [66].
- Scenario-driven secondary development frameworks are addressing challenges unique to vertical domains [66].
- The importance of orchestrators for managing distributed components in RL systems is becoming widely recognized [66].
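The generator-trainer split and the asynchronous decoupling of rollout from training described above can be sketched with a bounded queue between two coroutines. This is a toy under stated assumptions, not the design of any framework listed here: the one-parameter Bernoulli policy, the reward (1.0 when the action is 1), the REINFORCE-style update on `p`, and all names are hypothetical, and a shared dict stands in for weight synchronization.

```python
import asyncio
import random


async def generator(queue, policy, num_trajectories, seed=0):
    """Roll out the current policy and push trajectories to the trainer."""
    rng = random.Random(seed)
    for _ in range(num_trajectories):
        await asyncio.sleep(0)  # stand-in for slow inference / environment calls
        action = 1 if rng.random() < policy["p"] else 0
        await queue.put({"action": action,
                         "reward": float(action),
                         "policy_version": policy["version"]})
    await queue.put(None)  # sentinel: no more trajectories


async def trainer(queue, policy, batch_size, lr=0.05):
    """Consume trajectories in batches and update the shared policy."""
    batch = []
    while (item := await queue.get()) is not None:
        batch.append(item)
        if len(batch) < batch_size:
            continue
        p = policy["p"]
        # REINFORCE-style estimate: reward * d/dp log pi(action)
        grad = sum(t["reward"] * (t["action"] / p - (1 - t["action"]) / (1 - p))
                   for t in batch) / batch_size
        policy["p"] = min(0.99, max(0.01, p + lr * grad))
        policy["version"] += 1
        batch.clear()


async def _main(num_trajectories, batch_size):
    policy = {"p": 0.5, "version": 0}
    queue = asyncio.Queue(maxsize=8)  # bounded queue gives back-pressure
    await asyncio.gather(generator(queue, policy, num_trajectories),
                         trainer(queue, policy, batch_size))
    return policy


def run(num_trajectories=16, batch_size=4):
    return asyncio.run(_main(num_trajectories, batch_size))
```

Calling `run()` returns the final policy dict. Each trajectory records the `policy_version` it was sampled under, which is the information an asynchronous system would need to reason about stale (off-policy) data.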
Zhipu Finally Releases the GLM-4.5 Technical Report: Details Disclosed from Pre-Training to Post-Training
机器之心· 2025-08-11 07:12
Core Viewpoint
- The article covers the release of GLM-4.5 and GLM-4.5-Air, which integrate reasoning, coding, and agentic capabilities into a single model and rank highest among domestic (Chinese) and open-source models across 12 global benchmarks [2][11][19].
Group 1: Model Performance and Reception
- GLM-4.5 placed third globally across 12 recognized benchmarks, outperforming all domestic and open-source models [2][19].
- The announcement drew significant attention, with over 1.2 million views on social media and seven consecutive days at the top of the Hugging Face trends [2][3].
- The GLM-4.5 technical report was voted "#1 Paper of the day" by Hugging Face users [13].
Group 2: Technical Innovations
- GLM-4.5 uses a MoE (Mixture of Experts) architecture, which improves computational efficiency during training and inference [21][24].
- The training process includes pre-training on 15 trillion tokens and mid-training on 7 trillion tokens, with the maximum sequence length extended from 4K to 128K [25][27].
- The slime framework supports efficient reinforcement learning training and addresses common bottlenecks in agentic tasks [31][34].
Group 3: Key Capabilities
- GLM-4.5 integrates three core capabilities: agentic ability for real-world interaction, complex reasoning for multi-step problem-solving, and advanced coding skills for software engineering tasks [22][19].
- On agentic tasks, the model was evaluated against competitors and showed superior results on benchmarks such as TAU-bench and BFCL V3 [44].
- On reasoning tasks, GLM-4.5 outperformed OpenAI's models on several benchmarks, including AIME 24 and SciCode [47][50].
Group 4: Code Task Performance
- GLM-4.5 excelled on code-related benchmarks, outperforming GPT-4.1 and Claude Sonnet 4 on SWE-bench Verified and Terminal-Bench [52][53].
- Its overall coding performance positions it as a strong competitor to Claude Sonnet 4 [53].
Group 5: Future Implications
- The technical report offers insight into the development direction of domestic open-source large models and serves as a key reference for future research [56][57].
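The efficiency claim for the MoE architecture under Group 2 rests on each token being processed by only a few experts. The summary does not describe GLM-4.5's actual routing, so the following is only a generic top-k softmax gating sketch in plain Python. The scalar "experts", the gate logits, and the function names are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs.

    Returns the mixed output and the indices of the experts that were run.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalise over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top


# Four toy experts; only two of them run for any given input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
output, chosen = moe_forward(2.0, [0.0, 2.0, 2.0, -1.0], experts, top_k=2)
print(output, chosen)  # 5.0 [1, 2]
```

With four experts and `top_k=2`, half of the expert compute is skipped per input. Parameter count can therefore grow with the number of experts while per-token compute stays roughly fixed.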