Z Tech | LMSYS Team Releases Miles, a Large-Scale MoE Reinforcement Learning Framework: "Without Accumulating Small Steps, One Cannot Reach a Thousand Miles"
Z Potentials · 2025-11-20 04:12

Core Insights
- The article introduces Miles, a new reinforcement learning framework designed for enterprise-level large-scale MoE training and production workloads, developed by the LMSYS team as a fork of the lightweight framework slime [1][4].

Group 1: Framework Features
- Miles inherits slime's lightweight, modular design principles, making it a preferred tool for model scientists exploring algorithms [3].
- It implements infrastructure-level True On-Policy to eliminate discrepancies between training and inference, achieving bit-wise consistency [5] (see Sketch 1 after this summary).
- The framework introduces speculative training through MTP (multi-token prediction) online training, yielding over 25% rollout acceleration [3][9].

Group 2: Memory Optimization
- Miles incorporates advanced memory-management techniques to maximize GPU utilization without triggering out-of-memory (OOM) errors [8].
- It features online SFT for draft models, which prevents acceptance length from declining as the policy shifts during training [9] (see Sketch 2 below).
- The framework includes mechanisms to avoid benign OOM errors and applies memory-margin strategies to address NCCL-related OOM issues [10] (see Sketch 3 below).

Group 3: Technical Upgrades
- Miles supports full-stack optimization for SGLang and Megatron, ensuring compatibility with the rapid iteration of training and inference frameworks [6].
- The modular design allows researchers to modify components such as algorithms, data, sampling, and evaluation with minimal code changes [6].
- It provides a user-friendly interface that lets model scientists adjust importance sampling or loss dynamics without delving into lower-level code [6] (see Sketch 4 below).

Group 4: Future Development
- The LMSYS team plans to strengthen the FSDP backend for more stable large-scale distributed training [14].
- Future developments include independent rollout deployment, additional debugging tools, and formal mathematical verification for SFT/RL scripts [14].
- The roadmap also targets next-generation hardware such as GB300 and expanded multi-modal training capabilities [18].
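Sketch 1: bit-wise on-policy check. The article states that Miles achieves bit-wise consistency between training and inference [5] but does not show the mechanism. The following is a minimal illustrative sketch, not Miles's actual code, of how such consistency could be asserted; `inference_logprobs` and `training_logprobs` are hypothetical per-token log-probs gathered from the two engines for the same rollout tokens.

```python
import torch

def assert_bitwise_consistency(inference_logprobs: torch.Tensor,
                               training_logprobs: torch.Tensor) -> None:
    """Assert that per-token log-probs from the inference engine and the
    training engine are bit-wise identical, not merely numerically close.

    Both tensors are assumed (hypothetically) to be float32 log-probs for
    the same rollout tokens, already gathered onto one device.
    """
    # Compare raw bit patterns: a plain float comparison would treat
    # -0.0 == 0.0 as equal and NaN == NaN as unequal, so reinterpret
    # the floats as integers instead.
    inf_bits = inference_logprobs.view(torch.int32)
    train_bits = training_logprobs.view(torch.int32)
    mismatched = (inf_bits != train_bits).sum().item()
    if mismatched:
        raise AssertionError(
            f"{mismatched} / {inference_logprobs.numel()} token log-probs "
            "differ at the bit level; training is effectively off-policy."
        )
```

A check of this shape is strictly stronger than a tolerance-based `torch.allclose`, which is the point of the "True On-Policy" claim: any bit-level drift between the two stacks is surfaced immediately rather than absorbed as noise.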
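Sketch 2: online SFT for the draft model. Per [9], Miles keeps the speculative draft model's acceptance length from declining by fine-tuning it online against the moving target policy. The step below is a hedged sketch under assumed interfaces — `draft_model`, `target_hidden`, and `target_tokens` are hypothetical names, not the framework's real API.

```python
import torch
import torch.nn.functional as F

def online_draft_sft_step(draft_model, target_tokens, target_hidden,
                          optimizer) -> float:
    """One hypothetical online-SFT step for a speculative-decoding draft
    model: fit the draft head to tokens the target model actually produced
    during rollout, so acceptance length tracks the moving policy.

    target_tokens : (batch, seq) token ids sampled by the target model.
    target_hidden : (batch, seq, d) hidden states fed to the draft head.
    """
    logits = draft_model(target_hidden)               # (batch, seq, vocab)
    # Next-token objective: position t predicts the target's token t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_tokens[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Running a step like this on each fresh rollout batch keeps the draft distribution aligned with the policy being trained, which is what prevents the acceptance-length decay the article describes.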
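Sketch 3: memory margin against NCCL OOM. The article mentions memory-margin strategies for NCCL-related OOM [10] without detail. One plausible reading, sketched below with an assumed 2 GiB margin, is to keep headroom outside PyTorch's caching allocator so NCCL's internal buffers can still be allocated; the margin value and the guard function are illustrative, not Miles's implementation.

```python
import torch

# Hypothetical margin: keep ~2 GiB free so NCCL's internal buffers, which
# the PyTorch caching allocator does not account for, still have room.
NCCL_MEMORY_MARGIN_BYTES = 2 * 1024**3

def ensure_memory_margin(margin: int = NCCL_MEMORY_MARGIN_BYTES) -> None:
    """Best-effort guard: release cached blocks back to the CUDA driver
    when free device memory falls below the configured margin."""
    free, _total = torch.cuda.mem_get_info()
    if free < margin:
        # Returning cached but unused blocks to the driver often averts
        # an OOM raised inside NCCL rather than inside PyTorch itself.
        torch.cuda.empty_cache()
        free, _total = torch.cuda.mem_get_info()
        if free < margin:
            raise RuntimeError(
                f"Only {free / 1024**3:.1f} GiB free on device; below the "
                f"{margin / 1024**3:.1f} GiB margin reserved for NCCL."
            )
```

Calling such a guard before collective-heavy phases turns an opaque NCCL allocation failure into an explicit, diagnosable error, which matches the article's theme of avoiding "benign" OOMs.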
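Sketch 4: user-swappable importance-sampling loss. The article says researchers can adjust importance sampling or loss dynamics without touching lower-level code [6]. As an illustration of what such a pluggable loss might look like — not Miles's actual interface — here is a standard PPO-style clipped objective written as a free function a researcher could swap in for a framework default.

```python
import torch

def clipped_is_policy_loss(new_logprobs: torch.Tensor,
                           old_logprobs: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped importance-sampling loss over per-token log-probs.

    All argument names are illustrative. `old_logprobs` come from the
    rollout (behavior) policy, `new_logprobs` from the current policy.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)    # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the pessimistic (minimum) objective; negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```

The design point the article emphasizes is that the researcher edits only a function like this one, while distributed execution, rollout plumbing, and memory management stay in the framework's lower layers.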