Deterministic Inference
The first open-source framework for 100% reproducible, stable RL training is here! Results from two runs match exactly
量子位· 2025-09-27 01:30
Core Insights
- The article discusses the achievement of the SGLang and slime teams in building a fully reproducible, stable reinforcement learning (RL) training framework based on the Qwen3-8B model, addressing the issue of non-deterministic outputs in large language model (LLM) inference [1][2][6].

Group 1: Deterministic Inference
- The SGLang and slime teams have developed a deterministic inference solution that integrates batch-invariant operators, CUDA Graph, radix cache, and chunked prefill, ensuring high performance while maintaining compatibility with key features [5][8].
- Batch-invariant operators address the core source of output uncertainty in LLM inference, which arises from varying batch sizes during dynamic batching (illustrated in the sketch after this summary) [7][8].
- Testing shows that the average performance drop for SGLang's solution is 34.35%, significantly better than the 61.5% decline reported by Thinking Machines Lab [5][12].

Group 2: Performance Metrics
- Reported metrics for different inference modes show that deterministic modes yield consistent outputs across batch sizes, with the number of unique outputs significantly reduced [10][11].
- In end-to-end latency, deterministic inference shows a performance drop of 25% to 45%, with certain backend configurations showing improvements [12][13].

Group 3: Future Developments
- Future work will focus on optimizing batch-invariant operators to improve performance, particularly for RL inference, and on extending support to mixture-of-experts (MoE) models [16][18].
- The team also aims to improve radix cache functionality and explore tensor parallelism to further extend the capabilities of deterministic inference [18].
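As a rough illustration of the batch-size effect described above, the sketch below (plain PyTorch, not SGLang or slime code) shows why dynamic batching can change outputs: summing the same float32 values with different tilings can give bitwise-different results, because floating-point addition is not associative. Batch-invariant operators fix the reduction strategy so it no longer depends on how requests happen to be batched.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.float32)

# Reference: a single sequential reduction over all elements.
ref = x.sum()

# Tiled reductions: split into tiles, reduce each tile, then combine the
# partial sums -- mimicking a kernel that picks a different split strategy
# when a larger batch gives it more work to parallelize over.
for num_tiles in (2, 8, 64):
    tiled = x.view(num_tiles, -1).sum(dim=1).sum()
    print(f"{num_tiles:>2} tiles, bitwise equal to reference: "
          f"{bool(torch.equal(ref, tiled))}")
```

The partial sums are mathematically equal to the reference, but the last bits can differ from tiling to tiling; pinning one tiling per operator is what makes the output independent of batch size.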
Thinking Machines Lab, which has raised $2 billion, makes its first public disclosure: cracking LLM randomness to achieve reproducible "deterministic" inference
锦秋集· 2025-09-11 09:19
Core Insights
- The article attributes the reproducibility problem in large language models (LLMs) not to "concurrent computation and floating-point errors" but to the lack of "batch invariance" in core computational operators [1][7][11].

Group 1: Problem Identification
- Inference servers dynamically batch user requests, so a request's result depends on the batch size and composition, introducing inherent uncertainty [1][29].
- The article challenges the common belief that floating-point non-associativity alone is the primary cause of uncertainty, arguing that the real issue lies in how kernel functions are implemented [20][21].

Group 2: Proposed Solutions
- The authors rewrite key computational modules of the Transformer (RMSNorm, matrix multiplication, and the attention mechanism) so that they possess batch invariance, making each request's computation independent of batch size [2][34].
- Experimental results show that with the new approach repeated requests yield identical results, whereas previously 1,000 identical requests produced 80 different outputs [2][75].

Group 3: Technical Implementation
- The article details the implementation of batch-invariant RMSNorm, matrix multiplication, and attention, emphasizing reduction strategies that do not depend on batch size (see the sketch after this summary) [34][47][62].
- It highlights the difficulty of maintaining batch invariance in attention, where the reduction order must stay consistent regardless of the number of tokens processed [66][72].

Group 4: Performance Analysis
- The batch-invariant kernels show roughly a 20% performance loss compared to cuBLAS but remain efficient enough for LLM inference [59][78].
- Although the batch-invariant implementation is not yet optimized, it is viable for practical applications [78].

Group 5: Implications for Reinforcement Learning
- Deterministic inference enables true on-policy RL by ensuring consistent results between training and inference [79][83].
- Achieving bitwise-identical results between the sampler and the trainer is crucial for effective RL training [80].
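To make the batch-invariant RMSNorm idea concrete, here is a minimal eager-mode PyTorch sketch (the function name and tile size are illustrative assumptions, not the Triton kernels described in the article): each row's sum of squares is accumulated tile by tile in a fixed order that depends only on the hidden size, so a request's output does not change when other requests join the batch.

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6, tile: int = 256) -> torch.Tensor:
    """RMSNorm whose per-row reduction order is fixed by the hidden size only."""
    batch, hidden = x.shape
    # Accumulate each row's sum of squares in float32, tile by tile,
    # in an order determined solely by `hidden` and `tile`.
    acc = torch.zeros(batch, dtype=torch.float32, device=x.device)
    for start in range(0, hidden, tile):
        chunk = x[:, start:start + tile].float()
        acc += (chunk * chunk).sum(dim=1)
    inv_rms = torch.rsqrt(acc / hidden + eps)
    return (x.float() * inv_rms[:, None] * weight.float()).to(x.dtype)

x = torch.randn(8, 1024, dtype=torch.float32)
w = torch.ones(1024)
alone = rmsnorm_batch_invariant(x[:1], w)    # row 0 normalized by itself
batched = rmsnorm_batch_invariant(x, w)      # row 0 normalized inside a batch
# Expected True on typical backends: the per-row reduction path does not
# change with batch size, so the outputs are bitwise identical.
print(torch.equal(alone[0], batched[0]))
```

The article's production version enforces the same property at the kernel level (and does the analogous fix for matrix multiplication and attention, where the reduction order over tokens is the harder case); this sketch only shows the invariance being checked end to end.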