Batch Invariance

Thinking Machines Lab, Fresh off a $2 Billion Raise, Makes Its First Public Release: Cracking LLM Randomness for Reproducible "Deterministic" Inference
锦秋集· 2025-09-11 09:19
Core Insights
- The article discusses the fundamental issue of reproducibility in large language models (LLMs), attributing the uncertainty in inference results not to "concurrent computation and floating-point error" but to the lack of "batch invariance" in the core computational kernels [1][7][11].

Group 1: Problem Identification
- Inference servers dynamically batch user requests, so a request's result depends on the size and composition of the batch it lands in, which introduces inherent uncertainty [1][29]; a small demonstration of this batch-size dependence follows this summary.
- The article challenges the common belief that floating-point non-associativity is the primary cause of the uncertainty, arguing that the real issue lies in how the kernels are implemented [20][21].

Group 2: Proposed Solutions
- The authors propose rewriting the key computational modules of the Transformer (RMSNorm, matrix multiplication, and the attention mechanism) so that each is "batch invariant", meaning its numerics do not depend on batch size [2][34].
- Experimental results demonstrate that with the new approach, repeated requests yield identical results, whereas the previous method produced 80 distinct outputs from 1,000 identical requests [2][75].

Group 3: Technical Implementation
- The article details the implementation of batch-invariant RMSNorm, matrix multiplication, and attention, emphasizing reduction strategies that remain fixed regardless of batch size [34][47][62].
- It highlights the challenges of maintaining batch invariance, particularly in attention, where the reduction order must stay consistent no matter how many tokens are being processed [66][72].

Group 4: Performance Analysis
- The batch-invariant kernels show roughly a 20% performance loss compared to cuBLAS, but remain efficient enough for practical LLM inference [59][78].
- The article notes that although the batch-invariant implementation is not yet optimized, it is viable for real workloads [78].

Group 5: Implications for Reinforcement Learning
- Deterministic inference enables true on-policy reinforcement learning (RL) by ensuring consistent results between training and inference [79][83].
- Achieving bitwise-identical results between the sampler and the trainer is crucial for effective RL training [80].
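
The batch-size dependence described above is easy to probe with a few lines of PyTorch. The sketch below is an illustration written for this summary, not code from the article; the shapes, dtype, and seed are arbitrary assumptions. It multiplies the same row by the same weight matrix once as a batch of one and once inside a larger batch; on many GPU kernels the two results need not be bitwise identical, because the kernel may pick a different tiling and reduction strategy for each shape.

```python
# Minimal sketch of batch-size dependence in matrix multiplication.
# Shapes, dtype, and seed are illustrative assumptions, not from the article.
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(2048, 4096, dtype=torch.bfloat16, device=device)  # a batch of activations
W = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)  # a weight matrix

out_single = x[:1] @ W       # the first row processed as a batch of 1
out_batched = (x @ W)[:1]    # the same row processed inside the full batch

# Mathematically identical; numerically they may differ, because nothing
# forces the matmul kernel to use the same reduction order for both shapes.
print("bitwise identical:", torch.equal(out_single, out_batched))
print("max abs diff:", (out_single.float() - out_batched.float()).abs().max().item())
```

Whether the two outputs actually differ depends on the backend and shapes; the article's point is that nothing in a standard kernel guarantees they will match.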
Mira Murati's Startup Publishes Its First Long-Form Post, Tackling the Nondeterminism Problem in LLM Inference
Founder Park· 2025-09-11 07:17
Core Insights
- The article discusses the challenge of achieving reproducibility in large language model (LLM) inference: identical inputs can yield different outputs, and this cannot be fully explained by the probabilistic nature of the sampling process [10][11].
- It introduces the concept of "batch invariance" in LLM inference, emphasizing that results should be consistent regardless of batch size or concurrent requests [35][40].

Group 1
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has launched a blog series called "Connectionism" to share insights from its AI research [3][8].
- The blog's first article addresses nondeterminism in LLM inference, explaining that even with the temperature set to 0, results can still vary [10][12].
- The article identifies floating-point non-associativity and concurrency as the commonly cited causes of the uncertainty in LLM outputs [13][24]; a small demonstration of non-associativity appears after this summary.

Group 2
- The article explains that the assumption of "concurrency + floating-point" as the sole reason for nondeterminism is incomplete, as many LLM operations are in fact run-to-run deterministic [14][16].
- It stresses the importance of understanding how GPU kernels are implemented, since the lack of synchronization among processing cores can make the order of operations unpredictable [25][29].
- It emphasizes that most LLM operations do not require atomic addition, a frequent source of nondeterminism, so the forward pass can already produce run-to-run consistent outputs [32][33].

Group 3
- The concept of batch invariance is explored, showing that LLM inference results can be affected by batch size and the order of operations, leading to inconsistencies [36][40].
- The article outlines strategies for achieving batch invariance in key operations such as RMSNorm, matrix multiplication, and attention, ensuring that outputs remain consistent regardless of batch size [42][60][64].
- It concludes with a demonstration of deterministic inference using batch-invariant kernels, showing that consistent outputs can be achieved with the right implementation [74][78].
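
As a concrete illustration of the floating-point non-associativity mentioned above, here is a minimal sketch (a standard textbook example, not taken from the article) showing that merely regrouping a sum changes the result:

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed by 1e20, so the total collapses to 0.0
right = a + (b + c)  # 1e20 cancels first, so 0.1 survives

print(left)   # 0.0
print(right)  # 0.1
```

When a kernel changes how it partitions a reduction, for example because the batch size changed, it is effectively changing this grouping, which is why identical inputs can come back with different bits.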
Just Now: Thinking Machines Lab Publishes Its First Long-Form Post, Revealing the Truth About Nondeterminism in LLM Inference
机器之心· 2025-09-11 03:36
Core Viewpoint
- The article discusses the challenges of achieving reproducibility in large language models (LLMs): the lack of batch invariance leads to nondeterministic outputs even under controlled conditions [10][41][46].

Group 1: Introduction to the Issue
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, published its first article, addressing nondeterminism in LLM inference [1][3].
- The blog aims to cover a wide range of topics related to the lab's research, including numerical computation and prompt engineering [3].

Group 2: Understanding Nondeterminism
- Reproducibility is a cornerstone of scientific progress, yet obtaining consistent results from LLMs is difficult [10].
- Even with the temperature parameter set to 0, LLM APIs can still produce nondeterministic outputs [11].
- The nondeterminism is commonly attributed to floating-point non-associativity and concurrency, which affect the order of operations in GPU computations [13][30].

Group 3: The Root Cause of Nondeterminism
- The article argues that the common "concurrency plus floating-point" explanation does not fully account for the issue [14][30].
- Floating-point non-associativity means the result depends on the order of operations, especially in parallel computations [19][26].
- The actual implementation of the kernels used in LLM inference is what produces the nondeterministic behavior observed [27][30].

Group 4: Batch Invariance
- The lack of batch invariance is identified as the key factor behind nondeterministic LLM outputs [41][46].
- Changing the batch size can change the result for the same input, which should not happen for a pure mathematical function [43].
- Ensuring that the kernels are batch invariant is therefore crucial for consistent outputs in LLM inference [46].

Group 5: Solutions for Achieving Determinism
- The article outlines strategies to implement batch invariance in key operations such as RMSNorm, matrix multiplication, and attention; a minimal RMSNorm sketch follows this summary [49][60][71].
- By ensuring that these operations do not depend on batch size, LLM inference can produce consistent results [46][81].
- The authors provide a demonstration of deterministic inference using their batch-invariant kernel library [82].

Group 6: Performance Considerations
- Initial tests indicate that while the batch-invariant kernels are not fully optimized, they do not cause catastrophic performance degradation [89].
- The article highlights the importance of preserving performance while achieving deterministic outputs [88].

Group 7: Implications for Reinforcement Learning
- Deterministic inference can enable true on-policy reinforcement learning by ensuring consistent outputs between training and inference [90].
- This consistency is essential for keeping the training and sampling processes aligned in reinforcement learning [90].

Group 8: Conclusion
- The article advocates a proactive approach to understanding and eliminating the sources of nondeterminism in LLMs, encouraging the community to pursue reproducibility in AI systems [93].
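
To make the batch-invariance strategy concrete for RMSNorm, here is a minimal sketch (an illustration under stated assumptions, not Thinking Machines' kernel): each row's sum of squares is accumulated over fixed-size chunks of the hidden dimension in a fixed order, so the floating-point reduction order never depends on how many rows are in the batch. The function name, chunk size, and epsilon are illustrative choices.

```python
import torch

def rmsnorm_batch_invariant(x: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-6, chunk: int = 256) -> torch.Tensor:
    """RMSNorm over the last dimension of x with shape (batch, hidden).

    The per-row sum of squares is accumulated chunk by chunk in a fixed,
    batch-independent order, so a row is reduced the same way whether it is
    processed alone or inside a large batch."""
    hidden = x.shape[-1]
    acc = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, hidden, chunk):          # fixed chunk order per row
        piece = x[..., start:start + chunk].float()
        acc = acc + (piece * piece).sum(dim=-1)
    inv_rms = torch.rsqrt(acc / hidden + eps)
    return (x.float() * inv_rms.unsqueeze(-1) * weight.float()).to(x.dtype)

x = torch.randn(8, 1024, dtype=torch.bfloat16)
w = torch.ones(1024, dtype=torch.bfloat16)
row_alone = rmsnorm_batch_invariant(x[:1], w)
row_in_batch = rmsnorm_batch_invariant(x, w)[:1]
print("bitwise identical:", torch.equal(row_alone, row_in_batch))
```

Whether the two results match bitwise also relies on the backend's per-chunk sum itself being batch-independent; the kernels described in the article enforce this guarantee at a lower level.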
Valued at 84 Billion, These Women Just Released Their First AI Result
量子位· 2025-09-11 01:58
Core Insights
- Thinking Machines, valued at $12 billion, has released its first research blog post, focused on overcoming nondeterminism in large language model (LLM) inference [1][51].
- The research highlights the difficulty of reproducing LLM outputs and attributes it to the lack of batch invariance [3][12].

Group 1: Research Focus
- The main theme of the post is "Defeating Nondeterminism in LLM Inference", which addresses why LLM inference results are often not reproducible [3][8].
- The root cause identified is the lack of batch invariance: the output for a single request is influenced by how many other requests share its batch [14][15].

Group 2: Technical Findings
- Floating-point non-associativity combined with concurrent execution can produce different results in LLM inference, but the post argues this explanation is incomplete [9][10].
- The real issue is the lack of batch invariance: because the server dynamically adjusts batch sizes during deployment, the computation order of key operations changes from request to request [15][16].

Group 3: Proposed Solutions
- To achieve batch invariance, the post proposes fixing the reduction order in operations such as RMSNorm and matrix multiplication, regardless of batch size [18][19].
- The proposed method uses a single kernel configuration for all input shapes, avoiding switches between parallelization strategies as the batch size changes, even at the cost of roughly 20% performance [22][21].

Group 4: Experimental Validation
- Three kinds of experiments validate the approach: verification of inference determinism, performance measurement, and a real on-policy reinforcement learning application [25]; a sketch of the determinism check appears after this summary.
- With batch-invariant kernels, 1,000 identical requests produced 1,000 identical outputs, i.e. fully deterministic inference, whereas the non-invariant kernels produced 80 distinct results [27][28].

Group 5: Company Background
- Thinking Machines was co-founded by Mira Murati, former CTO of OpenAI, and its team includes notable figures from the AI industry, primarily from OpenAI [36][38][46].
- The company recently closed a $2 billion seed round, a record for AI funding, and is now valued at $12 billion despite not yet having a product [51][50].
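
The determinism check described in Group 4 can be expressed as a tiny harness. The sketch below is a hypothetical illustration, not the article's experiment code: the `generate` callable, the client in the commented usage, and the prompt are all assumptions. It sends the same prompt many times at temperature 0 and counts how many distinct completions come back; with batch-invariant kernels the count should be exactly 1.

```python
from collections import Counter
from typing import Callable

def count_unique_completions(generate: Callable[[str], str],
                             prompt: str, n: int = 1000) -> Counter:
    """Issue the same prompt n times and tally the distinct completions.

    With batch-invariant kernels the Counter should hold a single entry;
    without them it may hold many (the article reports 80 out of 1,000)."""
    return Counter(generate(prompt) for _ in range(n))

# Hypothetical usage with some inference client (placeholder, not a real API):
# counts = count_unique_completions(
#     lambda p: client.complete(p, temperature=0),
#     "Describe batch invariance in one paragraph.",
# )
# print(len(counts), "distinct completions")
```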