Mira Murati's Startup Publishes Its First Long-Form Article, Tackling Non-Determinism in LLM Inference
Founder Park·2025-09-11 07:17

Core Insights
- The article examines why reproducibility is hard to achieve in large language model (LLM) inference: even with identical inputs, outputs can differ, and the probabilistic sampling process is only part of the story [10][11]
- It introduces the concept of "batch invariance" in LLM inference: a result should not depend on the batch size or on which other requests happen to be processed concurrently [35][40]

Group 1
- Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has launched a blog series called "Connectionism" to share its AI research [3][8]
- The blog's first article addresses non-determinism in LLM inference, explaining that results can vary even with the temperature set to 0 [10][12]
- It identifies floating-point non-associativity and concurrency as the commonly cited factors behind the uncertainty in LLM outputs [13][24]

Group 2
- The article argues that "concurrency + floating-point" as the sole explanation for non-determinism is incomplete, since many GPU kernels used in LLMs are in fact run-to-run deterministic [14][16]
- It stresses the importance of understanding how GPU kernels are implemented, because processing cores that run without synchronization can finish in an unpredictable order [25][29]
- Most LLM operations do not require atomic addition, a frequent source of non-determinism, so the forward pass can produce identical results across runs [32][33]

Group 3
- The concept of batch invariance is explored in depth: inference results can depend on the batch size and the resulting order of reduction operations, leading to inconsistencies [36][40]
- The article outlines strategies to make key operations (RMSNorm, matrix multiplication, and attention) batch-invariant, so that outputs stay consistent regardless of batch size [42][60][64]
- It concludes with a demonstration of deterministic inference built on batch-invariant kernels, showing that consistent outputs can be achieved with the right implementation [74][78]
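The floating-point non-associativity mentioned above is easy to see in a few lines. This is a generic illustration of the effect, not code from the article:

```python
# Floating-point addition rounds after every step, so grouping matters:
# (a + b) + c and a + (b + c) can disagree.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is absorbed into 1e20 first, then cancelled: 0.0
right = a + (b + c)  # 1e20 cancels first, so 0.1 survives: 0.1
print(left, right)   # prints: 0.0 0.1
```

Because each intermediate sum is rounded to the nearest representable value, the order in which a GPU kernel happens to accumulate terms changes the final bits.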
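Why would batch size change a result at all? A reduction that is partitioned differently (for example, across a different number of cores when the server is busier) accumulates its terms in a different order. A hypothetical pure-Python sketch of that effect, with `chunked_sum` standing in for a partitioned kernel reduction:

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

def chunked_sum(values, chunk):
    # Sum in fixed-size chunks, then combine the partial sums. This mimics
    # a kernel that partitions a reduction differently depending on load.
    partials = [sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return sum(partials)

# Different chunk sizes reorder the additions, so the rounded results
# need not match bit-for-bit (they are close, but not guaranteed equal).
print(chunked_sum(xs, 16))
print(chunked_sum(xs, 1024))
```

The two printed values typically differ in their last few digits, which is exactly the kind of divergence that then gets amplified over thousands of autoregressive decoding steps.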
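The remedy for RMSNorm-style operations is to give every row a reduction order that does not depend on what else is in the batch. A minimal sketch of that idea, using a hypothetical `rmsnorm_rows` helper rather than the article's actual kernel:

```python
def rmsnorm_rows(batch, eps=1e-6):
    # Batch-invariant sketch: each row is reduced in a fixed left-to-right
    # order, independent of how many rows share the batch, so a given
    # row's output never changes with batch size.
    out = []
    for row in batch:
        sq = 0.0
        for x in row:          # fixed sequential reduction order
            sq += x * x
        rms = (sq / len(row) + eps) ** 0.5
        out.append([x / rms for x in row])
    return out

row = [1.0, 2.0, 3.0, 4.0]
# Same row, batch of 1 vs. batch of 3: bit-identical output.
assert rmsnorm_rows([row])[0] == rmsnorm_rows([row, row, row])[0]
```

A real batch-invariant kernel makes the same guarantee while still parallelizing, typically by fixing the reduction split size per row instead of scaling it with the batch.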