Low-Precision Training
Why Does BF16 FlashAttention "Blow Up" Training? Tsinghua Gives the First Mechanistic Explanation and Stabilizes Training with a Minimal Change
机器之心· 2026-03-03 23:19
Core Insights
- The article examines instability in low-precision training, particularly in BF16, and finds that FlashAttention does not produce random bugs: under specific conditions it triggers systematic numerical biases that lead to loss explosions [1][4].

Group 1: Background and Importance
- Low-precision training has become an industry necessity, with BF16/FP16 widely used to improve training efficiency, but pushing precision to its limits can destabilize training [2][3].
- FlashAttention is a critical component for training long-context models, yet it has been associated with reproducible but unexplained failure cases reported over the years, with no clear mechanism linking numerical errors to loss explosions [4].

Group 2: Research Methodology
- The authors rigorously reproduced the failures on GPT-2, eliminating randomness by recording and replaying the same sequence of data batches [6].
- Using spectral norms and other metrics, they narrowed the problem to specific layers and attention heads, tracing the instability to a particular intermediate quantity in FlashAttention's backward pass [7].

Group 3: Mechanism of Failure
- Similar low-rank structures can amplify numerical errors, turning them into persistent biases rather than mere noise; this leads to abnormal growth in weight updates and ultimately to loss explosions [8][9].
- A key observation concerns systematic biases in BF16: when multiple identical maximum values appear in a score row, dangerous conditions can be triggered in subsequent calculations [13][18].

Group 4: Proposed Solutions
- The authors propose a straightforward fix: adjust the safe-softmax implementation so that the maximum values in a row are strictly less than 1, which prevents the downstream biases from being triggered in BF16 accumulation [22][25].
- Experiments showed that the modified FlashAttention trains stably, without sudden loss explosions, across various hardware setups [26].

Group 5: Broader Implications
- The findings emphasize that low-precision errors should not be treated as random noise: under specific distributions and discrete events they can form systematic biases [31].
- Model structure can amplify these biases; in particular, similar low-rank update directions in attention mechanisms let errors accumulate in the same direction [31].
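The proposed fix can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's actual FlashAttention kernel: shifting scores by slightly more than the row maximum makes every exponential strictly less than 1, while the normalized softmax output is mathematically unchanged (the extra factor cancels in the ratio). The offset `eps` is a hypothetical parameter introduced here for illustration.

```python
import numpy as np

def safe_softmax(scores):
    # Standard safe softmax: entries tied at the row max map to exp(0) = 1.0
    # exactly, so a row with repeated maxima yields several exact 1.0
    # intermediates -- the condition the article identifies as dangerous.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    return e / e.sum(axis=-1, keepdims=True)

def safe_softmax_strict(scores, eps=1e-3):
    # Sketch of the described fix: shift by (row max + eps) so even the
    # largest entries map to exp(-eps) < 1. The constant factor exp(-eps)
    # cancels in the normalization, so the final probabilities are the
    # same, but the unnormalized intermediates never hit an exact 1.0.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m - eps)
    return e / e.sum(axis=-1, keepdims=True)
```

With a tied row such as `[2.0, 2.0, 0.0]`, the standard version produces two exact 1.0 intermediates, while the strict version keeps them below 1; both normalize to the same probabilities.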
A Tribute to Kimi K2: Full-Pipeline INT4 Quantization-Aware RL Training Based on slime
机器之心· 2026-02-03 10:35
Core Insights
- The SGLang RL team, inspired by the Kimi K2 team, has implemented a full INT4 Quantization-Aware Training (QAT) pipeline that achieves stability and consistency comparable to BF16 full-precision training while enabling extreme compression of large models [2][3][4].

Technical Overview
- The project is a collaboration among multiple teams, including SGLang RL, InfiXAI, and Ant Group, with the functionality shared in the slime and Miles communities [4].
- A complete QAT INT4 closed loop has been established, improving training stability and efficiency in reinforcement learning (RL) scenarios [6].
- Rollout efficiency improves significantly by eliminating cross-machine communication bottlenecks: a 1TB model fits within the memory of a single H200 (141GB) GPU [6][10].

Training Process
- The training phase uses fake quantization to simulate quantization noise while keeping high-precision BF16 weights, so the model adapts to low-precision representations [8][9].
- The Straight-Through Estimator (STE) lets gradients bypass the non-differentiable quantization operations, preserving training continuity [9][11].
- The conversion from BF16 weights to INT4 format happens during the weight-conversion phase, enabling efficient inference [10][25].

Performance Evaluation
- Experiments show that QAT INT4 training maintains robust performance, with the rollout configuration's raw rewards growing consistently relative to the BF16 and FP8 configurations [41][46].
- The INT4 QAT strategy effectively mitigates discrepancies between training and inference outputs, achieving a high degree of consistency [51][56].

Future Directions
- The project aims to further optimize training efficiency and to explore FP4 precision for RL training and inference as NVIDIA's Blackwell architecture becomes more widespread [58][62].
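The fake-quantization-plus-STE step described above can be sketched as follows. This is a generic PyTorch illustration of the technique, assuming per-group symmetric INT4 quantization with a hypothetical `group_size` of 128; the team's actual slime implementation may differ in grouping and scale handling.

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    # Per-group symmetric fake quantization: the stored weights stay in
    # high precision, but the forward pass sees values rounded onto the
    # INT4 grid [-8, 7], so training experiences quantization noise.
    wg = w.reshape(-1, group_size)
    scale = wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scale), -8, 7)
    deq = (q * scale).reshape(w.shape)
    # Straight-Through Estimator: the forward value is the dequantized
    # weight, but backward treats the whole operation as identity, so
    # gradients flow to the underlying high-precision weights unchanged.
    return w + (deq - w).detach()
```

In a QAT forward pass, a layer would use `fake_quant_int4(self.weight)` in place of the raw weight; at export time the same scales and INT4 codes can be stored directly, which corresponds to the BF16-to-INT4 weight-conversion phase the summary mentions.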