Why does BF16 FlashAttention "blow up" training? Tsinghua gives the first mechanistic explanation, and stabilizes training with a minimal change
机器之心· 2026-03-03 23:19
One-sentence summary: a "black magic" phenomenon that has puzzled the community for years has finally been taken apart. In low-precision (e.g. BF16) training, FlashAttention does not fail at random: under specific conditions it introduces a directional numerical bias, which is continuously amplified along the similar low-rank update directions that emerge in attention, eventually pushing weight spectral norms and activations out of control until the loss suddenly explodes. The paper also proposes a minimal modification, confined to safe softmax and leaving the model essentially untouched, which measurably stabilizes training.

Causal-chain overview (Figure 1 of the paper)

Paper title: Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Background: low-precision training is increasingly a hard requirement, but attention is more sensitive than you might think. The reality of large-model training is that memory and throughput decide everything. Industry widely uses BF16/FP16 in mixed-precision setups, and even pushes the FFN down to FP8, in exchange for higher training efficiency. But engineering practice is equally unforgiving: the closer you get to the "precision limit", the more likely training is to exhibit hard-to-explain instability.

FlashAttention is the key acceleration component for long-context training and has become a near-universal default. The problem is that the community has long had a reproducible yet unexplained failure case: such problems ...
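To make the mechanism concrete, here is a minimal sketch of the safe-softmax trick that FlashAttention builds on, together with a crude BF16 emulation. The `to_bf16` helper and the truncation-based rounding are my own illustrative assumptions, not the paper's proposed fix; the point is only that dropping mantissa bits by truncation is a directional (non-zero-mean) rounding, which is the kind of systematic bias the summary says gets amplified over training steps.

```python
import math
import struct

def safe_softmax(scores):
    """Numerically stable ("safe") softmax: subtract the max before
    exponentiating, so exp() never overflows even for large logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

def to_bf16(x: float) -> float:
    """Crude BF16 emulation (an assumption for illustration): keep only
    the top 16 bits of the float32 pattern. This truncation rounds in a
    fixed direction -- a biased error, unlike round-to-nearest."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Plain softmax on large logits would overflow; safe softmax does not.
logits = [1000.0, 1000.5, 999.0]
probs = safe_softmax(logits)

# The same row after simulated BF16 rounding: tiny per-element error,
# but with a consistent sign -- the seed of the accumulation the paper
# describes.
probs_lowp = safe_softmax([to_bf16(s) for s in logits])
```

Note that `math.exp(1000.0)` on its own raises `OverflowError`; the max-subtraction is what keeps the kernel usable at all, which is why any fix has to live inside this step rather than around it.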
A tribute to Kimi K2: full-pipeline INT4 quantization-aware RL training based on slime
机器之心· 2026-02-03 10:35
Core Insights
- The SGLang RL team has successfully implemented the INT4 Quantization-Aware Training (QAT) process inspired by the Kimi K2 team, achieving stability and consistency comparable to BF16 full-precision training while enabling extreme compression of large models [2][3][4].

Technical Overview
- The project is a collaboration among multiple teams, including SGLang RL, InfiXAI, Ant Group, and others, with functionality shared in the slime and Miles communities [4].
- A complete QAT INT4 closed-loop solution has been established, enhancing training stability and efficiency in reinforcement learning (RL) scenarios [6].
- Rollout efficiency has improved significantly by eliminating cross-machine communication bottlenecks, allowing 1TB models to fit within a single H200 (141GB) GPU's memory [6][10].

Training Process
- The training phase uses fake quantization to simulate quantization noise while maintaining high-precision BF16 weights, ensuring the model adapts to low-precision representations [8][9].
- The Straight-Through Estimator (STE) technique lets gradients bypass the non-differentiable quantization operations, preserving training continuity [9][11].
- The transition from BF16 weights to INT4 format happens during the weight-conversion phase, enabling efficient inference [10][25].

Performance Evaluation
- Experiments demonstrate that the QAT INT4 training approach maintains robust performance, with the rollout configuration showing consistent growth in raw rewards compared to BF16 and FP8 configurations [41][46].
- The INT4 QAT strategy effectively mitigates discrepancies between training and inference outputs, achieving a high degree of consistency [51][56].

Future Directions
- The project aims to explore further optimizations to enhance training efficiency and to investigate FP4 precision for RL training and inference as NVIDIA's Blackwell architecture becomes more prevalent [58][62].
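The fake-quantization step described above can be sketched in a few lines. This is a generic per-tensor symmetric INT4 fake-quantizer, not slime's actual kernel (which likely uses per-group scales and fused CUDA ops): weights are snapped onto the 16-level signed INT4 grid [-8, 7] and immediately dequantized, so the forward pass "sees" INT4 rounding noise while the master weights stay in high precision. The STE part is then just the convention that the backward pass treats `round()` as the identity.

```python
def fake_quant_int4(weights):
    """Per-tensor symmetric fake INT4 quantization (illustrative sketch).

    Each float is mapped to the signed INT4 grid [-8, 7] via a shared
    scale, then dequantized back to float. The output still lives in
    floating point, but only takes 16 distinct values per tensor.
    """
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid 0 for all-zero input
    out = []
    for w in weights:
        q = max(-8, min(7, round(w / scale)))  # quantize: clamp to INT4 range
        out.append(q * scale)                  # dequantize back to float
    return out

# Straight-Through Estimator: round() has zero gradient almost everywhere,
# so in the backward pass we pretend it was the identity and pass the
# upstream gradient through to the BF16 master weights unchanged.
def ste_backward(upstream_grad):
    return upstream_grad

ws = [0.31, -0.07, 0.7, 0.0]
qs = fake_quant_int4(ws)  # each entry lands on a multiple of the scale
```

In a real framework this would be written as `w + (quantize(w) - w).detach()` (or the equivalent stop-gradient), which implements exactly the pass-through behavior of `ste_backward` without a custom autograd rule.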