FlashAttention-4 Officially Released: Major Algorithm Pipeline Overhaul, Matrix-Multiplication-Level Speed
机器之心 · 2026-03-06 04:31
Core Insights
- FlashAttention-4 has officially launched after a year of development, marking a significant update in deep-learning optimization [1]
- Core author Tri Dao highlights that the attention mechanism now executes nearly as fast as matrix multiplication on Blackwell GPUs [1]

Hardware Trends
- The AI industry is rapidly transitioning to Blackwell-architecture systems such as the B200 and GB200, which exhibit asymmetric hardware scaling [5]
- Tensor Core throughput rose 2.25x from the Hopper H100 to the Blackwell B200, while shared-memory bandwidth stayed roughly flat [6]

Attention Mechanism Optimization
- FlashAttention-4 aims to maximize the overlap between matrix multiplication and the other, now-bottlenecked resources, reaching up to 1605 TFLOPs/s on the B200 (BF16), a utilization rate of 71% [10]
- The new algorithm attacks these bottlenecks with, among other changes, a polynomial approximation of the exponential function and a new online softmax that avoids about 90% of unnecessary rescaling; both techniques are sketched after this summary [1][10]

Collaborative Design Features
- The design leverages Blackwell's new hardware features, including Tensor Memory (TMEM) and the fully asynchronous fifth-generation Tensor Cores, to raise performance [12]
- 2-CTA MMA lets two CTAs jointly execute a UMMA operation, reducing redundant data transfer and resource usage [13]

Performance Benchmarking
- FlashAttention-4 outperforms cuDNN 9.13 and Triton in both forward and backward passes, with speedups of 1.1–1.3x and 2.1–2.7x, respectively [19]
- The results indicate that FlashAttention-4 can substantially improve the efficiency of attention mechanisms in long-sequence scenarios [19]

Community Impact
- The release has generated significant interest; PyTorch announced support for the FlashAttention-4 backend, letting researchers prototype custom attention variants more efficiently [24][26]
- Under constrained workloads, users can see speedups of 1.2–3.2x over Triton, removing the usual trade-off between flexibility and high performance [28] (a rough timing harness for long-sequence attention follows at the end of this summary)
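The summary attributes part of the speedup to replacing the hardware exponential with a polynomial approximation. As a minimal NumPy illustration of the general technique (not FlashAttention-4's kernel code; the degree, coefficients, and precision here are all assumptions), the sketch below range-reduces exp(x) to 2^f with |f| <= 0.5, evaluates a degree-3 polynomial for 2^f, and reassembles the result with an exact power-of-two scale:

```python
import numpy as np

def exp_poly(x):
    """exp(x) via range reduction plus a degree-3 polynomial.

    exp(x) = 2**(x*log2(e)); split t = x*log2(e) into an integer k and a
    fraction f in [-0.5, 0.5], approximate 2**f = exp(f*ln2) with a cubic,
    then scale by 2**k exactly using ldexp (no second exponential needed).
    """
    t = np.asarray(x, dtype=np.float64) * np.log2(np.e)
    k = np.rint(t)                      # integer part of the exponent
    f = t - k                           # fractional part, |f| <= 0.5
    ln2 = np.log(2.0)
    # Horner form of 1 + u + u**2/2 + u**3/6 with u = f*ln2 (raw Taylor
    # terms; a production kernel would fit minimax coefficients instead)
    p = 1.0 + f * (ln2 + f * (ln2**2 / 2.0 + f * (ln2**3 / 6.0)))
    return np.ldexp(p, k.astype(int))   # p * 2**k via exact exponent shift

x = np.linspace(-20.0, 20.0, 101)
rel_err = np.max(np.abs(exp_poly(x) - np.exp(x)) / np.exp(x))
print(f"max relative error: {rel_err:.1e}")  # about 6e-4 for the cubic fit
```

A cubic already lands within roughly 6e-4 relative error, which is loose for FP64 but in the right ballpark for BF16 attention scores; the motivation reported for this change is to move exponential work off the oversubscribed special-function unit and onto the far more plentiful FMA pipeline.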
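The second softmax change is the online softmax that avoids most rescaling. In the classic online softmax, the running accumulator is rescaled every time a block raises the running maximum; the description above suggests FlashAttention-4 only rescales when the maximum moves enough to matter. The sketch below is a hedged reconstruction of that idea in NumPy (the threshold, block size, and data layout are invented for illustration, and it tracks only the normalizer, not the full attention accumulator):

```python
import numpy as np

def streaming_softmax(x, block=128, slack=8.0):
    """Softmax over a long vector, one block at a time, with lazy rescaling.

    Classic online softmax rescales the running sum d whenever a block's
    max exceeds the running max m. Here we keep the stale m unless the new
    block max beats it by `slack`, so exp(x - m) merely grows a little
    (up to e**slack) instead of overflowing, and most rescales are skipped.
    """
    m = -np.inf     # max currently baked into the exponents
    d = 0.0         # running sum of exp(x_i - m)
    rescales = 0
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        bm = blk.max()
        if bm > m + slack:           # rescale only on a significant jump
            d *= np.exp(m - bm)      # re-base the old sum onto the new max
            m = bm
            rescales += 1
        d += np.exp(blk - m).sum()
    return np.exp(x - m) / d, rescales

x = np.random.default_rng(0).normal(size=65536) * 4.0
p, n_rescales = streaming_softmax(x)
print(p.sum(), n_rescales)  # sums to 1.0; rescales stay in the single digits
```

The trade-off is purely numerical: between rescales, the exponentiated values can exceed 1 by up to e**slack, which is safe in an FP32 accumulator but would call for a tighter threshold at lower precision.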
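The PyTorch integration above is reported at the announcement level only, so nothing here depends on a FlashAttention-4-specific API. To observe the long-sequence behavior the benchmarks describe on your own hardware, a generic timing harness over torch.nn.functional.scaled_dot_product_attention (which dispatches to whatever fused attention backend your build provides) is enough; the shapes, dtype, and iteration counts below are arbitrary choices, not the article's benchmark setup:

```python
import time
import torch
import torch.nn.functional as F

def bench_sdpa(seq_len, heads=16, head_dim=128, iters=20, device="cuda"):
    """Average forward-pass latency of one attention call (CUDA GPU required)."""
    q, k, v = (torch.randn(1, heads, seq_len, head_dim,
                           device=device, dtype=torch.bfloat16)
               for _ in range(3))
    for _ in range(3):                        # warm-up / backend selection
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

for n in (1024, 4096, 16384):                 # attention cost grows ~n**2
    print(f"seq_len={n:6d}: {bench_sdpa(n) * 1e3:8.3f} ms")
```

Whichever backend is dispatched (PyTorch exposes selection via the torch.nn.attention.sdpa_kernel context manager if you need to pin one), the quadratic growth with sequence length is what makes the reported 1.1–2.7x kernel-level wins matter in practice.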