FlashAttention-4 makes a stunning debut with native Blackwell GPU support: is NVIDIA's moat getting even deeper?

Core Viewpoint
- FlashAttention-4, introduced by Tri Dao at the Hot Chips 2025 conference, demonstrates significant performance improvements over previous versions and over competing implementations, particularly on NVIDIA's GPU architecture [1][2][10].

Summary by Sections

FlashAttention-4 Introduction
- FlashAttention-4 is reported to be up to 22% faster than the attention implementation in NVIDIA's cuDNN library on the Blackwell architecture [2].
- The new version incorporates two key algorithmic improvements: a new online softmax algorithm that skips about 90% of output rescaling, and a software simulation of the exponential function for better throughput (both ideas are sketched after this summary) [4][5].

Performance Enhancements
- A kernel developed by Tri Dao's team also outperforms NVIDIA's latest cuBLAS 13.0 library in specific matrix-multiplication scenarios, particularly when the reduction dimension K is small (a shape illustration follows the summary) [7].
- FlashAttention-4 is written in the CUTLASS CuTe Python DSL, which is significantly more challenging to port to ROCm HIP than CUDA C++ is [6].

Competitive Landscape
- The development of FlashAttention is seen as a core advantage for NVIDIA, as Tri Dao and his team primarily use NVIDIA GPUs and have open-sourced much of their work for the developer community [10].
- For AMD, the implication is that financial incentives may be necessary to encourage Tri Dao's team to develop for ROCm [10].

Historical Context and Evolution
- FlashAttention was first introduced in 2022 to address the quadratic time and memory overhead of standard attention, reducing memory complexity from O(N²) to O(N) (the naive baseline is sketched below for contrast) [12].
- Subsequent versions, including FlashAttention-2 and FlashAttention-3, continued to enhance performance, with FlashAttention-2 achieving speedups of 2-4x over its predecessor [21].

Technical Innovations
- FlashAttention-3 achieved a 1.5-2.0x speedup over FlashAttention-2, reaching up to 740 TFLOPS on H100 GPUs [23].
- FlashAttention-4 introduces native support for Blackwell GPUs, addressing the compilation and performance issues earlier versions had on that architecture [24].

Community Engagement
- The GitHub repository for FlashAttention has garnered over 19,100 stars, indicating strong community interest and engagement [25].
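The rescale-skipping policy of FlashAttention-4 itself has not been published in detail, so the following NumPy sketch (with a `margin` parameter that is an assumption, not from the source) only illustrates the underlying idea: in online softmax, any fixed reference value yields a mathematically exact result, since the max subtraction exists purely to control numerical range. The accumulator therefore only needs rescaling when a block's max exceeds the running reference by enough to risk overflow, which is rare in practice.

```python
import numpy as np

def attention_row_online(q, K, V, block=128, margin=5.0):
    """One query row of attention with online softmax.

    Rescaling of the accumulator is skipped unless a block's max score
    exceeds the running reference by `margin`. This loosely mimics the
    rescale-skipping idea reported for FlashAttention-4; the margin
    policy here is an assumption, not the published algorithm.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = -np.inf                    # running reference (approximate max)
    denom = 0.0                    # running softmax denominator
    acc = np.zeros(V.shape[-1])    # running weighted sum of values
    for j in range(0, K.shape[0], block):
        s = (K[j:j+block] @ q) * scale   # one block of scores
        bm = s.max()
        if bm > m + margin:              # rare case: rescale accumulator
            c = np.exp(m - bm)
            acc *= c
            denom *= c
            m = bm
        p = np.exp(s - m)                # bounded above by exp(margin)
        acc += p @ V[j:j+block]
        denom += p.sum()
    return acc / denom

# Sanity check against a fully materialized softmax.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 16))
s = (K @ q) / np.sqrt(64)
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V
assert np.allclose(attention_row_online(q, K, V), ref)
```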
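For contrast with the online sketch above, here is a minimal reference implementation of standard attention, showing where the O(N²) intermediate comes from: the full (N, N) score matrix is materialized at once. FlashAttention's tiling keeps only one block of scores in fast memory at a time, which is the O(N²) to O(N) memory reduction the summary cites.

```python
import numpy as np

def attention_naive(Q, K, V):
    # Standard attention: builds the full (N, N) score matrix, so
    # intermediate memory grows as O(N^2) with sequence length N.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))   # row-wise softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                    # shape (N, d_v)
```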
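On the second improvement: softmax exponentials are normally computed by the GPU's special-function (MUFU) units, whose throughput is limited, and FlashAttention-4 reportedly simulates the exponential in software instead. The exact scheme is not public, so the toy below only shows the standard recipe such schemes build on: range-reduce to 2^r with r in [0, 1), evaluate a short polynomial with fused multiply-adds, and reapply the integer exponent exactly. The coefficients here come from a quick least-squares fit, purely for illustration.

```python
import numpy as np

# Least-squares cubic fit of 2**r on [0, 1); a real kernel would bake
# in fixed minimax coefficients, but the idea is the same.
_r = np.linspace(0.0, 1.0, 257)
_C = np.polyfit(_r, np.exp2(_r), 3)    # highest-degree coefficient first

def exp2_approx(x):
    k = np.floor(x)                     # integer part, applied exactly below
    r = x - k                           # fractional part in [0, 1)
    # Horner evaluation: three fused multiply-adds on a GPU
    p = ((_C[0] * r + _C[1]) * r + _C[2]) * r + _C[3]
    return np.ldexp(p, k.astype(np.int32))   # p * 2**k with exact exponent math

x = np.random.default_rng(0).uniform(-20, 20, 10_000)
rel_err = np.max(np.abs(exp2_approx(x) - np.exp2(x)) / np.exp2(x))
print(f"max relative error: {rel_err:.1e}")   # roughly 1e-4 for a cubic
```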
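Finally, to make the cuBLAS comparison concrete: in a GEMM C = A·B with A of shape (M, K) and B of shape (K, N), K is the reduction dimension, the one summed over. The snippet below uses hypothetical shapes (not the benchmark configuration from the article) to show the small-K regime, where each output element is a short dot product and the kernel is dominated by streaming the large output matrix, a regime where a specialized kernel can beat a general-purpose library.

```python
import torch

# Illustrative small-K GEMM shape (hypothetical numbers, not the
# article's benchmark). C = A @ B sums over K, so with K = 64 each of
# the M*N outputs is only a 64-term dot product.
M, N, K = 8192, 8192, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(M, K, device=device, dtype=torch.bfloat16)
B = torch.randn(K, N, device=device, dtype=torch.bfloat16)
C = A @ B   # on CUDA devices PyTorch dispatches this matmul to cuBLAS
```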