FlashAttention

FlashAttention-4 Arrives with a Bang, Natively Supporting Blackwell GPUs: Is NVIDIA's Moat Getting Even Deeper?
36Kr · 2025-08-26 12:41
Core Insights
- FlashAttention-4 was announced by Tri Dao, Chief Scientist of Together AI, at the Hot Chips 2025 semiconductor conference, showcasing significant advances in attention mechanisms for AI models [1][2].

Performance Improvements
- FlashAttention-4 runs up to 22% faster on Blackwell than NVIDIA's cuDNN library implementation [2].
- The new version incorporates two key algorithmic improvements: a novel online softmax algorithm that skips about 90% of output rescaling steps (see the sketch after this summary), and a software-emulated exponential that improves throughput [6][9].

Technical Enhancements
- The implementation is built on the CUTLASS CuTe-DSL, and Tri Dao's kernel outperforms NVIDIA's latest cuBLAS 13.0 library in specific computation scenarios [5][9].
- FlashAttention-4 runs natively on Blackwell GPUs, resolving earlier compilation and performance issues [19].

Historical Context
- FlashAttention was first introduced in 2022; it reduces attention's memory complexity from O(N²) to O(N) through a tiling and softmax-rescaling strategy [11].
- Subsequent versions, FlashAttention-2 and FlashAttention-3, progressively improved speed and efficiency, with FlashAttention-3 reaching up to 740 TFLOPS on H100 GPUs [18][19].

Market Implications
- The advances in FlashAttention may pose challenges for competitors such as AMD, since Tri Dao's team primarily uses NVIDIA GPUs and has not engaged with AMD's ROCm ecosystem [9].
- There is speculation that AMD could invest significantly to strengthen its GPU ecosystem, potentially offering financial incentives to attract developers like Tri Dao [9].

Community Engagement
- The FlashAttention GitHub repository has garnered over 19,100 stars, indicating strong community interest and engagement [23].
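The rescaling-skip improvement is easy to see in isolation. Below is a minimal NumPy sketch (our own illustration, not FlashAttention-4's actual kernel; all names are hypothetical) of online, streaming softmax attention in which the output accumulator is rescaled only when the running row maximum actually increases. Whenever the maximum does not change, the correction factor exp(m_old - m_new) equals 1 and the rescale is safely skipped; the same streaming structure is also what keeps memory at O(N) instead of materializing the N×N score matrix.

```python
import numpy as np

def online_softmax_attention(Q, K, V, block_size=64):
    """Streaming softmax attention, one query row at a time.

    Illustrative sketch only: the accumulator is rescaled only when the
    running row max increases; otherwise the correction factor is
    exactly 1 and the rescale is skipped. Per-row state is O(d),
    independent of sequence length N.
    """
    N, d = Q.shape
    out = np.zeros((N, d))
    for i in range(N):
        q = Q[i]
        m = -np.inf            # running max of attention logits
        denom = 0.0            # running softmax denominator
        acc = np.zeros(d)      # unnormalized output accumulator
        for s in range(0, K.shape[0], block_size):
            k_blk, v_blk = K[s:s + block_size], V[s:s + block_size]
            logits = (k_blk @ q) / np.sqrt(d)
            m_blk = logits.max()
            if m_blk > m:
                # Max increased: rescale previous partial results.
                scale = np.exp(m - m_blk)   # exp(-inf) == 0.0 on first block
                acc *= scale
                denom *= scale
                m = m_blk
            # else: correction factor exp(m - m) == 1, rescale skipped.
            p = np.exp(logits - m)
            denom += p.sum()
            acc += p @ v_blk
        out[i] = acc / denom
    return out
```

The result matches dense softmax attention computed in one shot, which is a quick way to sanity-check the streaming bookkeeping:

```python
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(online_softmax_attention(Q, K, V), ref)
```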
FlashAttention-4 Arrives with a Bang, Natively Supporting Blackwell GPUs: Is NVIDIA's Moat Getting Even Deeper?
机器之心 (Machine Heart) · 2025-08-26 09:38
Core Viewpoint
- FlashAttention-4, introduced by Tri Dao at the Hot Chips 2025 conference, demonstrates significant performance improvements over previous versions and competing implementations, particularly on NVIDIA's GPU architecture [1][2][10].

Summary by Sections

FlashAttention-4 Introduction
- FlashAttention-4 is reported to be up to 22% faster than NVIDIA's cuDNN library implementation on the Blackwell architecture [2].
- The new version incorporates two key algorithmic improvements: a new online softmax algorithm that skips about 90% of output rescaling steps, and a software-emulated exponential for better throughput (see the sketch after this summary) [4][5].

Performance Enhancements
- The kernel developed by Tri Dao's team outperforms NVIDIA's latest cuBLAS 13.0 library in specific computation scenarios, particularly when the reduction dimension K is small [7].
- FlashAttention-4 is written in the CUTLASS CuTe Python DSL, which is significantly more challenging to port to ROCm HIP than CUDA C++ is [6].

Competitive Landscape
- The development of FlashAttention is seen as a core advantage for NVIDIA, as Tri Dao and his team primarily use NVIDIA GPUs and have open-sourced much of their work for the developer community [10].
- For AMD, the implication is that financial incentives may be necessary to encourage Tri Dao's team to develop for ROCm [10].

Historical Context and Evolution
- FlashAttention was first introduced in 2022, addressing the quadratic time and memory overhead of standard attention by reducing memory complexity from O(N²) to O(N) [12].
- Subsequent versions continued to enhance performance, with FlashAttention-2 achieving speedups of 2-4x over its predecessor [21].

Technical Innovations
- FlashAttention-3 achieved a 1.5-2.0x speedup over FlashAttention-2, reaching up to 740 TFLOPS on H100 GPUs [23].
- FlashAttention-4 introduces native support for Blackwell GPUs, resolving earlier compilation and performance issues [24].

Community Engagement
- The GitHub repository for FlashAttention has garnered over 19,100 stars, indicating strong community interest and engagement [25].
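The "software-emulated exponential" both write-ups mention refers to computing exp with ordinary multiply-add instructions instead of the GPU's special-function unit (the MUFU, which evaluates exp2 on NVIDIA hardware). The standard recipe: rewrite exp(x) as 2**(x * log2(e)), split that exponent into an integer part applied by adjusting the floating-point exponent and a fractional part approximated by a short polynomial. The NumPy sketch below is a hedged illustration of that recipe using a truncated Taylor polynomial; it is not FlashAttention-4's actual code, which would use a tuned minimax polynomial and fused multiply-adds on CUDA cores.

```python
import numpy as np

def exp_via_poly(x):
    """Approximate exp(x) using only multiply-add-friendly operations.

    Sketch of software exponential emulation (illustrative, not
    FlashAttention-4's kernel code): exp(x) = 2**(x * log2(e)).
    Split t = x * log2(e) into integer n and fraction f in [0, 1);
    evaluate 2**f with a polynomial in Horner form; apply 2**n by
    adjusting the floating-point exponent (ldexp).
    """
    t = np.asarray(x, dtype=np.float64) * np.log2(np.e)
    n = np.floor(t)
    f = t - n                     # fractional part, in [0, 1)
    # Degree-4 Taylor coefficients of 2**f = exp(f*ln 2) around f = 0:
    # 1, ln 2, (ln 2)^2/2!, (ln 2)^3/3!, (ln 2)^4/4!. A production
    # kernel would use a minimax fit for lower worst-case error.
    c0, c1, c2, c3, c4 = 1.0, 0.6931472, 0.2402265, 0.0555041, 0.0096181
    p = (((c4 * f + c3) * f + c2) * f + c1) * f + c0   # Horner evaluation
    return np.ldexp(p, n.astype(np.int32))             # p * 2**n
```

With the degree-4 Taylor polynomial the maximum relative error against np.exp stays below about 1e-3, roughly the fidelity half-precision attention weights need; a higher degree or a minimax fit tightens it further:

```python
x = np.linspace(-10.0, 10.0, 10001)
rel_err = np.abs(exp_via_poly(x) - np.exp(x)) / np.exp(x)
print(rel_err.max())   # on the order of 7e-4
```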