FlashAttention-4 Arrives with Native Blackwell GPU Support: Is NVIDIA's Moat Getting Even Deeper?
Nvidia (US:NVDA) · 36Kr · 2025-08-26 12:41

Core Insights
- FlashAttention-4 was announced by Tri Dao, Chief Scientist of Together AI, at the Hot Chips 2025 semiconductor conference, showcasing significant advances in attention mechanisms for AI models [1][2].

Performance Improvements
- FlashAttention-4 achieves up to 22% faster attention performance on Blackwell than NVIDIA's cuDNN library [2].
- The new version incorporates two key algorithmic improvements: a novel online softmax algorithm that skips roughly 90% of output-rescaling steps, and software emulation of the exponential function to raise throughput [6][9]. Minimal sketches of both ideas appear at the end of this article.

Technical Enhancements
- The kernels are implemented in the CUTLASS CuTe-DSL, and in specific computation scenarios Tri Dao's kernel outperforms NVIDIA's latest cuBLAS 13.0 library [5][9].
- FlashAttention-4 runs natively on Blackwell GPUs, resolving the compilation and performance issues of earlier releases [19].

Historical Context
- FlashAttention was first introduced in 2022; it reduced attention's memory complexity from O(N²) to O(N) through a tiling and softmax-rescaling strategy [11], the same block-wise pattern shown in the first sketch below.
- Subsequent versions, FlashAttention-2 and FlashAttention-3, progressively improved speed and efficiency, with FlashAttention-3 reaching up to 740 TFLOPS on H100 GPUs [18][19].

Market Implications
- The advances in FlashAttention may pose challenges for competitors such as AMD, since Tri Dao's team works primarily with NVIDIA GPUs and has not engaged with AMD's ROCm ecosystem [9].
- There is speculation that AMD could invest significantly in its GPU ecosystem, potentially offering financial incentives to attract developers like Tri Dao [9].

Community Engagement
- The FlashAttention GitHub repository has garnered over 19,100 stars, indicating strong community interest and engagement [23].
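
The article does not include FlashAttention-4's kernel source, so the following is a minimal NumPy sketch of the general idea behind an online softmax that skips output rescaling: the running accumulator only needs rescaling when the running maximum of the scores actually increases, and for many blocks it does not. All function and variable names here are illustrative, not taken from the FlashAttention codebase.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """Attention output for one query, processing K/V in blocks.

    A textbook online softmax rescales the accumulator for every
    block; this sketch rescales only when the running maximum of the
    scores actually increases, which is the spirit (not the actual
    code) of FA-4's "skip most output rescaling" optimization.
    Processing K/V block by block is also the tiling strategy that
    gives FlashAttention its O(N) memory footprint.
    """
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])       # unnormalized output accumulator
    for s0 in range(0, K.shape[0], block):
        k_blk, v_blk = K[s0:s0 + block], V[s0:s0 + block]
        s = k_blk @ q                # attention scores for this block
        m_new = max(m, float(s.max()))
        if m_new > m:                # rescale ONLY on a new maximum
            scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
            l *= scale
            acc *= scale
            m = m_new
        p = np.exp(s - m)            # this block's softmax numerators
        l += p.sum()
        acc += p @ v_blk
    return acc / l                   # normalize once at the end

# Check against a naive softmax-attention reference.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(32, 8)), rng.normal(size=(32, 4))
w = np.exp(K @ q - (K @ q).max()); w /= w.sum()
assert np.allclose(online_softmax_attention(q, K, V), w @ V)
```

In a real FlashAttention kernel this loop runs per tile of queries in on-chip memory, so the full N×N score matrix is never materialized; the roughly 90% skip rate reported for FA-4 [6] presumably reflects how rarely the running maximum increases across blocks in practice.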
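
The "software emulation of the exponential function" point can likewise be illustrated with a standard emulation pattern: range-reduce to 2**f with f in [0, 1), evaluate a small polynomial using ordinary multiply-adds instead of a special-function unit, and scale by a power of two. The cubic coefficients below come from an illustrative least-squares fit, not from FlashAttention-4's published constants.

```python
import numpy as np

# Illustrative cubic fit to 2**f on [0, 1); FlashAttention-4's actual
# polynomial and coefficients are not given in the article.
_f = np.linspace(0.0, 1.0, 1024)
_C = np.polyfit(_f, np.exp2(_f), 3)        # highest-degree term first

def exp2_emulated(y):
    """Software 2**y: range reduction plus a cubic polynomial.

    Split y = n + f with integer n and f in [0, 1), so that
    2**y = 2**n * 2**f. The 2**f factor is a cubic evaluated with
    plain multiply-adds rather than a special-function unit, which
    is the kind of trade FA-4 reportedly makes to raise throughput.
    """
    n = np.floor(y)
    f = y - n                              # fractional part in [0, 1)
    p = np.polyval(_C, f)                  # Horner-form evaluation
    return np.ldexp(p, n.astype(np.int64))  # exact scaling by 2**n

def exp_emulated(x):
    return exp2_emulated(x * np.log2(np.e))  # exp(x) = 2**(x*log2(e))

# Relative error stays small across a typical attention-score range.
x = np.linspace(-20.0, 20.0, 10001)
assert np.max(np.abs(exp_emulated(x) / np.exp(x) - 1.0)) < 1e-3
```

On a GPU the payoff is throughput: the polynomial runs on the plentiful fused multiply-add pipelines instead of queueing on the comparatively scarce special-function units that compute exponentials in hardware.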