NVIDIA's most powerful B200 wastes 60% of its compute! Princeton team steps in, pushing utilization to 71%
Nvidia (US:NVDA) | QbitAI (量子位) · 2026-03-18 00:21

Core Viewpoint
- Due to hardware and software mismatches, NVIDIA's Blackwell B200 GPU leaves roughly 60% of its computational resources idle [1][15].

Group 1: Performance Issues
- The Blackwell B200's tensor cores deliver 2.25 PFLOPS, double that of the previous Hopper H100 generation [7].
- The supporting computational units have not kept pace with the tensor cores, creating a performance bottleneck [12].
- The time spent on memory read/write operations and exponential calculations now exceeds matrix-multiplication time by 25%-60%, leaving compute resources idle for much of each step [13][14].

Group 2: FlashAttention-4 Solution
- FlashAttention-4, developed by a team including Tri Dao and researchers at Meta, targets the Blackwell GPU's performance bottlenecks [4][5].
- The algorithm raises utilization from the industry norm of 20%-30% to 71% [4][32].
- It employs three main optimization strategies:
  1. Software emulation of exponential functions for higher throughput, plus conditional softmax rescaling that skips unnecessary recomputation [18][19].
  2. A restructured computation pipeline that maximizes parallelism by overlapping softmax calculations with matrix multiplications [23][24].
  3. Design headroom for future hardware upgrades, preserving compatibility and room for further optimization [27][28].

Group 3: Development Efficiency
- FlashAttention-4 is written entirely in Python using the CuTe-DSL framework, eliminating C++ code and sharply improving compilation efficiency [29].
- Compilation is up to 30 times faster than FlashAttention-3, with forward-pass compile time dropping from 55 seconds to 2.5 seconds [30][32].
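The conditional softmax rescaling mentioned in Group 2 builds on the "online softmax" used by all FlashAttention generations: scores are processed block by block with a running max and running normalizer, and the accumulated values are rescaled only when a new block actually raises the max. A minimal sketch of that idea in plain Python (illustrative only; function and variable names are my own, not from the FlashAttention-4 kernel):

```python
import math

def online_softmax(scores, block=4):
    """Softmax computed in one streaming pass over fixed-size blocks.
    Rescaling of the running sum and stored exponentials happens only
    when a block raises the running max -- the 'conditional rescaling'
    trick that avoids redundant work on blocks that don't change it."""
    m = -math.inf   # running max
    l = 0.0         # running normalizer
    exps = []       # exponentials, scaled relative to the current m
    for i in range(0, len(scores), block):
        chunk = scores[i:i + block]
        m_new = max(chunk)
        if m_new > m:  # rescale only when the max actually changes
            scale = math.exp(m - m_new)  # 0.0 on the first block (m = -inf)
            l *= scale
            exps = [e * scale for e in exps]
            m = m_new
        for s in chunk:
            e = math.exp(s - m)
            l += e
            exps.append(e)
    return [e / l for e in exps]
```

In a real kernel the saved rescaling is a multiply over a whole accumulator tile, so skipping it when the max is unchanged is a meaningful win; here it is only a few list operations, but the control flow is the same.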
Group 4: Competitive Advantage
- FlashAttention-4 outperforms NVIDIA's cuDNN 9.13, running 1.1-1.3 times faster, and is 2.1-2.7 times faster than the Triton framework [34].
- It is particularly strong in core training and inference scenarios such as long sequences and causal masking [37].
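The other Group 2 strategy, software emulation of the exponential, replaces the GPU's special-function-unit exponential with a short polynomial evaluated on the far more plentiful multiply-add units. A numerical sketch of that idea in plain Python (the range reduction is standard; the degree-3 Taylor coefficients are illustrative assumptions, not FlashAttention-4's actual polynomial):

```python
import math

def exp2_poly(x):
    """Software-emulated 2**x: split x into integer and fractional parts
    (2**x = 2**n * 2**f), approximate 2**f on [0, 1) with a degree-3
    polynomial built from ordinary multiply-adds, then scale by 2**n."""
    n = math.floor(x)
    f = x - n                          # fractional part, in [0, 1)
    t = f * math.log(2.0)              # 2**f = e**t
    p = 1.0 + t + t * t / 2.0 + t * t * t / 6.0  # Taylor fit of e**t
    return math.ldexp(p, n)            # exact scaling: p * 2**n
```

On hardware, the polynomial becomes a handful of fused multiply-adds that pipeline alongside the tensor-core matrix multiplications, instead of queueing on the scarce special-function units; a production kernel would use minimax coefficients for tighter error than this Taylor fit.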
