Core Insights
- NVIDIA's new Blackwell B200 GPU suffers significant wasted compute due to hardware-software mismatches, leaving roughly 60% of its capability unused [1][4].

Group 1: GPU Performance and Issues
- The Blackwell B200's tensor cores deliver 2.25 PFLOPS, double the previous-generation Hopper H100, which should in theory translate into substantially faster attention computation [3].
- The supporting computational units did not keep pace with the tensor cores, however, creating a bottleneck: memory read/write operations and exponential calculations take 25%-60% longer than the matrix multiplications they feed [3][4].

Group 2: FlashAttention-4 Solutions
- FlashAttention-4, developed by Tri Dao's team together with collaborators at Meta, addresses the Blackwell GPU's performance bottlenecks with three optimization strategies [2][4].
- The first strategy raises exponential throughput by approximating the exponential function with a polynomial, shifting work from the slower MUFU (special-function) unit onto the faster FMA units [6].
- The second strategy restructures the computation pipeline to maximize parallel processing, overlapping the softmax calculation with matrix multiplication [9].
- The third strategy anticipates the next hardware iteration, the upcoming B300/GB300 GPU, which improves throughput for exponential operations [11].

Group 3: Development Efficiency
- FlashAttention-4 moved from C++ to a Python-based domain-specific language (CuTe-DSL), cutting compilation time sharply: forward-pass compilation dropped from 55 seconds to 2.5 seconds, and backward-pass compilation from 45 seconds to 1.4 seconds [12][13].
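The FMA-based exponential trick from Group 2 can be sketched in plain Python: replace the exponential with a low-degree polynomial for 2^x evaluated in Horner form, where every step is a multiply-add (the FMA-friendly shape that fast GPU units execute, instead of routing through the special-function MUFU unit). The polynomial degree and coefficients below are illustrative assumptions, not FlashAttention-4's actual kernel constants.

```python
import math

# Illustrative coefficients: degree-6 Taylor polynomial for 2^f = exp(f * ln 2)
# on f in [0, 1), stored highest degree first for Horner evaluation.
# (FA4's real kernel would use tuned minimax coefficients; these are assumptions.)
LN2 = math.log(2.0)
COEFFS = [LN2**k / math.factorial(k) for k in range(6, -1, -1)]

def exp2_poly(x: float) -> float:
    """Approximate 2**x via range reduction + Horner evaluation (FMA-shaped steps)."""
    n = math.floor(x)               # integer part: handled exactly by exponent scaling
    f = x - n                       # fractional part in [0, 1)
    acc = 0.0
    for c in COEFFS:                # each step is one multiply-add: acc = acc*f + c
        acc = acc * f + c
    return math.ldexp(acc, int(n))  # acc * 2**n, exact power-of-two scaling

# The approximation stays well under 1e-4 relative error on a moderate range.
for x in [-3.7, -0.5, 0.0, 0.9, 2.3]:
    assert abs(exp2_poly(x) - 2.0**x) / 2.0**x < 1e-4
```

Softmax only needs the exponential up to a common scale factor, which is why a 2^x approximation over a reduced range (plus an exact power-of-two shift) is sufficient in practice.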
- In practice, FlashAttention-4 reaches a peak forward-pass throughput of 1613 TFLOPS on the B200, a 71% utilization of the theoretical peak [13].

Group 4: Competitive Advantage
- FlashAttention-4 outperforms NVIDIA's cuDNN 9.13 by a factor of 1.1-1.3x and the widely used Triton framework by 2.1-2.7x, particularly in core scenarios such as long sequences and causal masking [15].
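The 71% utilization figure follows directly from the two throughput numbers quoted above, as a quick arithmetic check:

```python
# Numbers from the article: B200 dense tensor-core peak and FA4's measured
# forward-pass throughput.
peak_tflops = 2250        # 2.25 PFLOPS = 2250 TFLOPS
achieved_tflops = 1613    # FlashAttention-4 forward pass on B200

utilization = achieved_tflops / peak_tflops
print(f"{utilization:.1%}")  # → 71.7%, the ~71% utilization quoted above
```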
NVIDIA's flagship B200 wastes 60% of its compute; a Princeton team steps in and lifts utilization to 71%