Cursor builds MXFP8 kernels from scratch for Blackwell: MoE layers 3.5x faster, end-to-end training 1.5x faster
机器之心· 2025-08-22 04:58
Core Insights
- The article examines the challenges Cursor encountered when upgrading from NVIDIA's Hopper H100s to the new Blackwell B200s, where inefficiencies in the MoE (Mixture of Experts) training layer held back training speed despite the stronger hardware (a minimal sketch of what an MoE layer computes appears after the groups below) [2][20].

Group 1: Performance Bottlenecks
- The move to Blackwell B200s raised raw hardware performance, but inefficiencies in the MoE layer slowed actual training, so the expected gains were not realized [2].
- Cursor's solution was to rewrite the MoE training layer from scratch at the GPU kernel level, which eliminated the bottlenecks and let the Blackwell architecture run at its full potential [2][21].

Group 2: Technical Innovations
- Cursor designed a data-flow pipeline around TMEM's new features to avoid unnecessary register-movement overhead, fusing the quantization and dequantization logic into the kernel's compute path to significantly reduce memory-bandwidth usage [3][9].
- The MXFP8 quantization method scales data in small blocks, preserving precision while still benefiting from low-precision computation (a simplified sketch of the block-scaling idea follows the groups below) [11][24].

Group 3: Performance Metrics
- The MoE layer achieved a 3.5x speedup in both forward and backward passes, and end-to-end training on Blackwell became 1.5x faster, for a total acceleration of roughly 2x over the original Hopper setup [2].
- FP8 Tensor Core throughput on Blackwell reaches 4,500 TFLOP/s, while FP32 CUDA Core throughput is 80 TFLOP/s, a gap of more than 50x that makes any work left to the CUDA cores, such as quantization, a likely bottleneck [16].

Group 4: Optimization Strategies
- Cursor implemented a complex data pipeline using techniques such as warp specialization and the 2-CTA (Cooperative Thread Array) mode, enabling efficient parallel processing and reduced memory traffic for a 15-20% performance improvement [22][23].
- The custom MXFP8 quantization kernel developed by Cursor sustains more than 6.2 TB/s of memory bandwidth, outperforming existing open-source tools [24][26].

Group 5: Training Efficiency
- The training loss curves for MXFP8 and BF16 are nearly indistinguishable, indicating that the speedups did not compromise accuracy [27][30].
- Quantization itself was identified as a major performance killer: the overhead of quantizing and dequantizing data consumed a large share of the computation time, which is what motivated fusing these steps into the kernels (a rough back-of-the-envelope model follows below) [17][18].
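For readers unfamiliar with the layer Cursor rewrote, the sketch below shows in plain NumPy roughly what an MoE feed-forward layer computes: a router sends each token to a few experts, and each chosen expert applies its own small feed-forward network. This is a generic illustration with assumed shapes and a ReLU expert, not Cursor's implementation.

```python
# Minimal MoE forward pass in NumPy, for illustration only.
# It is NOT Cursor's kernel; it just shows the routing + per-expert
# matmul structure that a fused GPU implementation has to optimize.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, d_model, d_ff = 16, 64, 256
num_experts, top_k = 8, 2          # each token is routed to 2 of 8 experts

x = rng.standard_normal((num_tokens, d_model)).astype(np.float32)
router_w = rng.standard_normal((d_model, num_experts)).astype(np.float32)
w_in  = rng.standard_normal((num_experts, d_model, d_ff)).astype(np.float32)
w_out = rng.standard_normal((num_experts, d_ff, d_model)).astype(np.float32)

# Router: softmax over experts, keep the top-k experts per token.
logits = x @ router_w
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]          # (tokens, top_k)

out = np.zeros_like(x)
for e in range(num_experts):
    # Which tokens chose expert e, and with what gate weight.
    mask = (topk_idx == e).any(axis=-1)
    if not mask.any():
        continue
    gate = probs[mask, e:e + 1]
    h = np.maximum(x[mask] @ w_in[e], 0.0)                  # expert FFN (ReLU)
    out[mask] += gate * (h @ w_out[e])
```

On a GPU these per-expert matrix multiplications become many comparatively small GEMMs, and it is this layer, together with the surrounding quantize/dequantize steps, that the custom Blackwell kernels target.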
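The summary does not spell out the MXFP8 format, but the general microscaling idea is to store values in FP8 (E4M3) while each small block of consecutive elements (32 in the OCP MX specification) shares one power-of-two scale, stored as an 8-bit exponent in the real format. The NumPy sketch below illustrates only that block-scaling idea; the scale-selection rule and rounding shortcut are simplifications, not the exact MX specification or Cursor's kernels.

```python
# Illustrative MXFP8-style block quantization in NumPy (not Cursor's kernel).
# Each block of 32 consecutive values shares one power-of-two scale, and the
# scaled values are rounded to an E4M3-like precision (4 significand bits,
# max finite value 448). Block size and rounding details are assumptions.
import numpy as np

BLOCK = 32            # assumed block size
E4M3_MAX = 448.0      # largest finite E4M3 value

def round_to_e4m3(v: np.ndarray) -> np.ndarray:
    """Crude E4M3 rounding: keep 4 significand bits, clip to the finite range."""
    m, e = np.frexp(v)                      # v = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0           # 4 significand bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def quantize_mxfp8(x: np.ndarray):
    x = x.reshape(-1, BLOCK).astype(np.float32)
    amax = np.abs(x).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    # Power-of-two scale that maps each block's max near the top of FP8 range.
    scale = 2.0 ** np.floor(np.log2(E4M3_MAX / amax))
    return round_to_e4m3(x * scale), scale

def dequantize_mxfp8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q / scale

x = np.random.default_rng(0).standard_normal(4 * BLOCK).astype(np.float32)
q, s = quantize_mxfp8(x)
err = np.abs(dequantize_mxfp8(q, s).ravel() - x)
print(f"max error / max |x|: {err.max() / np.abs(x).max():.4f}")
```

Because the scale is chosen per 32-element block rather than per tensor, an outlier in one block does not destroy the precision of the rest, which is the property that lets the loss curves stay close to BF16 while the matmuls run on FP8 Tensor Cores.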
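The claim that quantization overhead consumes a large share of compute time can be made concrete with a rough model: a standalone quantization kernel is memory-bandwidth bound, so its time scales with the tensor size in bytes, while the GEMM time scales with FLOPs. In the sketch below only the 4,500 TFLOP/s and 6.2 TB/s figures come from the summary; the GEMM shapes and the 3-bytes-per-element traffic model are illustrative assumptions.

```python
# Rough, illustrative estimate of quantization overhead for one GEMM,
# treating the quantization step as memory-bandwidth bound. Shapes and the
# traffic model are assumptions, not measurements; only the 4,500 TFLOP/s
# and 6.2 TB/s figures come from the article.
FP8_TFLOPS = 4500.0        # Blackwell FP8 Tensor Core throughput (TFLOP/s)
QUANT_BW_TBS = 6.2         # sustained bandwidth of the quantization kernel (TB/s)

def overhead_ratio(m: int, n: int, k: int) -> float:
    gemm_time = (2 * m * n * k) / (FP8_TFLOPS * 1e12)
    # Quantize both GEMM inputs: read BF16 (2 B), write FP8 (1 B) per element.
    quant_bytes = 3 * (m * k + k * n)
    quant_time = quant_bytes / (QUANT_BW_TBS * 1e12)
    return quant_time / gemm_time

for shape in [(8192, 8192, 8192), (4096, 1024, 2048)]:
    print(shape, f"quantization ~{overhead_ratio(*shape):.0%} of GEMM time")
```

Under these assumptions quantization is a modest tax on a very large GEMM but can exceed the matmul time for the smaller per-expert GEMMs typical of MoE layers, which is why fusing quantization and dequantization into the kernels pays off.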