Mamba Authors' Team Proposes SonicMoE: A Token-Rounding Trick That Nearly Doubles MoE Training Speed
机器之心·2025-12-19 06:38

Core Insights

- The MoE (Mixture of Experts) architecture has become the standard way to scale language models without a proportional increase in compute, with a clear trend toward higher expert granularity and sparsity, both of which improve model quality per unit of FLOPs [1][2]

MoE Model Trends

- Recent open-source models such as DeepSeek V3, Kimi K2, and Qwen3 MoE adopt finer-grained expert designs and higher sparsity, greatly increasing total parameter count while keeping the number of active parameters roughly constant [1][2]
- A comparison table of recent models lists their total parameters, expert activation ratios, and expert granularities; Mixtral 8x22B, for example, is listed with 131 billion parameters and a 25% expert activation ratio [2]

Hardware Efficiency Challenges

- The pursuit of extreme granularity and sparsity in MoE designs creates serious hardware efficiency problems, which motivated SonicMoE, a solution tailored to NVIDIA Hopper and Blackwell GPUs [3]
- SonicMoE shows clear performance advantages, delivering a 43% speedup in the forward pass and up to 115% in the backward pass over existing baselines [3]

Memory and IO Bottlenecks

- In fine-grained MoE models, activation memory grows linearly with the number of activated experts, increasing memory pressure in both the forward and backward passes [4]
- Smaller, more dispersed experts have lower arithmetic intensity and trigger more frequent IO accesses, pushing training into a memory-bound regime [4] (a back-of-the-envelope arithmetic-intensity comparison appears at the end of this summary)

Efficient Algorithms

- SonicMoE introduces a method for computing routing gradients without caching activation values, cutting backward-pass memory usage by 45% for fine-grained models [4]
- The design overlaps computation with IO, effectively hiding the high IO latency associated with fine-grained MoE [4]

Token Rounding Strategy

- The token rounding method optimizes how tokens are distributed to experts, minimizing the compute wasted by tile quantization effects and improving training efficiency without compromising model quality [4][20][26] (see the tile-quantization sketch at the end of this summary)

Performance Metrics

- SonicMoE reaches a training throughput of 213 billion tokens per day on 64 H100 GPUs, comparable to the effective throughput of 96 H100 GPUs running ScatterMoE [6] (a short per-GPU arithmetic check appears at the end of this summary)
- Activation memory usage stays constant as expert granularity increases, with efficiency gains of 0.20x to 1.59x over existing baselines [9][15]

Open Source Contribution

- The team has open-sourced the relevant kernel code, giving the large-model community a solid tool for accelerating high-performance MoE training [7]
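To make the memory-bound claim in the Memory and IO Bottlenecks section concrete, the sketch below compares the arithmetic intensity (FLOPs per byte of data moved) of a single expert GEMM at coarse versus fine granularity. All dimensions, token counts, and the bf16 assumption are illustrative choices for this summary, not figures from the SonicMoE paper.

```python
# Back-of-the-envelope arithmetic intensity for one expert GEMM:
# (tokens x d_model) @ (d_model x d_ff) in bf16 (2 bytes per element).
# All dimensions below are illustrative assumptions, not SonicMoE's settings.

def arithmetic_intensity(tokens: int, d_model: int, d_ff: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * tokens * d_model * d_ff            # multiply-adds of the GEMM
    bytes_moved = bytes_per_elem * (
        tokens * d_model      # read input activations
        + d_model * d_ff      # read expert weights
        + tokens * d_ff       # write output activations
    )
    return flops / bytes_moved

# Coarse-grained expert: many tokens routed to a single large expert.
coarse = arithmetic_intensity(tokens=8192, d_model=4096, d_ff=14336)

# Fine-grained expert: same d_model, a much smaller d_ff, and far fewer
# tokens per expert because routing is spread over many more experts.
fine = arithmetic_intensity(tokens=256, d_model=4096, d_ff=1792)

print(f"coarse-grained expert: {coarse:,.0f} FLOPs/byte")
print(f"fine-grained expert:   {fine:,.0f} FLOPs/byte")
# The fine-grained case moves far fewer FLOPs per byte, so its GEMMs are more
# likely to be limited by memory bandwidth than by tensor-core throughput.
```

Under these assumed shapes the coarse expert lands around two thousand FLOPs per byte while the fine-grained expert lands around two hundred, which is the qualitative gap the article attributes to fine-grained MoE's IO pressure.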
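The next sketch quantifies the tile quantization waste that the token rounding strategy targets: when each expert's token batch is padded up to a multiple of the GEMM tile size, a noticeable fraction of the computed rows is padding. The tile size, expert count, and per-expert token counts are illustrative assumptions, and the final rounding step only sketches the general idea of aligning counts to tile multiples; SonicMoE's actual procedure is described in the paper.

```python
import math

TILE_M = 128  # assumed tile height along the token dimension of the expert GEMM

def gemm_rows(counts, tile=TILE_M):
    """Rows actually computed when each expert's token batch is padded up
    to a multiple of the tile size (tile quantization)."""
    return sum(math.ceil(c / tile) * tile for c in counts)

# 64 fine-grained experts with uneven, non-tile-aligned token counts (illustrative):
counts = [520, 470, 390, 610] * 16
useful = sum(counts)
padded = gemm_rows(counts)
print(f"useful rows: {useful}, computed rows: {padded}, "
      f"padding waste: {(padded - useful) / padded:.0%}")

# If routing nudged each expert's count to the nearest tile multiple (the idea
# behind token rounding as summarized above; the exact SonicMoE procedure is
# in the paper), every computed tile would be full and padding waste would
# drop to zero:
rounded = [round(c / TILE_M) * TILE_M for c in counts]
print(f"rounded counts, computed rows: {gemm_rows(rounded)}, padding waste: 0%")
```

With these assumed counts roughly 14% of the GEMM rows are padding before rounding, which is the kind of quantization overhead the strategy is described as eliminating.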
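As a quick sanity check on the throughput claim, the arithmetic below converts the reported 213 billion tokens per day on 64 H100 GPUs into per-GPU figures and reads the "64 vs. 96 GPUs" comparison as roughly a 1.5x effective per-GPU advantage over the ScatterMoE setup; this interpretation assumes the two configurations reach equal total throughput.

```python
tokens_per_day = 213e9        # SonicMoE throughput reported on 64 H100 GPUs
gpus_sonic, gpus_scatter = 64, 96

per_gpu_per_day = tokens_per_day / gpus_sonic
per_gpu_per_sec = per_gpu_per_day / 86_400
print(f"{per_gpu_per_day:.2e} tokens/GPU/day ≈ {per_gpu_per_sec:,.0f} tokens/GPU/s")

# Matching the work of 96 ScatterMoE GPUs with 64 GPUs implies roughly a
# 96 / 64 = 1.5x effective per-GPU throughput advantage under this comparison.
print(f"effective per-GPU speedup vs. the ScatterMoE setup: {gpus_scatter / gpus_sonic:.2f}x")
```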