Core Viewpoint
- Huawei sets the benchmark for domestic large-model training through technological innovation, achieving breakthroughs in compute utilization and post-training throughput [1][4].

Group 1: Technological Innovations
- Huawei's "Ascend + Pangu Ultra MoE" combination has closed the loop on a fully self-controlled training pipeline for domestic compute and domestic models, reaching industry-leading performance in cluster training systems [4][5].
- In the pre-training phase, model FLOPs utilization (MFU) on the Ascend Atlas 800T A2 cluster rose to 41%; in the post-training phase, a single CloudMatrix 384 super node reached a throughput of 35K tokens/s (illustrative sketches of both figures follow after this summary) [5][36].
- Huawei disclosed the key technologies in its technical report, highlighting the efficient integration of its sparse-MoE reinforcement-learning post-training framework [6][7].

Group 2: Challenges in Current Training Processes
- Six main challenges were identified in current MoE pre-training and reinforcement-learning post-training: difficult parallel-strategy configuration, communication bottlenecks, uneven system load distribution, excessive operator-scheduling overhead, complex training-process management, and limits on large-scale scale-out [10][11].

Group 3: Solutions to Enhance Training Efficiency
- Huawei proposed a complete end-to-end solution to these challenges, raising training-cluster utilization through intelligent parallel-strategy selection, deep fusion of computation and communication, and global dynamic load balancing (a generic load-balancing sketch appears below) [12][14].
- The first strategy optimizes the parallel configuration, arriving at a deployment of 16-way pipeline parallelism, 8-way tensor parallelism, and 32-way expert parallelism (see the device-count sketch below) [15][16].
- The second strategy releases compute at the single-node level, doubling the micro-batch size (MBS) and optimizing operator scheduling to fully exploit the capability of each Ascend node [20][21].

Group 4: Reinforcement Learning Innovations
- Huawei introduced RL Fusion, a training-inference co-card (device-sharing) technology that supports flexible deployment modes and doubles cluster utilization in post-training (sketched below) [28][29].
- A semi-asynchronous mechanism, StaleSync, lets different tasks execute in parallel while preserving model accuracy, lifting overall training throughput by 50% (sketched below) [30].

Group 5: Performance Metrics and Future Prospects
- The 718-billion-parameter Pangu Ultra MoE model sustained high performance during training, reaching 41% MFU and a post-training throughput of 35K tokens/s (a back-of-envelope check of this figure appears below) [35][36].
- The system is designed to support ultra-large-scale clusters and models, with future iterations expected to push utilization even higher [35][36].
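The 41% MFU figure is a standard efficiency metric: achieved model FLOPs divided by the cluster's theoretical peak. The sketch below shows the usual way this ratio is computed; every input number in it (device count, peak FLOPS, active parameters, tokens per second) is an illustrative assumption, not a figure from Huawei's report, and only the ~41% target comes from the article.

```python
# Illustrative sketch of the MFU (model FLOPs utilization) metric cited in the article.
# Every input below is a placeholder assumption; only the ~41% target is from the report.

def train_flops_per_token(active_params: float) -> float:
    # Common approximation: ~6 FLOPs per parameter per token for forward + backward.
    # For an MoE model only the activated parameters count, not the full 718B.
    return 6.0 * active_params

def mfu(tokens_per_sec: float, active_params: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    # MFU = achieved model FLOPs per second / aggregate peak FLOPs of the cluster.
    achieved = tokens_per_sec * train_flops_per_token(active_params)
    peak = num_devices * peak_flops_per_device
    return achieved / peak

# Placeholder cluster: 8192 devices at 400 TFLOPS peak, ~39B active parameters per
# token, ~5.74M training tokens/s. With these made-up inputs the ratio lands near 41%.
print(f"MFU ≈ {mfu(5.74e6, 39e9, 8192, 400e12):.2%}")
```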
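The reported deployment of 16-way pipeline, 8-way tensor, and 32-way expert parallelism implies a specific mapping of model shards onto devices. The sketch below shows how such a layout is commonly accounted for; the data-parallel degree, total device count, and expert count are assumptions for illustration (the article does not give them), and the grouping convention follows generic Megatron-style practice rather than Huawei's disclosed implementation.

```python
# Sketch of the reported parallel layout (16-way pipeline, 8-way tensor, 32-way expert).
# Data-parallel degree and expert count are assumptions; only PP/TP/EP are from the article.

from dataclasses import dataclass

@dataclass
class ParallelConfig:
    pipeline: int = 16      # reported: 16-way pipeline parallelism
    tensor: int = 8         # reported: 8-way tensor parallelism
    expert: int = 32        # reported: 32-way expert parallelism
    data: int = 32          # assumption: data-parallel degree (not reported)
    num_experts: int = 256  # assumption: routed experts per MoE layer (not reported)

    def total_devices(self) -> int:
        # In Megatron-style layouts the world size is PP x TP x DP;
        # expert parallelism re-partitions the MoE layers across a subgroup of ranks.
        return self.pipeline * self.tensor * self.data

    def experts_per_rank(self) -> int:
        assert self.num_experts % self.expert == 0, "experts must split evenly"
        return self.num_experts // self.expert

cfg = ParallelConfig()
print(cfg.total_devices())     # 16 * 8 * 32 = 4096 devices under these assumptions
print(cfg.experts_per_rank())  # 256 / 32 = 8 experts held by each expert-parallel rank
```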
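"Global dynamic load balancing" for MoE training generally means re-placing or re-routing experts so that heavily used ("hot") experts do not pile up on a few ranks. The greedy heuristic below is only a generic illustration of that idea, with made-up per-expert loads; it is not the algorithm from Huawei's report.

```python
# Generic illustration of expert load balancing: assign the heaviest experts first to
# the currently least-loaded device (longest-processing-time greedy placement).
# Loads are hypothetical token counts; this is not Huawei's disclosed algorithm.

from heapq import heappush, heappop

def rebalance(expert_loads: dict, num_devices: int) -> dict:
    heap = [(0.0, d) for d in range(num_devices)]      # (accumulated load, device id)
    placement = {d: [] for d in range(num_devices)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, device = heappop(heap)                  # least-loaded device so far
        placement[device].append(expert)
        heappush(heap, (total + load, device))
    return placement

# Hypothetical per-expert token counts observed over a window of training steps.
loads = {"e0": 9.0, "e1": 1.0, "e2": 7.0, "e3": 3.0, "e4": 5.0, "e5": 5.0}
print(rebalance(loads, num_devices=3))  # each device ends up with an equal load of 10
```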
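RL Fusion is described as training-inference "co-card" deployment: rollout generation and policy updates time-share the same devices rather than holding two separate pools, which is where the doubled post-training utilization comes from. The loop below is a conceptual sketch of that time-sharing pattern; every class and method name is hypothetical and not Huawei's API.

```python
# Conceptual sketch of a train/infer co-card loop in the spirit of RL Fusion:
# the same device group alternates between rollout generation and policy training.
# All names are hypothetical illustrations, not Huawei's interfaces.

class CoLocatedWorker:
    """One device group that can switch between an inference engine and a trainer."""

    def __init__(self, policy_weights):
        self.weights = policy_weights

    def switch_to_inference(self):
        # Hypothetical: release optimizer/activation memory, load serving kernels.
        print("mode -> inference")

    def switch_to_training(self):
        # Hypothetical: restore optimizer state, re-enable gradient buffers.
        print("mode -> training")

    def generate_rollouts(self, prompts):
        return [f"rollout({p})" for p in prompts]

    def train_step(self, rollouts):
        print(f"updating policy on {len(rollouts)} rollouts")

def co_card_loop(worker, prompt_batches):
    for prompts in prompt_batches:
        worker.switch_to_inference()
        rollouts = worker.generate_rollouts(prompts)  # the same cards do generation...
        worker.switch_to_training()
        worker.train_step(rollouts)                   # ...then immediately train, so
                                                      # neither role leaves devices idle.

co_card_loop(CoLocatedWorker(policy_weights=None), [["q1", "q2"], ["q3", "q4"]])
```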
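StaleSync is described as semi-asynchronous: different stages of the RL pipeline run in parallel, with data allowed to be slightly stale as long as model accuracy is preserved. The sketch below shows only the bounded-staleness admission rule such a scheme needs; the cap of one policy version and all names are assumptions for illustration, not Huawei's parameters.

```python
# Conceptual sketch of a bounded-staleness ("semi-asynchronous") rule in the spirit of
# StaleSync: rollouts may come from slightly stale policy weights so that generation
# and training overlap, but the lag is capped. The cap and names are assumptions.

MAX_STALENESS = 1  # assumption: data may come from a policy at most 1 version old

def usable(sample_version: int, trainer_version: int) -> bool:
    """A rollout batch is consumed only if it is not too stale."""
    return trainer_version - sample_version <= MAX_STALENESS

def staleness_aware_training(rollout_stream):
    trainer_version = 0
    for sample_version, batch in rollout_stream:
        if usable(sample_version, trainer_version):
            trainer_version += 1
            print(f"step {trainer_version}: trained on {batch} "
                  f"(sampled at v{sample_version})")
        else:
            print(f"skipped {batch}: too stale (v{sample_version} vs v{trainer_version})")

# Rollouts arrive tagged with the policy version that generated them; versions may lag
# because generation runs concurrently with training instead of waiting for each step.
staleness_aware_training([(0, "b0"), (0, "b1"), (1, "b2"), (0, "b3"), (3, "b4")])
```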
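The headline's "one advanced-math problem every 2 seconds" and the reported 35K tokens/s post-training throughput are consistent under a simple back-of-envelope assumption about how long a worked solution is; the tokens-per-problem figure below is that assumption, not a number from the article.

```python
# Back-of-envelope check linking the headline claim to the reported throughput.
throughput_tokens_per_s = 35_000   # reported post-training throughput per super node
seconds_per_problem = 2            # headline claim

tokens_per_problem = throughput_tokens_per_s * seconds_per_problem
print(tokens_per_problem)  # 70,000 tokens of prompt + reasoning per problem, i.e. the
                           # headline assumes a very long worked solution per question.
```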
No GPUs needed: the large model digests one advanced-math exam problem every 2 seconds! This is Huawei's strength
Leiphone (雷峰网) · 2025-05-30 09:48