How Is Huawei's Near-Trillion-Parameter Model Trained?
Huxiu APP · 2025-05-30 10:18

Core Viewpoint
- The article discusses Huawei's advances in AI training systems, focusing on the MoE (Mixture of Experts) architecture and its optimization through the MoGE (Mixture of Grouped Experts) framework, which improves efficiency and reduces the cost of training large AI models [1][2].

Summary by Sections

Introduction to MoE and Huawei's Innovations
- The MoE model, originally proposed by Canadian scholars, has evolved significantly; Huawei is now optimizing the architecture to address its inefficiencies and cost issues [1].
- Huawei's MoGE architecture aims to create a more balanced and efficient training setup for AI models, contributing to the ongoing AI competition [1] (a minimal group-constrained routing sketch appears after this summary).

Performance Metrics and Achievements
- Huawei's training system, built on the "Ascend + Pangu Ultra MoE" combination, reports a 41% MFU (Model FLOPs Utilization) during pre-training and a throughput of 35K tokens/s during post-training on the CloudMatrix 384 super node [2][26][27] (a back-of-the-envelope MFU estimate follows the summary).

Challenges in MoE Training
- Six main challenges in MoE training are identified: difficulty in configuring parallel strategies, All-to-All communication bottlenecks, uneven load distribution across the system, excessive operator scheduling overhead, complex training process management, and limits on large-scale expansion [3][4] (a small simulation of routing-induced load imbalance follows the summary).

Solutions and Innovations
- First Strategy: Enhancing Training Cluster Utilization
  - Huawei implemented intelligent parallel strategy selection and global dynamic load balancing to raise overall training efficiency [6][11].
  - A modeling and simulation framework was developed to automate the selection of an optimal parallel configuration for the Pangu Ultra MoE model [7] (a toy strategy-search sketch follows the summary).
- Second Strategy: Releasing the Computing Power of Single Nodes
  - The focus shifted to operator computation efficiency, doubling the micro-batch size (MBS) and reducing host-bound overhead to below 2% [15][16][17].
- Third Strategy: High-Performance, Scalable RL Post-Training
  - RL Fusion technology allows flexible deployment modes and significantly improves resource utilization during post-training [19][21].
  - The system design delivers a 50% increase in overall training throughput while maintaining model accuracy [21].

Technical Specifications of Pangu Ultra MoE
- The Pangu Ultra MoE model has 718 billion parameters and a 61-layer Transformer architecture, achieving high performance and scalability [26].
- Training used a large-scale cluster of 6K-10K cards, demonstrating strong generalization and efficient scaling potential [26][27].
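The grouped-expert idea behind MoGE can be illustrated with a small routing sketch: experts are partitioned into equal-sized groups (for example, one group per device), and the router selects the same number of experts from every group, so each group receives an identical share of work for every token. The code below is a minimal sketch of group-constrained top-k routing in PyTorch, not Huawei's implementation; the expert count, group count, and top-k values are arbitrary assumptions.

```python
import torch

def grouped_topk_routing(logits, num_groups, k_per_group):
    """Group-constrained top-k routing sketch.

    logits: [num_tokens, num_experts] router scores; experts are assumed to be
    laid out contiguously by group (one group per device).  Selecting exactly
    k experts inside every group gives each group the same number of expert
    activations per token, which is the load-balancing idea behind
    grouped-expert routing.
    """
    num_tokens, num_experts = logits.shape
    experts_per_group = num_experts // num_groups
    # View scores as [tokens, groups, experts-in-group] and pick top-k per group.
    grouped = logits.view(num_tokens, num_groups, experts_per_group)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)
    # Convert in-group indices back to global expert ids.
    group_offset = torch.arange(num_groups).view(1, num_groups, 1) * experts_per_group
    expert_ids = (topk_idx + group_offset).flatten(1)       # [tokens, groups * k]
    weights = torch.softmax(topk_vals.flatten(1), dim=-1)   # normalized gate weights
    return expert_ids, weights

# Toy example: 8 tokens, 16 experts in 4 groups, 2 experts chosen per group.
logits = torch.randn(8, 16)
ids, w = grouped_topk_routing(logits, num_groups=4, k_per_group=2)
print(ids.shape, w.shape)  # torch.Size([8, 8]) torch.Size([8, 8])
```

Because every token activates the same number of experts in every group, mapping one group to one device keeps the per-device workload identical by construction, rather than relying only on auxiliary balancing losses.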
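MFU (Model FLOPs Utilization) compares the FLOPs a training run actually sustains against the cluster's theoretical peak. A common back-of-the-envelope estimate uses roughly 6 FLOPs per activated parameter per token for the forward and backward passes, as sketched below; every input value here is a placeholder (the article does not disclose activated parameters, card count, or per-card peak FLOPs), chosen only so the printed figure lands near the reported 41%.

```python
def estimate_mfu(tokens_per_sec, active_params, num_cards, peak_flops_per_card):
    """Rough MFU estimate: achieved training FLOPs / theoretical peak FLOPs.

    Uses the common ~6 * N FLOPs-per-token approximation (forward + backward)
    with N = parameters activated per token, which is the relevant count for
    an MoE model where only a fraction of the 718B total is active per token.
    """
    achieved = 6.0 * active_params * tokens_per_sec   # FLOPs/s actually delivered
    peak = num_cards * peak_flops_per_card            # FLOPs/s theoretically available
    return achieved / peak

# Placeholder inputs purely for illustration (not figures from the article):
mfu = estimate_mfu(
    tokens_per_sec=3.8e6,        # hypothetical cluster-wide pre-training throughput
    active_params=39e9,          # hypothetical activated parameters per token
    num_cards=6144,              # hypothetical card count in the 6K-10K range
    peak_flops_per_card=3.5e14,  # hypothetical per-card peak FLOPs
)
print(f"estimated MFU: {mfu:.1%}")  # ~41%, by construction of the placeholders
```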
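The All-to-All and load-balancing challenges are easiest to see with a toy simulation: if routing is skewed toward popular experts, the expert-parallel ranks that host those experts receive disproportionate All-to-All traffic and become stragglers. The sketch below uses an assumed Zipf-like popularity skew and an assumed even split of experts across ranks purely for illustration; it is not a model of Huawei's system.

```python
import random
from collections import Counter

def expert_load(num_tokens, num_experts, num_ranks, skew=1.5):
    """Simulate how skewed routing produces uneven All-to-All traffic.

    Each token picks one expert with Zipf-like popularity; experts are split
    evenly across expert-parallel ranks, so per-rank token counts expose the
    load imbalance that global dynamic load balancing tries to remove.
    """
    weights = [1.0 / (i + 1) ** skew for i in range(num_experts)]  # skewed popularity
    choices = random.choices(range(num_experts), weights=weights, k=num_tokens)
    per_rank = Counter(e * num_ranks // num_experts for e in choices)
    counts = [per_rank.get(r, 0) for r in range(num_ranks)]
    return counts, max(counts) / (sum(counts) / num_ranks)          # imbalance factor

random.seed(0)
counts, imbalance = expert_load(num_tokens=100_000, num_experts=256, num_ranks=8)
print(counts)
print(f"busiest rank receives {imbalance:.1f}x the average load")
```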
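The modeling and simulation framework for parallel strategy selection can be pictured as a search over candidate (DP, TP, PP, EP) layouts, scored by a cost model and filtered by a per-card memory budget. The sketch below shows only the shape of such a search; the toy cost and memory functions are crude stand-ins for the detailed compute, All-to-All, and pipeline-bubble modeling a real framework would need.

```python
from itertools import product

def search_parallel_strategy(total_cards, candidates, est_step_time, fits_memory):
    """Enumerate parallel layouts and keep the fastest one that fits in memory.

    candidates: iterable of (dp, tp, pp, ep) tuples.
    est_step_time / fits_memory: caller-supplied cost and memory models.
    """
    best, best_time = None, float("inf")
    for dp, tp, pp, ep in candidates:
        if dp * tp * pp != total_cards:   # layout must cover the cluster exactly
            continue
        if not fits_memory(dp, tp, pp, ep):
            continue
        t = est_step_time(dp, tp, pp, ep)
        if t < best_time:
            best, best_time = (dp, tp, pp, ep), t
    return best, best_time

# Toy cost/memory models, purely illustrative.
def toy_step_time(dp, tp, pp, ep):
    compute = 1.0 / (dp * tp * pp)        # more splitting -> less compute per card
    comm = 0.02 * (tp - 1) + 0.01 * ep    # penalties for TP traffic and MoE All-to-All
    bubble = 0.03 * (pp - 1)              # pipeline-bubble penalty
    return compute + comm + bubble

def toy_fits_memory(dp, tp, pp, ep):
    return (tp * pp) >= 8                 # crude stand-in for a per-card memory budget

grid = product([2, 4, 8, 16, 32, 64], repeat=3)        # (dp, tp, pp) choices
candidates = [(dp, tp, pp, 8) for dp, tp, pp in grid]  # fix EP=8 for the toy search
print(search_parallel_strategy(1024, candidates, toy_step_time, toy_fits_memory))
```

The value of automating this search grows with cluster size: at several thousand cards the candidate space is far too large to tune by hand, which is why the article emphasizes simulation-driven selection over manual configuration.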