Huawei Reveals: World-Class Large Model Trained on Domestic Ascend Hardware
Guan Cha Zhe Wang · 2025-05-30 08:35
Core Insights
- Huawei has launched Pangu Ultra MoE, a model with 718 billion parameters, marking a significant advance in MoE model training on the Ascend AI computing platform [1][3]
- The Pangu team innovated in both model architecture and training methods to keep training of this ultra-large, highly sparse MoE model stable, overcoming the instabilities that typically plague such training runs [1][2]
- The release of the Pangu Ultra MoE and Pangu Pro MoE series demonstrates Huawei's ability to run a fully autonomous training pipeline on domestic computing power and domestic models, reinforcing the innovation capacity of China's AI infrastructure [3]

Model Architecture
- The Pangu team introduced the Depth-Scaled Sandwich-Norm (DSSN) stable architecture and the TinyInit initialization method, enabling stable long-run training on more than 18TB of data on the Ascend platform (an illustrative sandwich-norm sketch follows this summary) [1]
- An EP loss load-balancing optimization was developed to keep load balanced across experts while enhancing their specialization (a generic load-balancing sketch is given below) [1]
- Pangu Ultra MoE adopts the advanced MLA and MTP architectures and uses a Dropless training strategy in both pre-training and post-training to balance model performance and efficiency [1]

Training Methods
- Huawei's team disclosed the key technologies that efficiently integrate a reinforcement learning (RL) post-training framework for large sparse MoE models on Ascend CloudMatrix 384 supernodes, marking the move of RL post-training into supernode cluster training [2]
- Recent upgrades to the pre-training system raised Model FLOPs Utilization (MFU) on a 10,000-card cluster from 30% to 41% (see the MFU note below) [2]
- The recently released Pangu Pro MoE model, with 72 billion total parameters and 16 billion activated parameters, uses dynamic expert activation to deliver performance that rivals models with over 100 billion parameters [2]
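The article names DSSN and TinyInit but does not give their formulas. As a rough illustration of the sandwich-norm idea (normalization both before and after a sublayer, with a depth-dependent damping of the residual update) plus a small-variance initialization, here is a minimal PyTorch sketch; the `depth_scale` formula and the 0.006 init standard deviation are illustrative assumptions, not Huawei's published recipe.

```python
import math

import torch
import torch.nn as nn


class SandwichNormBlock(nn.Module):
    """Illustrative sandwich-norm attention sublayer (not the published DSSN).

    Assumed reading: LayerNorm before the sublayer and again after it, with
    the post-norm output damped by a factor that shrinks with layer depth,
    limiting activation growth in very deep stacks.
    """

    def __init__(self, d_model: int, n_heads: int, layer_idx: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.post_norm = nn.LayerNorm(d_model)
        # Hypothetical depth-dependent scale; the real DSSN scaling is not
        # disclosed in the article.
        self.depth_scale = 1.0 / math.sqrt(2.0 * (layer_idx + 1))
        # TinyInit-style small-variance init (illustrative std value):
        # keeps early residual updates close to the identity mapping.
        for p in self.attn.parameters():
            if p.dim() > 1:
                nn.init.normal_(p, mean=0.0, std=0.006)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.depth_scale * self.post_norm(h)
```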
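The exact EP loss is likewise not specified here. The sketch below shows the standard auxiliary load-balancing loss used by many MoE systems (fraction of tokens routed to each expert multiplied by the mean router probability for that expert), which illustrates what keeping expert loads balanced means in practice; it is a generic baseline, not the Pangu formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Generic top-k MoE router with a standard auxiliary balance loss.

    A common GShard/Switch-style baseline, shown only to illustrate expert
    load balancing; it is not the EP loss used for Pangu Ultra MoE, whose
    formulation is not given in the article.
    """

    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)          # (tokens, experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)

        # Hard assignment counts: each token contributes to k experts.
        dispatch = F.one_hot(topk_idx, self.n_experts).float().sum(dim=1)
        load_fraction = dispatch.mean(dim=0)             # f_i, sums to k
        prob_fraction = probs.mean(dim=0)                # P_i, sums to 1

        # Auxiliary loss n_experts * sum_i f_i * P_i is minimized when
        # tokens are spread uniformly across experts.
        aux_loss = self.n_experts * torch.sum(load_fraction * prob_fraction)
        return topk_idx, topk_probs, aux_loss
```

In a full MoE layer this auxiliary term would typically be added to the language-modeling loss with a small coefficient so that balance pressure does not dominate the main objective.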
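MFU (Model FLOPs Utilization) is the share of the cluster's theoretical peak FLOP/s that the training run actually spends on model computation. A back-of-the-envelope sketch, assuming the common estimate of roughly 6 FLOPs per activated parameter per token for a combined forward and backward pass; the function and its inputs are placeholders, not measured Ascend figures.

```python
def model_flops_utilization(active_params: float,
                            tokens_per_second: float,
                            n_chips: int,
                            peak_flops_per_chip: float) -> float:
    """Rough MFU estimate: achieved training FLOP/s over theoretical peak.

    Assumes ~6 FLOPs per activated parameter per token for forward plus
    backward; for an MoE model, active_params counts only the parameters
    activated per token, not the full 718B.
    """
    achieved_flops_per_s = 6.0 * active_params * tokens_per_second
    peak_flops_per_s = n_chips * peak_flops_per_chip
    return achieved_flops_per_s / peak_flops_per_s


# With the cluster's peak fixed, raising MFU from 0.30 to 0.41 means
# roughly 0.41 / 0.30 ≈ 1.37x more useful compute per second.
```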