Core Viewpoint
- Huawei has achieved significant advances in large-model training through its "Ascend + Pangu Ultra MoE" combination, enabling a fully self-controlled training process without GPUs and demonstrating industry-leading cluster training performance [2][3].

Group 1: Technical Innovations
- Huawei's training system markedly improves training efficiency, reaching 41% Model FLOPs Utilization (MFU) in pre-training and a post-training throughput of 35K tokens/s on the CloudMatrix 384 super node [3][34].
- The company introduced a series of innovations to address challenges in MoE pre-training and reinforcement learning (RL) post-training, including intelligent parallel-strategy selection and global dynamic load balancing [11][17].
- The training system uses a hierarchical All-to-All communication architecture that reduces exposed communication overhead to nearly zero, improving the efficiency of expert-parallel communication [14][15].

Group 2: Training Process Optimization
- Cluster utilization was optimized through a simulation-driven intelligent parallel optimization framework that automates selection of the optimal deployment configuration [12][13].
- The team implemented a memory optimization framework that achieves over 70% savings in activation memory, keeping long-duration training reliable even under increased memory pressure [25].
- RL Fusion technology enables flexible deployment modes, significantly improving resource scheduling during the inference phase and doubling utilization in RL post-training [27][28].

Group 3: Model Specifications
- The Pangu Ultra MoE model has 718 billion parameters in a 61-layer Transformer architecture designed for high sparsity and high performance [32].
- The model was trained on a cluster of 6K - 10K Ascend 800T A2 cards, achieving high model FLOPs utilization during the pre-training phase [32].
- The architecture supports efficient scaling to larger parameter counts and larger clusters, with an MFU above 50% expected in future iterations [32].
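The 41% MFU figure can be made concrete with a back-of-envelope calculation. Below is a minimal sketch assuming the standard definition MFU = achieved model FLOPs / aggregate peak hardware FLOPs; all numeric values (active parameter count, device count, per-device peak) are illustrative assumptions, not figures from the article.

```python
def mfu(tokens_per_sec: float,
        flops_per_token: float,
        num_devices: int,
        peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs over aggregate peak."""
    achieved = tokens_per_sec * flops_per_token
    peak = num_devices * peak_flops_per_device
    return achieved / peak

# Hypothetical numbers: ~6 FLOPs per active parameter per token (a common
# rule of thumb for training), 39e9 assumed active parameters for a sparse
# MoE model, and 2048 devices at an assumed 280 TFLOPs peak each.
flops_per_token = 6 * 39e9
print(f"MFU = {mfu(1.0e6, flops_per_token, 2048, 280e12):.3f}")
```

Note that for a MoE model only the *active* parameters per token enter the FLOPs estimate, which is why a 718B-parameter model can be trained far more cheaply per token than a dense model of the same size.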
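The benefit of a hierarchical All-to-All can be seen by counting inter-node messages. The sketch below assumes a common two-level scheme in which ranks first swap payloads over fast intra-node links so that each directed node pair then exchanges one bundled message; the topology numbers are illustrative assumptions, not the actual CloudMatrix 384 fabric parameters.

```python
def flat_inter_node_messages(num_nodes: int, ranks_per_node: int) -> int:
    """Naive all-to-all: every rank sends to every rank on another node."""
    total_ranks = num_nodes * ranks_per_node
    remote_ranks = (num_nodes - 1) * ranks_per_node
    return total_ranks * remote_ranks

def hierarchical_inter_node_messages(num_nodes: int, ranks_per_node: int) -> int:
    """Two-level scheme: payloads are aggregated intra-node first, so each
    directed node pair needs only one bundled inter-node message."""
    return num_nodes * (num_nodes - 1)

# Illustrative 48-node x 8-rank layout (384 ranks total, an assumed mapping):
print(flat_inter_node_messages(48, 8))          # many small cross-node sends
print(hierarchical_inter_node_messages(48, 8))  # few large bundled sends
```

Fewer, larger inter-node messages amortize link latency and keep the slow tier of the network busy with bulk transfers, which is the usual rationale for hierarchical collectives in expert-parallel MoE training.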
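The "over 70% activation-memory savings" claim is consistent with classic activation recomputation (gradient checkpointing). The sketch below assumes a uniform per-layer activation cost and simple segment-level recomputation; the segment count is an assumed choice, and Huawei's actual fine-grained strategy is not described by this model.

```python
import math

def activation_units(layers: int, segments=None) -> int:
    """Activation memory in per-layer units.
    Without checkpointing, every layer's activations are kept for backward.
    With checkpointing, only segment-boundary activations are kept, plus one
    segment's activations recomputed on the fly during the backward pass."""
    if segments is None:
        return layers
    return segments + math.ceil(layers / segments)

full = activation_units(61)              # 61 Transformer layers, as in the article
ckpt = activation_units(61, segments=8)  # 8 segments is an assumed choice
print(f"savings: {1 - ckpt / full:.0%}")
```

With these assumed numbers the scheme keeps 16 layer-units of activations instead of 61, a saving of roughly 74%, in line with the "over 70%" figure.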
Huawei's AI strength: no GPUs needed, and the large model masters a college-math problem every 2 seconds!
第一财经 (Yicai) · 2025-05-30 09:32