MoGE Architecture

One card, two jobs: Huawei sets out to squeeze every drop of compute
虎嗅APP · 2025-06-05 14:24
Core Viewpoint
- The article discusses advances in AI, focusing on Huawei's innovations in the MoE (Mixture of Experts) architecture and its RL (Reinforcement Learning) post-training techniques, which aim to raise the efficiency and performance of large language models (LLMs) in a competitive AI landscape [1][3].

Group 1: MoE Architecture and Huawei's Innovations
- The MoE model, originally proposed by Canadian scholars, has evolved significantly; Huawei's MoGE architecture addresses inefficiencies in the traditional MoE model, cutting costs and improving both training and deployment (a group-balanced routing sketch follows this digest) [1].
- Huawei emphasizes building a collaborative ecosystem to foster the growth of the Ascend ecosystem in China [1].

Group 2: RL Post-Training Techniques
- RL post-training has emerged as a critical path to stronger LLM performance; models such as OpenAI's o1 and DeepSeek-R1 use it to improve reasoning on complex tasks [3][5].
- The RL post-training phase currently consumes 20% of total computational resources, projected to rise to 50%, with a significant impact on model performance and cost [3].

Group 3: Challenges in RL Post-Training
- Traditional on-policy algorithms alternate training and inference on the same hardware, creating a "computational black hole" of idle resources [6][7].
- Task scheduling in large-scale clusters, complicated by the mix of parallelism strategies in use, makes efficient resource utilization a significant challenge [8].

Group 4: Innovations in Resource Utilization
- Huawei's RL Fusion technology allows a single card to handle both training and inference tasks, effectively doubling resource utilization and throughput (see the time-sharing sketch below) [9][10].
- The StaleSync mechanism enables near-asynchronous execution of tasks, achieving over 90% horizontal scaling efficiency across CloudMatrix 384 super nodes (see the staleness-bound sketch below) [16][20].

Group 5: Performance Metrics and Results
- Combining RL Fusion and StaleSync raises single-node throughput by 78.5% and delivers an overall performance gain of 1.5x [30][31].
- With StaleSync, cluster throughput scales linearly from 35k tokens/s to 127k tokens/s as super nodes are added, demonstrating its effectiveness for scaling [32].

Group 6: Conclusion
- Huawei's advances in RL post-training represent a significant leap in AI efficiency, positioning the company as a key player in the next generation of AI technology [33].
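The digest does not spell out how MoGE fixes MoE's imbalance, but the published idea is to partition experts into groups and have every token activate an equal number of experts per group, so per-device load is balanced by construction. Below is a minimal Python (PyTorch) sketch of that grouped top-k routing; the function name, scoring, and softmax normalization are illustrative assumptions, not Huawei's implementation.

```python
import torch

def grouped_topk_route(scores: torch.Tensor, num_groups: int, k_per_group: int):
    """scores: [tokens, experts]; experts are split evenly into num_groups.
    Each token picks exactly k_per_group experts inside every group, so each
    group (e.g., one device) receives the same number of activations."""
    t, e = scores.shape
    grouped = scores.view(t, num_groups, e // num_groups)
    topv, topi = grouped.topk(k_per_group, dim=-1)            # top-k per group
    offsets = torch.arange(num_groups) * (e // num_groups)
    expert_ids = topi + offsets.view(1, num_groups, 1)        # back to global ids
    weights = torch.softmax(topv.flatten(1), dim=-1)          # combine weights
    return expert_ids.flatten(1), weights

# 16 experts in 4 groups, 2 experts activated per group -> 8 per token, balanced.
ids, w = grouped_topk_route(torch.randn(4, 16), num_groups=4, k_per_group=2)
```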
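One simplified realization of the RL Fusion idea, letting a single card serve both roles, is time-sharing: the same weights on the same device alternate between rollout generation (inference) and policy updates (training), so the card is never parked while the other phase runs elsewhere. The sketch below is a hypothetical Python/PyTorch illustration; `TimeShareWorker` and the MSE stand-in loss are assumptions, not Huawei's RL Fusion API.

```python
import torch
import torch.nn as nn

class TimeShareWorker:
    """One device alternates between rollout (inference) and update (training)."""
    def __init__(self, model: nn.Module, lr: float = 1e-3):
        self.model = model
        self.opt = torch.optim.SGD(model.parameters(), lr=lr)

    @torch.no_grad()
    def rollout(self, prompts: torch.Tensor) -> torch.Tensor:
        self.model.eval()            # inference slice: no grads
        return self.model(prompts)

    def update(self, batch: torch.Tensor, targets: torch.Tensor) -> float:
        self.model.train()           # training slice reuses the same weights/card
        loss = nn.functional.mse_loss(self.model(batch), targets)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()

worker = TimeShareWorker(nn.Linear(8, 8))
for _ in range(3):
    prompts = torch.randn(4, 8)
    experience = worker.rollout(prompts)   # generate experience on-card
    worker.update(prompts, experience + 0.1 * torch.randn_like(experience))
```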
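The staleness bound behind StaleSync's near-asynchronous execution resembles the classic stale-synchronous parallel pattern: a fast worker may run ahead of the slowest one by at most a fixed number of steps before blocking, trading a little gradient freshness for much less idle time. Here is a minimal threading sketch of that bound; `StaleSyncBarrier` and its parameters are illustrative assumptions, not Huawei's mechanism.

```python
import threading

class StaleSyncBarrier:
    """A worker may run ahead of the slowest worker by at most
    `max_staleness` steps before it must wait."""
    def __init__(self, num_workers: int, max_staleness: int = 1):
        self.clocks = [0] * num_workers
        self.max_staleness = max_staleness
        self.cond = threading.Condition()

    def advance(self, worker_id: int) -> None:
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # Block only while this worker is too far ahead of the slowest.
            while self.clocks[worker_id] - min(self.clocks) > self.max_staleness:
                self.cond.wait()

barrier = StaleSyncBarrier(num_workers=2, max_staleness=1)

def worker(wid: int) -> None:
    for _ in range(5):
        barrier.advance(wid)   # a compute step would run between advances

threads = [threading.Thread(target=worker, args=(w,)) for w in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```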
Overhauling large-model training: Huawei lands an Ascend + Kunpeng one-two punch
虎嗅APP · 2025-06-04 10:35
Core Viewpoint
- The article discusses Huawei's advances in AI training, particularly its optimization of the Mixture of Experts (MoE) model architecture, which improves efficiency and reduces the cost of AI model training [1][34].

Group 1: MoE Model and Its Challenges
- The MoE model has become a preferred path for tech giants building stronger AI systems; its architecture addresses the computational bottlenecks of large-scale model training [2].
- Huawei identifies two main obstacles to single-node training efficiency: low operator computation efficiency and insufficient NPU memory [6][7].

Group 2: Enhancements in Training Efficiency
- Collaboration between Ascend and Kunpeng has significantly improved training operator efficiency and memory utilization, achieving a 20% increase in throughput and a 70% reduction in memory usage [3][18].
- Three optimization strategies target core MoE operators: a "Slimming Technique" for FlashAttention, a "Balancing Technique" for MatMul, and a "Transport Technique" for Vector operators, together lifting overall training throughput by 15% [9][10][13].

Group 3: Operator Dispatch Optimization
- Huawei's optimizations reduce operator-dispatch waiting time to nearly zero, keeping the compute units fed (see the dispatch-overlap sketch below) [19][25].
- The Selective R/S memory optimization technique cuts activation memory during training by 70%, showcasing Huawei's approach to memory management (see the checkpointing sketch below) [26][34].

Group 4: Industry Implications
- These advances not only clear obstacles to large-scale MoE model training but also offer the industry a valuable reference path, reflecting Huawei's deep technical accumulation in AI computing [34].
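The "near-zero dispatch wait" claim boils down to a producer-consumer overlap: the host (Kunpeng CPU) prepares and enqueues operators ahead of time so the device (Ascend NPU) never stalls waiting for a launch. The toy Python sketch below illustrates only that overlap idea with threads and a bounded queue; it is an analogy, not Huawei's scheduler.

```python
import queue
import threading
import time

def host_dispatch(ops, q: queue.Queue) -> None:
    """Host thread prepares and enqueues operators ahead of execution."""
    for op in ops:
        time.sleep(0.001)        # stand-in for host-side launch overhead
        q.put(op)
    q.put(None)                  # sentinel: dispatch stream finished

def device_execute(q: queue.Queue) -> list:
    """'Device' drains a pre-filled queue, so it rarely waits on dispatch."""
    results = []
    while (op := q.get()) is not None:
        results.append(op())
    return results

ops = [lambda i=i: i * i for i in range(8)]
q = queue.Queue(maxsize=4)       # bounded lookahead window for dispatch
t = threading.Thread(target=host_dispatch, args=(ops, q))
t.start()
print(device_execute(q))
t.join()
```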
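The article does not detail Selective R/S (recompute/swap), but the recompute half is commonly realized as selective activation checkpointing: layers whose activations are cheap to recompute but expensive to store drop them in the forward pass and rebuild them during backward. A minimal PyTorch sketch follows; the alternating checkpoint policy and `SelectiveBlock` name are illustrative assumptions, not Huawei's algorithm.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveBlock(nn.Module):
    """FFN block that optionally drops its activations and recomputes them
    in the backward pass, trading extra compute for memory."""
    def __init__(self, dim: int, recompute: bool):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.recompute = recompute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.recompute and self.training:
            return checkpoint(self.ffn, x, use_reentrant=False)
        return self.ffn(x)

# Illustrative policy: checkpoint every other block.
model = nn.Sequential(*[SelectiveBlock(64, recompute=(i % 2 == 0))
                        for i in range(4)])
x = torch.randn(2, 64, requires_grad=True)
model(x).sum().backward()   # checkpointed blocks recompute activations here
```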