Ascend 10,000-Card Cluster

Huawei's Ascend 10,000-Card Cluster Revealed: How to Tame the AI Compute "Beast"?
雷峰网 · 2025-06-09 13:37
Core Viewpoint
- The article discusses advances in AI computing clusters, focusing on Huawei's innovations for high availability, linear scalability, rapid recovery, and fault tolerance in large-scale AI model training and inference systems [3][25].

Group 1: High Availability of Super Nodes
- AI training and inference must run continuously, much like a hospital emergency room; each computer in the cluster has a backup ready to take over on failure, so tasks proceed uninterrupted [5][6].
- Huawei's CloudMatrix 384 super node employs a fault tolerance strategy spanning system-level, business-level, and operational-level fault management, converting faults into manageable issues [5][6].

Group 2: Linear Scalability
- The ideal for computing power is linear scalability: 100 computers should deliver 100 times the power of one. Huawei's task distribution algorithms keep the machines collaborating efficiently, so performance grows with machine count [8].
- Key technologies such as TACO, NSF, NB, and AICT improve the linearity of large-model training, achieving linearity rates of 96% and above across various configurations [8].

Group 3: Rapid Recovery of Training
- The system recovers quickly from failures during training by automatically saving progress, resuming from the last checkpoint rather than starting over [10][12].
- Innovations such as process-level rescheduling and online recovery have cut recovery times to under 3 minutes, and as low as 30 seconds in some cases [12].

Group 4: Fault Tolerance in MoE Model Inference
- The article outlines a three-tier fault tolerance strategy for large-scale MoE model inference that minimizes user impact during hardware failures [14][15].
- Techniques such as instance-level rapid restart and token-level retries have cut recovery times from 20 minutes to as low as 5 minutes (a minimal retry sketch follows this summary) [15].

Group 5: Fault Management and Diagnostic Capabilities
- A real-time monitoring system continuously checks the health of each computer in the cluster, enabling quick identification and resolution of issues [16].
- Huawei's comprehensive fault management solution includes capabilities for error detection, isolation, and recovery, improving the reliability of the computing cluster [16].

Group 6: Simulation and Modeling
- Before training complex AI models, the computing cluster can simulate various scenarios in a virtual environment to identify potential bottlenecks and optimize performance [19][20].
- A Markov modeling simulation platform enables efficient resource allocation and performance tuning, improving throughput and reducing communication delays [20][21].

Group 7: Framework Migration
- Huawei's MindSpore framework has evolved rapidly since its open-source launch, providing tools for seamless migration from other frameworks and improving execution efficiency [23].
- The framework supports one-click deployment of large models, significantly improving inference performance [23].

Group 8: Future Outlook
- The article concludes that computing infrastructure will evolve along a collaborative path between algorithms, computing power, and engineering capability, potentially forming a closed loop of innovation driven by application demand [25].
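To make the token-level retry idea in Group 4 concrete, here is a minimal, illustrative Python sketch: on a transient fault, only the failed decoding step is retried instead of restarting the whole request. All names here (generate_token, TransientFault) are hypothetical stand-ins for illustration, not Huawei or CloudMatrix APIs.

```python
import time

class TransientFault(Exception):
    """Hypothetical stand-in for a recoverable hardware/communication error."""

def generate_with_retries(prompt_tokens, max_new_tokens, generate_token,
                          max_retries=3, backoff_s=0.5):
    """Decode token by token; on a transient fault, retry only the failed
    token instead of restarting the whole request."""
    output = list(prompt_tokens)
    for _ in range(max_new_tokens):
        for attempt in range(max_retries + 1):
            try:
                output.append(generate_token(output))
                break  # token produced; move on to the next position
            except TransientFault:
                if attempt == max_retries:
                    raise  # retries exhausted: escalate (e.g., instance-level restart)
                time.sleep(backoff_s * (attempt + 1))  # brief backoff, then retry

    return output

# Toy usage: a decoder that emits the next integer token.
print(generate_with_retries([1, 2], 3, lambda ctx: ctx[-1] + 1))  # [1, 2, 3, 4, 5]
```

Retrying at token granularity preserves the already-generated prefix, which is why recovery can drop from whole-request restarts to seconds; escalating to an instance-level restart remains the fallback once retries are exhausted.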
Huawei's Ascend 10,000-Card Cluster Revealed: How to Tame the AI Compute "Beast"?
机器之心 · 2025-06-09 04:33
Core Viewpoint
- The article discusses advances in AI computing power clusters, highlighting their critical role in supporting large-scale AI models and ensuring high availability, fault tolerance, and efficient resource management [2][4][39].

Group 1: High Availability of Super Nodes
- AI training and inference must run continuously, like a hospital's emergency system; each computer in the cluster has a backup ready to take over on failure, so tasks proceed uninterrupted [6][5].
- Huawei's CloudMatrix 384 super node employs a fault tolerance scheme spanning system-level, business-level, and operational-level fault tolerance, transforming faults into manageable issues [7][8].

Group 2: Cluster Linearity
- The ideal for computing power clusters is linear scalability: the total power of 100 computers should be 100 times that of one, achieved through precise task allocation algorithms [10].
- Huawei's team has developed key technologies to improve training linearity for large models, reaching 96% linearity for the Pangu Ultra 135B model on 4K cards [11][13].

Group 3: Rapid Recovery in Large-Scale Training
- When training across thousands of computing units, the system automatically saves progress, enabling quick recovery from faults without starting over and significantly reducing downtime (a checkpoint-resume sketch follows this summary) [14][15].
- Innovations such as process-level rescheduling and online recovery cut recovery times to under 3 minutes, and to 30 seconds for specific faults [16][20].

Group 4: Fault Management and Diagnosis
- A real-time monitoring system continuously checks the health of each computer in the cluster, enabling issues to be identified and resolved before they escalate [24][26].
- Huawei has developed a comprehensive fault management framework with capabilities for error detection, isolation, and recovery, improving the reliability of the computing infrastructure [24][28].

Group 5: Simulation and Modeling
- Before deploying complex AI models, the computing cluster can simulate scenarios in a virtual environment to identify potential bottlenecks and optimize resource allocation [29][30].
- A Markov modeling simulation platform enables multi-dimensional analysis and performance prediction, improving resource efficiency and system stability [30][31].

Group 6: Framework Migration
- Huawei's MindSpore framework has evolved rapidly since its open-source launch, providing tools for seamless migration from other frameworks and improving training and inference performance [37][38].
- The framework supports a wide range of applications, enabling rapid deployment of large models and stronger inference capabilities [38][39].
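As a companion to Group 3, below is a minimal checkpoint-resume sketch in Python. It only illustrates the general pattern described above, saving progress periodically and resuming from the last checkpoint instead of step 0; the file name, JSON format, and toy train loop are assumptions for illustration, not MindSpore or CloudMatrix internals.

```python
import json
import os

# Illustrative path and format; real systems checkpoint model/optimizer
# state to distributed storage, not a local JSON file.
CKPT = "train_state.json"

def save_checkpoint(step, state):
    """Write to a temp file, then atomically rename it, so a crash
    mid-save never leaves a torn checkpoint behind."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}

def train(total_steps=1000, save_every=100):
    step, state = load_checkpoint()  # resume where we left off, not at step 0
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for one real training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(step, state)

if __name__ == "__main__":
    train()  # kill and rerun: it resumes from the last saved step
```

The write-to-temp-then-rename step matters in practice: os.replace is atomic on POSIX filesystems, so a failure mid-save can never corrupt the last good checkpoint, which is the property a resume-from-checkpoint recovery path depends on.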