Keeping the computing-power aircraft carrier on a steady course: Huawei discloses the ballast of its Ascend computing infrastructure for the first time
21世纪经济报道 · 2025-06-09 12:08
Core Viewpoint
- The article discusses advances in AI computing clusters, emphasizing their critical role in enhancing the capabilities of AI models through innovative engineering solutions and fault-tolerance mechanisms [1].

Group 1: Supernode High Availability
- AI training and inference require continuous operation; each computer in the cluster has a backup that takes over during failures so tasks continue without interruption [1].
- Huawei's fault-tolerance solution combines system-level, business-level, and operational-level strategies to turn faults into manageable events [1].

Group 2: Cluster Linearity
- The ideal for a computing cluster is linear scalability, where performance increases in proportion to the number of machines [1].
- Huawei employs advanced task-allocation algorithms and related technologies to achieve high linearity in model training, reporting linearity of about 96% across the tested configurations (a minimal sketch of the metric follows this summary) [1].

Group 3: Rapid Recovery in Large-Scale Training
- The system automatically saves training progress, allowing quick recovery from failures without starting over [1].
- Innovations include process-level rescheduling and online recovery techniques that reduce recovery times to under 3 minutes [1].

Group 4: Large-Scale MoE Model Inference Recovery
- The article outlines a three-tier fault-tolerance strategy for large-scale MoE model inference that minimizes user impact during hardware failures [1].
- Techniques such as rapid instance restart and token-level retries have been validated to cut recovery times significantly (see the retry sketch after this summary) [1].

Group 5: Fault Management and Diagnostic Awareness
- A real-time monitoring system continuously tracks the health of each machine in the cluster, enabling quick fault detection and diagnosis [1].
- Huawei's comprehensive fault-management solution improves reliability through advanced diagnostic capabilities and proactive maintenance strategies [1].

Group 6: Simulation Modeling
- The article introduces a Markov modeling and simulation platform that pre-tests AI workloads in a virtual environment, identifying potential bottlenecks before real-world deployment [1].
- This approach optimizes resource allocation and improves the overall efficiency of the computing cluster [1].

Group 7: Framework Migration
- Huawei's MindSpore framework supports integration with mainstream ecosystems, facilitating the deployment of large models and improving inference performance [1].
- The framework includes tools for adapting third-party frameworks, ensuring compatibility and efficiency in AI model training and inference [1].
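The linearity figure quoted in Group 2 is commonly defined as the ratio of achieved speedup to ideal speedup as the cluster grows. Below is a minimal sketch of that metric; the device counts and throughput numbers are illustrative assumptions, not Huawei's published measurements.

```python
def linearity(throughput_scaled: float, throughput_base: float,
              n_devices: int, base_devices: int) -> float:
    """Ratio of achieved speedup to ideal (linear) speedup when scaling
    a training job from `base_devices` to `n_devices`."""
    achieved_speedup = throughput_scaled / throughput_base
    ideal_speedup = n_devices / base_devices
    return achieved_speedup / ideal_speedup

# Hypothetical example: scaling from 256 to 4096 cards. If aggregate
# throughput grows 15.36x instead of the ideal 16x, linearity is 0.96,
# i.e. the ~96% level reported in the article.
print(linearity(throughput_scaled=15.36, throughput_base=1.0,
                n_devices=4096, base_devices=256))  # -> 0.96
```

The token-level retry mentioned in Group 4 can be pictured as retrying a single decode step rather than failing the whole request. The sketch below is a generic illustration under that assumption; `TransientDeviceError` and `decode_step` are hypothetical stand-ins, not Huawei's API.

```python
import time

class TransientDeviceError(RuntimeError):
    """Stand-in for a recoverable accelerator fault during one decode step."""

def decode_token_with_retry(decode_step, state, max_retries: int = 3,
                            backoff_s: float = 0.05):
    """Retry the decode of a single token instead of aborting the request."""
    for attempt in range(max_retries + 1):
        try:
            return decode_step(state)              # produce the next token
        except TransientDeviceError:
            if attempt == max_retries:
                raise                              # escalate, e.g. to instance restart
            time.sleep(backoff_s * (attempt + 1))  # brief backoff, then retry the same token
```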
Inside Huawei's Ascend 10,000-card cluster: how do you tame the AI compute "beast"?
机器之心 · 2025-06-09 04:33
Core Viewpoint
- The article discusses advances in AI computing power clusters, highlighting their critical role in supporting large-scale AI models and ensuring high availability, fault tolerance, and efficient resource management [2][4][39].

Group 1: High Availability of Super Nodes
- AI training and inference require continuous operation, much like a hospital's emergency system: each computer in the cluster has a backup that takes over in case of failure, keeping tasks uninterrupted [5][6].
- Huawei's CloudMatrix 384 super node employs a fault-tolerance scheme spanning system-level, business-level, and operational-level fault tolerance, turning faults into manageable issues [7][8].

Group 2: Cluster Linearity
- The ideal for a computing power cluster is linear scalability: the total power of 100 computers should be 100 times that of one, achieved through precise task-allocation algorithms [10].
- Huawei's team has developed key technologies to improve training linearity for large models, achieving about 96% linearity for the Pangu Ultra 135B model on 4K cards [11][13].

Group 3: Rapid Recovery in Large-Scale Training
- When training across thousands of computing units, the system automatically saves progress, allowing quick recovery from faults without starting over and significantly reducing downtime (a checkpointing sketch follows this summary) [14][15].
- Innovations such as process-level rescheduling and online recovery cut recovery times to under 3 minutes, and to about 30 seconds for specific fault types [16][20].

Group 4: Fault Management and Diagnosis
- A real-time monitoring system continuously checks the health of every machine in the cluster, enabling quick identification and resolution of issues before they escalate [24][26].
- Huawei has built a comprehensive fault-management framework covering error detection, isolation, and recovery, improving the reliability of the computing infrastructure [24][28].

Group 5: Simulation and Modeling
- Before deploying complex AI models, the cluster can simulate scenarios in a virtual environment to identify potential bottlenecks and optimize resource allocation [29][30].
- A Markov modeling and simulation platform enables multi-dimensional analysis and performance prediction, improving resource efficiency and system stability (see the sketch after this summary) [30][31].

Group 6: Framework Migration
- Huawei's MindSpore framework has evolved rapidly since its open-source launch, providing tools for seamless migration from other frameworks and improving performance in training and inference [37][38].
- The framework supports a wide range of applications, enabling quick deployment of large models and improving inference capabilities [38][39].
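The rapid-recovery idea in Group 3 rests on periodically checkpointing training state and resuming from the last checkpoint after a fault. The sketch below illustrates that pattern with PyTorch-style state dicts; the framework choice, file path, and interval are assumptions for illustration, not details from the article.

```python
import os
import torch

CKPT_PATH = "ckpt_latest.pt"  # hypothetical path

def save_checkpoint(model, optimizer, step: int) -> None:
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer) -> int:
    """Return the step to resume from (0 if no checkpoint exists yet)."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"] + 1

def train(model, optimizer, train_step, total_steps: int, ckpt_every: int = 100):
    start = load_checkpoint(model, optimizer)      # resume instead of restarting from scratch
    for step in range(start, total_steps):
        train_step(model, optimizer, step)
        if (step + 1) % ckpt_every == 0:
            save_checkpoint(model, optimizer, step)
```

The Markov modeling platform in Group 5 can be pictured as a state machine over node health whose steady-state distribution estimates long-run availability. The states and transition probabilities below are invented for illustration only; they are not parameters from the article.

```python
import numpy as np

# States per simulation tick: 0 = healthy, 1 = degraded, 2 = failed.
P = np.array([
    [0.990, 0.008, 0.002],   # healthy node mostly stays healthy
    [0.300, 0.650, 0.050],   # degraded node is usually repaired or isolated
    [0.200, 0.000, 0.800],   # failed node is eventually recovered and rejoins
])

def steady_state(P: np.ndarray) -> np.ndarray:
    """Long-run fraction of time in each state (left eigenvector for eigenvalue 1)."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return v / v.sum()

pi = steady_state(P)
print(f"estimated healthy share of time: {pi[0]:.3f}")
```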