How Stable Are Ascend AI Computing Clusters? 98% Availability at 10,000-Card Scale, with Fault Recovery in Seconds

Core Viewpoint - The article describes how Huawei improves the efficiency and stability of AI computing clusters, stressing that high availability is what keeps AI workloads running continuously with minimal downtime [2][16].

Group 1: High Availability Core Infrastructure
- AI computing clusters are hard to diagnose because of their large scale and deep technology stacks; localizing a single fault can take anywhere from hours to days [4].
- Huawei has built a full-stack observability capability to improve fault detection and management, including a fault mode library and cross-domain fault diagnosis [4].
- The CloudMatrix super node achieves a mean time between failures (MTBF) of over 24 hours, significantly enhancing hardware reliability [4].

Group 2: Fault Tolerance and Reliability
- Huawei's super node architecture uses software-based fault tolerance for optical links, tolerating over 99% of optical module failures [5][6].
- Recovery time for high-bandwidth memory (HBM) multi-bit ECC faults has been cut to 1 minute, reducing the computing power lost to faults by 5% [6].

Group 3: Training and Inference Efficiency
- Linearity measures how closely training speed scales with the number of computing cards; Huawei reports 96% linearity for the Pangu Ultra 135B model on a 4K-card cluster [10].
- Huawei's training recovery system can restore a training task in under 10 minutes, and process-level recovery brings this down to as little as 30 seconds [12].
- For large-scale EP (expert parallelism) inference architectures, Huawei has proposed a three-tier fault tolerance solution to minimize user impact during hardware failures [12][14].

Group 4: Future Directions
- Huawei aims to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and new engineering paradigms focused on observability and intelligent autonomy [16].
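
The linearity and availability figures quoted above can be made concrete with a small calculation. The sketch below is illustrative only: the article does not give Huawei's exact formulas, so it assumes the common definitions, namely linearity as measured speedup divided by ideal speedup, and steady-state availability as MTBF / (MTBF + MTTR). The function names, the 256-card baseline, and the throughput numbers are hypothetical; only the 4K-card scale, the 24-hour MTBF, and the 10-minute and 30-second recovery times come from the text.

```python
def linearity(base_cards, base_throughput, scaled_cards, scaled_throughput):
    """Scaling efficiency: measured speedup divided by ideal (linear) speedup.

    Assumed definition; the article only says linearity relates training
    speed to the number of computing cards.
    """
    if base_cards <= 0 or scaled_cards <= 0 or base_throughput <= 0:
        raise ValueError("card counts and baseline throughput must be positive")
    measured_speedup = scaled_throughput / base_throughput
    ideal_speedup = scaled_cards / base_cards
    return measured_speedup / ideal_speedup


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR), a textbook formula."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical throughputs chosen so that the result matches the reported 96%.
lin = linearity(base_cards=256, base_throughput=1000.0,
                scaled_cards=4096, scaled_throughput=15360.0)
print(f"linearity: {lin:.2%}")

# MTBF of 24 hours combined with the two recovery times named in the article.
# Pairing these particular numbers is an assumption made for illustration.
for label, mttr_hours in [("10-minute recovery", 10 / 60),
                          ("30-second recovery", 30 / 3600)]:
    print(f"{label}: availability {availability(24.0, mttr_hours):.4%}")
```

Under these assumptions, shortening recovery time raises availability even when MTBF is unchanged, which is why the article treats fast recovery and hardware reliability as complementary levers. The calculation is not a reconstruction of how the 98% cluster-level figure in the headline was obtained.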