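Huawei Sets a New AI Computing Record: 98% Training Availability on a 10,000-Card Cluster, with Second-Level Recovery and Minute-Level Diagnosis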

Core Viewpoint
- The core capability of large models is stable performance output, which ultimately rests on powerful computing clusters. Building a computing cluster with tens of thousands of cards has become a globally recognized technical challenge [1].

Group 1: AI Computing Cluster Performance
- Huawei's Ascend computing cluster can run with near-zero downtime, which is essential for AI applications that must operate continuously [2][3].
- AI inference needs to reach 99.95% availability to be considered reliable [5].
- Huawei has publicly disclosed the technology behind its high-availability AI computing clusters [6].

Group 2: Intelligent Insurance Systems
- Huawei has developed three core capabilities to address the complex challenges that AI computing clusters face: full-stack observability, efficient fault diagnosis, and a self-healing system [8][12][13].
- Full-stack observability is built on a monitoring system that supports 98% training availability, linearity above 95%, recovery within seconds, and diagnosis within minutes [9][10].
- The fault diagnosis system consists of a fault mode library, cross-domain fault diagnosis, computing node fault diagnosis, and network fault diagnosis, which together make faults significantly faster to locate [19][20].

Group 3: Recovery and Efficiency
- Huawei's recovery system restores training tasks quickly, with recovery times as short as 30 seconds on large-scale clusters [29][30].
- Training linearity for the Pangu Ultra 135B model reaches 96% on a 4K-card cluster, which indicates efficient resource utilization [24].
- The company uses technologies such as TACO, NSF, NB, and AICT to optimize task distribution and communication within the cluster [31].

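To put the availability and linearity figures above in concrete terms, here is a minimal Python sketch. It is not taken from Huawei's material. It assumes the common definitions: availability is the fraction of time a service is usable, and linearity is measured throughput divided by ideal linear scaling from a smaller baseline. The function names and the throughput numbers in the linearity example are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Hours of unavailability allowed in a period at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


def linearity(throughput: float, cards: int,
              base_throughput: float, base_cards: int) -> float:
    """Measured throughput relative to ideal linear scaling from a baseline.

    Assumed definition: the source reports linearity figures but does not
    spell out the formula.
    """
    ideal = base_throughput * (cards / base_cards)
    return throughput / ideal


# 99.95% inference availability over one year.
inference_budget = downtime_budget_hours(0.9995, HOURS_PER_YEAR)

# 98% training availability over a 30-day training run.
training_budget = downtime_budget_hours(0.98, 30 * 24)

# Hypothetical throughputs: 100 samples/s on 256 cards, 1536 samples/s on 4096.
example_linearity = linearity(1536.0, 4096, 100.0, 256)

print(f"inference downtime budget: {inference_budget:.2f} h/year")
print(f"training downtime budget:  {training_budget:.1f} h per 30 days")
print(f"linearity: {example_linearity:.0%}")
```

Under these definitions, 99.95% availability leaves roughly 4.4 hours of downtime per year, and 98% training availability leaves about 14.4 hours of lost time in a 30-day run. That gap is why recovery measured in seconds rather than hours matters at this scale.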
Group 4: AI Inference Stability
- New large-model architectures require significantly more hardware, which raises the likelihood of faults that can disrupt AI inference [32][33].
- Huawei has devised a three-step "insurance plan" to limit the impact of faults on AI inference and keep operations stable [34].
- The internal recovery technology cuts recovery time to under 5 minutes, and token-level retry technology restores operation in less than 10 seconds, greatly improving system stability [35][36].

Group 5: Overall Innovation and Benefits
- Huawei's "3+3" dual-dimensional technical system pairs three capabilities (fault perception and diagnosis, fault management, and cluster optical-link fault tolerance) with support capabilities for training and inference [37].
- These innovations have delivered significant improvements, including 98% training availability on large clusters and rapid recovery [37].
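
The source does not describe how the token-level retry mentioned in Group 4 is implemented. As a generic illustration of the idea only, the Python sketch below retries a single failed decoding step and keeps the tokens already generated, instead of restarting the whole request. Every name here (`TransientFault`, `generate_with_token_retry`, `make_flaky_step`) is hypothetical, and the "model" is a toy function.

```python
from typing import Callable, Dict, List, Tuple


class TransientFault(Exception):
    """Hypothetical stand-in for a recoverable hardware or link fault."""


def generate_with_token_retry(
    step: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    max_retries_per_token: int = 3,
) -> List[int]:
    """Generate tokens one at a time, retrying only the step that failed."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        for attempt in range(max_retries_per_token + 1):
            try:
                tokens.append(step(tokens))
                break  # this token succeeded; move on to the next one
            except TransientFault:
                if attempt == max_retries_per_token:
                    raise  # retries exhausted; surface the fault to the caller
    return tokens[len(prompt):]


def make_flaky_step(fail_every: int) -> Tuple[Callable[[List[int]], int], Dict[str, int]]:
    """Build a toy decoding step that raises on every `fail_every`-th call."""
    calls = {"n": 0}

    def step(tokens: List[int]) -> int:
        calls["n"] += 1
        if calls["n"] % fail_every == 0:
            raise TransientFault("simulated fault")
        return len(tokens)  # toy model: the next token is the current length

    return step, calls


flaky_step, call_counter = make_flaky_step(fail_every=3)
generated = generate_with_token_retry(flaky_step, prompt=[0, 1], max_new_tokens=4)
print(generated, call_counter["n"])
```

In this toy run the third call fails and is retried once, so four tokens cost five calls rather than a full restart. A real system would also need to restore or rebuild decoding state such as the KV cache, which this sketch leaves out.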