Core Viewpoint
- The article emphasizes the importance of high availability in AI computing clusters, likening them to a factory production line that must run without interruption to support the demands of AI applications [1][8].

Group 1: High Availability as a Core Foundation
- High availability is crucial for AI computing clusters: it ensures continuous, reliable operation, allowing AI to drive business innovation effectively [1].
- Fault diagnosis in large AI clusters is complex; fault localization currently takes hours to days, which necessitates advanced observability capabilities [2][3].
- Huawei's team has developed a comprehensive reliability analysis model for AI clusters, achieving a hardware mean time between failures (MTBF) of over 24 hours [3].

Group 2: Fault Tolerance and Recovery Mechanisms
- Huawei proposes a multi-layered fault tolerance solution for supernodes, achieving a fault tolerance rate of over 99% for optical modules through various advanced techniques [4].
- Training recovery time for large AI clusters has been reduced to under 10 minutes, with process-level recovery further optimized to as little as 30 seconds [6].
- A three-tiered fault tolerance strategy for large-scale inference architectures minimizes user impact during failures [7].

Group 3: Innovations Supporting High Availability
- Six innovative solutions have been proposed to enhance the high availability of AI computing clusters, including fault perception, fault management, and optical link fault tolerance [8].
- The availability of large AI clusters has reached 98%, with training and inference recovery achieving second-level speeds and linearity exceeding 95% [8].
- Future exploration will focus on diverse application scenarios, new architectural breakthroughs, and intelligent autonomous maintenance [8].
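The figures above (MTBF over 24 hours, recovery under 10 minutes or as low as 30 seconds) connect through the standard steady-state availability approximation, availability = MTBF / (MTBF + MTTR). The article does not give this formula; the sketch below is illustrative only, plugging in the cited numbers to show why shrinking recovery time matters so much:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair/recover (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# 24 h hardware MTBF with a 10-minute training recovery window
print(f"{availability(24, 10 / 60):.4f}")    # ≈ 0.9931

# Cutting recovery to 30 s (process-level restart) pushes availability higher
print(f"{availability(24, 30 / 3600):.4f}")  # ≈ 0.9997
```

Note that this single-failure-mode model is a simplification: the 98% cluster availability cited in the article aggregates many components and failure classes, not one MTBF/MTTR pair.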
How stable are Ascend (昇腾) AI computing clusters? 98% availability at 10,000-card scale and second-level fault recovery take the worry out of failures
21st Century Business Herald (21世纪经济报道) · 2025-06-10 12:55