Never Offline, Recovery in Seconds: What Gives Huawei the Confidence to Promise That?
Huxiu APP (虎嗅APP) · 2025-06-10 10:18

Core Viewpoint
- The article discusses the importance of achieving high availability in AI computing clusters, emphasizing the need for robust fault detection, management, and recovery systems to ensure continuous operation and efficiency in AI applications [1][3].

Group 1: High Availability Core Foundation
- AI computing clusters face complex fault localization challenges due to large system scales and intricate technology stacks, requiring advanced fault detection and diagnosis capabilities [5].
- Huawei has developed a comprehensive observability capability for large-scale clusters, which includes various monitoring and diagnostic tools to enhance operational efficiency [5][6].
- The company aims to achieve a mean time between failures (MTBF) of over 24 hours for its CloudMatrix supernode clusters, significantly improving hardware reliability [6].

Group 2: High Availability Supporting Business
- Huawei's innovative technologies, such as TACO and NSF, have improved the linearity of training tasks, allowing for efficient scaling of AI models [8][11].
- The training recovery time for large AI clusters has been optimized to under 10 minutes, with advanced techniques enabling recovery times as low as 30 seconds [12][14].
- A three-tier fault tolerance scheme has been proposed to address reliability issues in large-scale inference architectures, minimizing user impact during hardware failures [16].

Group 3: Future Directions
- Huawei plans to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and the development of intelligent autonomous maintenance systems [18].
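
To put the MTBF and recovery-time figures cited in Group 1 and Group 2 in perspective, the sketch below applies the standard steady-state availability formula, availability = MTBF / (MTBF + recovery time). This calculation is not from the article. It assumes that each failure costs exactly one recovery interval and that the quoted bounds (24 hours, 10 minutes, 30 seconds) can be used as point values; the function name `availability` is illustrative only.

```python
def availability(mtbf_s: float, recovery_s: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime per failure).

    Assumption (not from the article): every failure costs exactly one
    recovery interval, and failures do not overlap.
    """
    if mtbf_s <= 0 or recovery_s < 0:
        raise ValueError("mtbf_s must be positive and recovery_s non-negative")
    return mtbf_s / (mtbf_s + recovery_s)


SECONDS_PER_DAY = 24 * 3600
MTBF_S = 24 * 3600  # the article's target is "over 24 hours"; used here as a point value

# Compare the two recovery times quoted in the article.
for label, recovery_s in [("10 min", 600), ("30 s", 30)]:
    a = availability(MTBF_S, recovery_s)
    lost_per_day = (1 - a) * SECONDS_PER_DAY
    print(f"recovery {label}: availability {a:.4%}, about {lost_per_day:.0f} s lost per day")
```

Under these assumptions, shortening recovery from 10 minutes to 30 seconds at a fixed MTBF shrinks the lost training time by roughly the same factor of 20, which is why recovery speed matters as much as hardware reliability for a long-running job.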