High Availability of AI Computing Clusters
How stable are Ascend AI computing clusters? 98% availability at ten-thousand-card scale, with second-level fault recovery
21st Century Business Herald · 2025-06-10 12:55
Core Viewpoint
- The article emphasizes the importance of high availability in AI computing clusters, likening them to a factory production line that must operate without interruption to support the demands of AI applications [1][8].

Group 1: High Availability as a Core Foundation
- High availability is crucial for AI computing clusters to ensure continuous operation and reliability, allowing AI to drive business innovation effectively [1].
- Fault diagnosis in large AI clusters is highly complex: fault localization currently takes hours to days, necessitating advanced observability capabilities [2][3].
- Huawei's team has developed a comprehensive reliability analysis model for AI clusters, achieving a mean time between failures (MTBF) of over 24 hours for hardware reliability [3].

Group 2: Fault Tolerance and Recovery Mechanisms
- Huawei proposes a multi-layered fault tolerance solution for supernodes, achieving a fault tolerance rate of over 99% for optical module failures through a combination of advanced techniques [4].
- Training recovery time for large AI clusters has been reduced to under 10 minutes, with process-level recovery further optimized to as low as 30 seconds [6].
- A three-tier fault tolerance strategy has been introduced for large-scale inference architectures, minimizing user impact during failures [7].

Group 3: Innovations Supporting High Availability
- Six innovative solutions have been proposed to enhance the high availability of AI computing clusters, covering fault perception, fault management, and optical link fault tolerance, among others [8].
- The availability of large AI clusters has reached 98%, with training and inference recovery reaching second-level (order-of-seconds) speeds and linearity exceeding 95% [8].
- Future exploration will focus on diverse application scenarios, new architectural breakthroughs, and intelligent autonomous maintenance [8].
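The 98% availability and 24-hour MTBF figures above can be related through the standard steady-state availability formula. A minimal sketch, assuming availability = MTBF / (MTBF + MTTR); the repair times used here are illustrative assumptions, since the article does not publish Huawei's mean time to repair:

```python
# Steady-state availability from MTBF and MTTR (standard reliability formula).
# MTTR values below are illustrative assumptions, not figures from the article.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# With the reported MTBF of 24 hours, a mean repair time of roughly
# 29 minutes already corresponds to 98% availability:
print(round(availability(24.0, 24.0 * 0.02 / 0.98), 4))  # 0.98

# Second-level recovery (e.g. 30 s) pushes availability far higher,
# which is why fast recovery matters as much as failure rate:
print(round(availability(24.0, 30 / 3600), 6))
```

This illustrates why the articles pair MTBF improvements with recovery-time reductions: availability depends on both terms.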
How stable are Ascend AI computing clusters? 98% availability at ten-thousand-card scale, with second-level fault recovery
Leiphone · 2025-06-10 10:30
Core Viewpoint
- The article discusses how Huawei enhances the efficiency and stability of AI computing clusters, emphasizing the importance of high availability to support continuous operation and minimize downtime in AI applications [2][16].

Group 1: High Availability Core Infrastructure
- AI computing clusters face complex fault diagnosis challenges due to large system scale and intricate technology stacks, with fault localization taking from hours to days [4].
- Huawei has developed a full-stack observability capability to improve fault detection and management, which includes a fault mode library and cross-domain fault diagnosis [4].
- The CloudMatrix super node achieves a mean time between failures (MTBF) of over 24 hours, significantly enhancing hardware reliability [4].

Group 2: Fault Tolerance and Reliability
- Huawei's super node architecture leverages optical link software fault tolerance solutions, achieving a fault tolerance rate of over 99% for optical module failures [5][6].
- The recovery time for high-bandwidth memory (HBM) multi-bit ECC faults has been reduced to 1 minute, resulting in a 5% decrease in computing power loss due to faults [6].

Group 3: Training and Inference Efficiency
- The linearity metric measures the improvement in training task speed relative to the number of computing cards, with Huawei achieving a linearity of 96% for the Pangu Ultra 135B model on a 4K-card setup [10].
- Huawei's training recovery system can restore training tasks in under 10 minutes, with process-level recovery reducing this to as low as 30 seconds [12].
- For large EP inference architectures, Huawei has proposed a three-tier fault tolerance solution to minimize user impact during hardware failures [12][14].

Group 4: Future Directions
- Huawei aims to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and innovative engineering paradigms focused on observability and intelligent autonomy [16].
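The linearity figure cited above has a simple arithmetic definition. A minimal sketch, assuming linearity is measured as achieved speedup divided by ideal linear speedup when scaling from a baseline card count; the throughput numbers are illustrative assumptions, not measured Pangu Ultra data, and the article does not specify Huawei's exact measurement procedure:

```python
# Scaling linearity: how close a training job comes to ideal linear speedup.
# All throughput numbers below are illustrative assumptions.

def linearity(throughput_base: float, cards_base: int,
              throughput_scaled: float, cards_scaled: int) -> float:
    """Achieved speedup divided by the ideal (linear) speedup."""
    ideal_speedup = cards_scaled / cards_base
    actual_speedup = throughput_scaled / throughput_base
    return actual_speedup / ideal_speedup

# Example: scaling from 256 to 4096 cards at 96% efficiency means
# throughput grows 15.36x instead of the ideal 16x:
print(round(linearity(1000.0, 256, 15360.0, 4096), 2))  # 0.96
```

A linearity of 1.0 would mean perfect scaling; communication overhead and stragglers pull real clusters below that, which is why 96% at 4K-card scale is presented as a strong result.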
What gives Huawei the confidence to promise "never offline" and second-level recovery?
Huxiu APP · 2025-06-10 10:18
Core Viewpoint
- The article discusses the importance of achieving high availability in AI computing clusters, emphasizing the need for robust fault detection, management, and recovery systems to ensure continuous operation and efficiency in AI applications [1][3].

Group 1: High Availability Core Foundation
- AI computing clusters face complex fault localization challenges due to large system scales and intricate technology stacks, requiring advanced fault detection and diagnosis capabilities [5].
- Huawei has developed a comprehensive observability capability for large-scale clusters, which includes various monitoring and diagnostic tools to enhance operational efficiency [5][6].
- The company aims to achieve a mean time between failures (MTBF) of over 24 hours for its CloudMatrix supernode clusters, significantly improving hardware reliability [6].

Group 2: High Availability Supporting Business
- Huawei's innovative technologies, such as TACO and NSF, have improved the linearity of training tasks, allowing for efficient scaling of AI models [8][11].
- The training recovery time for large AI clusters has been optimized to under 10 minutes, with advanced techniques enabling recovery times as low as 30 seconds [12][14].
- A three-tier fault tolerance scheme has been proposed to address reliability issues in large-scale inference architectures, minimizing user impact during hardware failures [16].

Group 3: Future Directions
- Huawei plans to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and the development of intelligent autonomous maintenance systems [18].
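The recovery-time claims repeated across all three articles (sub-10-minute task recovery, 30-second process-level recovery) can be put in perspective with standard checkpoint-interval arithmetic. A minimal sketch, assuming periodic checkpointing where a failure loses on average half a checkpoint interval of work plus the restart time; the interval and restart values are illustrative assumptions, not Huawei's published parameters:

```python
# Expected downtime per failure under periodic checkpointing:
# on average half a checkpoint interval of lost work, plus restart time.
# All parameter values below are illustrative assumptions.

def expected_downtime_per_failure(checkpoint_interval_s: float,
                                  restart_s: float) -> float:
    """Mean lost work plus time to resume from the last checkpoint, in seconds."""
    return checkpoint_interval_s / 2 + restart_s

# Full job restart: 10-minute checkpoints plus a 5-minute reload:
print(expected_downtime_per_failure(600, 300))  # 600.0

# Process-level recovery keeps model state resident, shrinking the
# restart term to seconds; combined with tighter checkpointing, this
# is what makes second-level recovery figures plausible:
print(expected_downtime_per_failure(60, 30))    # 60.0
```

The design point this illustrates: cutting the restart term matters little unless checkpoint intervals shrink with it, which is why process-level recovery and fast checkpointing are presented together.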