Atlas 800T A2 cluster
How stable are Ascend AI computing clusters? 98% availability at 10,000-card scale, with second-level fault recovery
雷峰网 (Leiphone) · 2025-06-10 10:30
Core Viewpoint - The article discusses how Huawei enhances the efficiency and stability of AI computing clusters, emphasizing the importance of high availability to support continuous operation and minimize downtime in AI applications [2][16].

Group 1: High Availability Core Infrastructure
- AI computing clusters face complex fault diagnosis challenges due to large system scale and intricate technology stacks, with fault localization taking from hours to days [4].
- Huawei has developed a full-stack observability capability to improve fault detection and management, which includes a fault mode library and cross-domain fault diagnosis [4].
- The CloudMatrix super node achieves a mean time between failures (MTBF) of over 24 hours, significantly enhancing hardware reliability [4].

Group 2: Fault Tolerance and Reliability
- Huawei's super node architecture leverages optical link software fault tolerance solutions, achieving a fault tolerance rate of over 99% for optical module failures [5][6].
- The recovery time for high-bandwidth memory (HBM) multi-bit ECC faults has been reduced to 1 minute, resulting in a 5% decrease in computing power loss due to faults [6].

Group 3: Training and Inference Efficiency
- The linearity metric measures the improvement in training task speed relative to the number of computing cards, with Huawei achieving a linearity of 96% for the Pangu Ultra 135B model using a 4K card setup [10].
- Huawei's training recovery system can restore training tasks in under 10 minutes, with process-level recovery reducing this to as low as 30 seconds [12].
- For large EP inference architectures, Huawei has proposed a three-tier fault tolerance solution to minimize user impact during hardware failures [12][14].

Group 4: Future Directions
- Huawei aims to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and innovative engineering paradigms focused on observability and intelligent autonomy [16].
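
The linearity metric described above can be illustrated with a small calculation. This is a minimal sketch that assumes a common definition (measured speedup divided by ideal linear speedup); the summary does not give Huawei's exact formula, and the function name, baseline size, and throughput figures below are hypothetical values chosen only for illustration.

```python
def linearity(base_cards, base_throughput, scaled_cards, scaled_throughput):
    """Return measured speedup divided by ideal (linear) speedup.

    Assumed definition, not taken from the article: a value of 1.0 means
    throughput grew exactly in proportion to the number of cards.
    """
    if base_cards <= 0 or scaled_cards <= 0 or base_throughput <= 0:
        raise ValueError("card counts and baseline throughput must be positive")
    ideal_speedup = scaled_cards / base_cards
    measured_speedup = scaled_throughput / base_throughput
    return measured_speedup / ideal_speedup


# Hypothetical measurements: scaling from 1,024 to 4,096 cards (4x ideal)
# while throughput rises from 100 to 384 samples per second (3.84x measured).
value = linearity(1024, 100.0, 4096, 384.0)
print(f"linearity = {value:.2%}")
```

Under this definition, a linearity below 100% quantifies the share of added compute lost to communication and synchronization overhead as the cluster grows.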
Daring to claim "never offline" and second-level recovery: what gives Huawei its confidence?
虎嗅APP (Huxiu) · 2025-06-10 10:18
Core Viewpoint - The article discusses the importance of achieving high availability in AI computing clusters, emphasizing the need for robust fault detection, management, and recovery systems to ensure continuous operation and efficiency in AI applications [1][3].

Group 1: High Availability Core Foundation
- AI computing clusters face complex fault localization challenges due to large system scales and intricate technology stacks, requiring advanced fault detection and diagnosis capabilities [5].
- Huawei has developed a comprehensive observability capability for large-scale clusters, which includes various monitoring and diagnostic tools to enhance operational efficiency [5][6].
- The company aims to achieve a mean time between failures (MTBF) of over 24 hours for its CloudMatrix supernode clusters, significantly improving hardware reliability [6].

Group 2: High Availability Supporting Business
- Huawei's innovative technologies, such as TACO and NSF, have improved the linearity of training tasks, allowing for efficient scaling of AI models [8][11].
- The training recovery time for large AI clusters has been optimized to under 10 minutes, with advanced techniques enabling recovery times as low as 30 seconds [12][14].
- A three-tier fault tolerance scheme has been proposed to address reliability issues in large-scale inference architectures, minimizing user impact during hardware failures [16].

Group 3: Future Directions
- Huawei plans to explore new applications driven by diverse and complex scenarios, breakthroughs in heterogeneous integration, and the development of intelligent autonomous maintenance systems [18].
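
To see how the MTBF and recovery-time figures above relate to availability, the standard steady-state formula A = MTBF / (MTBF + MTTR) can be applied. This is a rough sketch under stated assumptions: it treats the reported recovery time as the full mean time to repair and assumes one recovery per failure. The article's own availability figure may be computed differently, and the function name is hypothetical.

```python
def steady_state_availability(mtbf_hours, mttr_seconds):
    """Textbook steady-state availability: MTBF / (MTBF + MTTR).

    Assumption for illustration: the recovery time is the entire repair
    time, and every failure is followed by exactly one recovery.
    """
    if mtbf_hours <= 0 or mttr_seconds < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    mttr_hours = mttr_seconds / 3600.0
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Figures from the summary: a 24-hour MTBF, with recovery taking either
# 10 minutes (600 s) or 30 seconds.
for label, mttr_s in [("10-minute recovery", 600), ("30-second recovery", 30)]:
    a = steady_state_availability(24.0, mttr_s)
    print(f"{label}: availability = {a:.4%}")
```

The comparison shows why shortening recovery matters as much as lengthening MTBF: at a fixed failure rate, every reduction in recovery time directly shrinks the fraction of time the cluster is unavailable.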