How Stable Are Ascend AI Compute Clusters? 98% Availability at 10,000-Card Scale, with Fault Recovery Within Seconds
Yicai (第一财经) · 2025-06-10 11:25
Core Viewpoint - The article emphasizes the importance of high availability in AI computing clusters, likening them to a "digital engine" that must operate continuously without interruptions to support business innovation and efficiency [1][12].

Group 1: High Availability and Fault Management
- AI computing clusters face complex fault localization challenges due to their large scale and intricate technology stack, with current fault diagnosis taking from hours to days [2].
- Huawei's team has developed a comprehensive observability capability to enhance fault detection and management, which includes cluster operation views, alarm views, and network link monitoring [2][12].
- The average AI cluster experiences multiple faults daily, significantly impacting training efficiency and wasting computing resources [2].

Group 2: Reliability and Performance Enhancements
- Huawei's reliability analysis model aims to improve the mean time between failures (MTBF) for large-scale clusters to over 24 hours [3].
- The introduction of a multi-layer protection system and software fault tolerance solutions has achieved a fault tolerance rate of over 99% for optical modules [3].
- Training efficiency has been enhanced, with linearity metrics showing 96% for dense models and 95.05% for sparse models under specific configurations [6].

Group 3: Fast Recovery Mechanisms
- Huawei has implemented a multi-tiered fault recovery system that significantly reduces training recovery times to under 10 minutes, with process-level recovery achieving as low as 30 seconds [9][10].
- The introduction of instance-level recovery techniques has compressed recovery times to under 5 minutes, minimizing user impact during faults [10].

Group 4: Future Directions and Innovations
- Huawei's six innovative solutions for high availability include fault perception and diagnosis, fault management, and optical link fault tolerance, which have led to a cluster availability rate of 98% [12].
- Future explorations will focus on diverse application scenarios, heterogeneous integration, and intelligent autonomous maintenance to drive further innovations in AI computing clusters [12].
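
The MTBF, recovery-time, and linearity figures above can be related through two textbook formulas. The Python sketch below is illustrative only: the article does not say how its availability or linearity metrics are defined, so the steady-state formula MTBF / (MTBF + MTTR) and the throughput-ratio definition of linearity are assumptions, and the function names and throughput numbers are hypothetical. In particular, the computed availability is not expected to reproduce the reported 98%, which may account for downtime this simple model ignores.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Assumed textbook definition: fraction of time up = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def scaling_linearity(base_throughput: float, base_devices: int,
                      scaled_throughput: float, scaled_devices: int) -> float:
    """Assumed definition: measured throughput divided by ideal linear throughput."""
    if base_throughput <= 0 or base_devices <= 0 or scaled_devices <= 0:
        raise ValueError("throughput and device counts must be positive")
    ideal_throughput = base_throughput * scaled_devices / base_devices
    return scaled_throughput / ideal_throughput


# Inputs borrowed from the summary for illustration: 24 h MTBF, 10 min recovery.
availability = steady_state_availability(mtbf_hours=24.0, mttr_hours=10 / 60)
print(f"availability under the assumed model: {availability:.2%}")

# Hypothetical throughput numbers (samples/s), not taken from the article.
linearity = scaling_linearity(base_throughput=100.0, base_devices=8,
                              scaled_throughput=1152.0, scaled_devices=96)
print(f"linearity under the assumed definition: {linearity:.2%}")
```

Under this model, shortening recovery time raises availability for a fixed MTBF, which is why the summary treats longer MTBF and faster recovery as complementary levers.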