Atlas 800T A2
How Stable Are Ascend AI Computing Clusters? 98% Availability at 10,000-Card Scale, with Second-Level Fault Recovery
Yicai (第一财经) · 2025-06-10 11:25
Core Viewpoint - The article emphasizes the importance of high availability in AI computing clusters, likening them to a "digital engine" that must operate continuously without interruptions to support business innovation and efficiency [1][12].

Group 1: High Availability and Fault Management
- AI computing clusters face complex fault localization challenges due to their large scale and intricate technology stack, with current fault diagnosis taking from hours to days [2].
- Huawei's team has developed a comprehensive observability capability to enhance fault detection and management, which includes cluster operation views, alarm views, and network link monitoring [2][12].
- The average AI cluster experiences multiple faults daily, significantly impacting training efficiency and wasting computing resources [2].

Group 2: Reliability and Performance Enhancements
- Huawei's reliability analysis model aims to improve the mean time between failures (MTBF) for large-scale clusters to over 24 hours [3].
- The introduction of a multi-layer protection system and software fault tolerance solutions has achieved a fault tolerance rate of over 99% for optical modules [3].
- Training efficiency has been enhanced, with linearity metrics showing 96% for dense models and 95.05% for sparse models under specific configurations [6].

Group 3: Fast Recovery Mechanisms
- Huawei has implemented a multi-tiered fault recovery system that significantly reduces training recovery times to under 10 minutes, with process-level recovery achieving as low as 30 seconds [9][10].
- The introduction of instance-level recovery techniques has compressed recovery times to under 5 minutes, minimizing user impact during faults [10].

Group 4: Future Directions and Innovations
- Huawei's six innovative solutions for high availability include fault perception and diagnosis, fault management, and optical link fault tolerance, which have led to a cluster availability rate of 98% [12].
- Future explorations will focus on diverse application scenarios, heterogeneous integration, and intelligent autonomous maintenance to drive further innovations in AI computing clusters [12].
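
To make the reliability figures above easier to relate to each other, here is a minimal sketch using two textbook definitions: steady-state availability as MTBF / (MTBF + MTTR), and linearity as measured cluster throughput divided by ideal linear scaling. Both definitions, and the function names `steady_state_availability` and `linearity`, are assumptions for illustration. The summary does not say how Huawei computes its 98% availability or its linearity metrics, so the numbers below are not a reconstruction of those results.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Textbook availability: fraction of time the system is up.

    mtbf_hours: mean time between failures, in hours.
    mttr_hours: mean time to recover from one failure, in hours.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def linearity(cluster_throughput: float, unit_throughput: float, units: int) -> float:
    """Scaling efficiency: measured throughput over ideal linear scaling.

    This is one common definition; the article's own definition is not given.
    """
    if unit_throughput <= 0 or units <= 0:
        raise ValueError("unit throughput and unit count must be positive")
    return cluster_throughput / (unit_throughput * units)


if __name__ == "__main__":
    mtbf = 24.0  # hours, the MTBF target quoted in the summary
    # Recovery times quoted in the summary, converted to hours.
    recovery_times = {
        "training recovery (10 min)": 10 / 60,
        "instance-level recovery (5 min)": 5 / 60,
        "process-level recovery (30 s)": 30 / 3600,
    }
    for label, mttr in recovery_times.items():
        print(f"{label}: {steady_state_availability(mtbf, mttr):.4%}")

    # Hypothetical throughput numbers, chosen only to show the ratio.
    print(f"linearity: {linearity(960.0, 1.0, 1000):.2%}")
```

Under this simple model, shorter recovery times raise availability for a fixed MTBF, which is the trade-off the fast-recovery mechanisms target. A real cluster's availability also depends on factors the formula ignores, such as fault detection time and work lost since the last checkpoint.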
The Huawei Ascend Industry Chain
是说芯语· 2025-05-17 14:08
Core Viewpoint - The article discusses the growth and investment opportunities in the AI computing center market in China, particularly focusing on the Huawei Ascend ecosystem and its associated companies across four key areas: complete machines, power supply, cooling, and connectivity [2].

Group 1: Complete Machines
- The newly added computing power in 2024 is expected to reach approximately 20,000 PFlops, with the investment scale of China's intelligent computing center market projected to reach 288.6 billion yuan by 2028. In 2023, the market size was 87.9 billion yuan, showing a year-on-year growth of over 90% [3].
- As of August 2024, there are over 300 intelligent computing center projects in China, with a total announced computing power exceeding 500,000 PFlops. About one-third of these projects are planned to have a computing power greater than 500 PFlops, mainly funded by government or telecom operators [3].

Group 2: Power Supply
- AI servers utilize three power supply methods: external cabinets, racks, and trays. The power supply unit (PSU) converts high-voltage AC from the grid to 48V DC, which is then further converted to 12V for CPUs and 0.8V for GPUs [15].
- The GB200 NVL72 cabinet is equipped with 48 5.5kW PSUs, providing a total power of 132kW. The increasing power demand in AI servers is expected to expand the AI power supply market [16][21].

Group 3: Cooling
- The power consumption of single cabinets has increased from 4-6 kW in traditional computing centers to 20-40 kW or higher in intelligent computing centers. Liquid cooling technology is becoming the preferred choice due to its efficiency and low energy consumption [27].
- The market size for liquid cooling data centers in China was 8.63 billion yuan in 2023, with a growth rate of 26.2%, expected to reach 18.01 billion yuan by 2026 [29].

Group 4: Connectivity
- Backplane connectors are crucial for high-performance servers and communication devices, supporting high-speed data transmission and ensuring signal integrity [38].
- The Chinese communication connector market is projected to grow at a compound annual growth rate of 30%-35%, with expectations to exceed 60 billion yuan by 2025, where AI-related connectors will account for over 70% of the market [40].
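
The market and power figures above can be cross-checked with simple arithmetic. The sketch below computes the compound annual growth rate (CAGR) implied by a start and end value, and the nameplate power of a PSU bank. The helper names `cagr` and `nameplate_power_kw` are hypothetical. Treating the 2023 market size and the 2028 investment-scale projection as directly comparable endpoints is an assumption, since the summary does not confirm they measure the same quantity.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def nameplate_power_kw(psu_count: int, psu_rating_kw: float) -> float:
    """Sum of PSU ratings, ignoring redundancy and conversion losses."""
    return psu_count * psu_rating_kw


if __name__ == "__main__":
    # Intelligent computing center market, billion yuan (figures from the summary).
    # Assumption: the 2023 and 2028 figures are comparable endpoints.
    market = cagr(87.9, 288.6, 2028 - 2023)
    # Liquid cooling data center market, billion yuan (figures from the summary).
    cooling = cagr(8.63, 18.01, 2026 - 2023)
    print(f"computing center market implied CAGR: {market:.1%}")
    print(f"liquid cooling market implied CAGR:   {cooling:.1%}")

    # PSU bank as described in the summary: 48 units rated 5.5 kW each.
    installed = nameplate_power_kw(48, 5.5)
    print(f"PSU nameplate total: {installed:.0f} kW")
    print(f"ratio to the quoted 132 kW: {installed / 132:.1f}x")
```

Both implied growth rates land in the high 20s percent per year. The PSU nameplate total is 264 kW, twice the 132 kW quoted in the summary. One possible reading is that half the PSUs serve as redundant capacity, but the summary does not say so, and that interpretation is an assumption.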