Ascend Neural Processing Unit (NPU) P910C

Huawei CloudMatrix 384 vs. Nvidia NVL72
半导体行业观察 · 2025-07-30 02:18
Core Viewpoint
- Nvidia has been authorized to resume exports of its H20 GPU to China, but Huawei's CloudMatrix 384 system, showcased at the World Artificial Intelligence Conference, presents a formidable alternative with superior specifications [3][4].

Summary by Sections

Nvidia H20 GPU and Huawei's CloudMatrix 384
- Nvidia's H20 GPU may be in sufficient supply, but operators in China now have stronger alternatives, particularly Huawei's CloudMatrix 384 system, which is built around the Ascend P910C NPU [3].
- The Ascend P910C promises more than twice the floating-point performance of the H20 and a larger memory capacity, despite being slower in other respects [3][6].

Technical Specifications of Ascend P910C
- Each Ascend P910C accelerator pairs two compute dies, delivering a combined 752 teraFLOPS of dense FP16/BF16 performance backed by 128GB of high-bandwidth memory [4].
- The CloudMatrix 384 system is significantly larger than Nvidia's rack-scale systems, scaling up to 384 NPUs in a single domain versus Nvidia's maximum of 72 GPUs [11][9].

Performance Comparison
- The Ascend P910C offers greater memory capacity and floating-point performance than Nvidia's H20, with 128GB of HBM against the H20's 96GB [6].
- Huawei's CloudMatrix architecture can support training clusters of up to 165,000 NPUs, underscoring its scalability [11].

Inference Performance
- Huawei's CloudMatrix-Infer platform raises inference throughput, allowing each NPU to process 6,688 input tokens per second and outperforming Nvidia's H800 in per-chip efficiency [14].
- The architecture provides high-bandwidth, unified access to cached data, improving task scheduling and cache efficiency [13].

Power, Density, and Cost
- The estimated total power consumption of the CloudMatrix 384 system is around 600 kW, far higher than the roughly 120 kW of Nvidia's NVL72 [15].
- Huawei's CloudMatrix 384 is estimated to cost around $8.2 million, versus roughly $3.5 million for Nvidia's NVL72, raising questions about deployment and operating costs [16] (see the rough comparison sketched below).

Market Dynamics
- Nvidia has reportedly ordered an additional 300,000 H20 chips from TSMC to meet strong demand from Chinese customers, underscoring continued competition in the AI accelerator market [17].
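As a back-of-the-envelope illustration of the rack-level trade-offs described above, the Python sketch below tabulates only the figures quoted in this summary and derives simple ratios (aggregate dense FP16/BF16 compute, compute per kW, and cost per petaFLOP). The dictionary layout and field names are illustrative assumptions, not vendor datasheet terms, and the per-accelerator compute and memory figures for the NVL72 are not given in the article, so they are deliberately left unset.

```python
# Rough comparison of the rack-level figures cited in this summary.
# Derived ratios are plain arithmetic, not vendor-published benchmarks.

SYSTEMS = {
    "Huawei CloudMatrix 384": {
        "accelerators": 384,          # Ascend P910C NPUs per system
        "tflops_per_accel": 752,      # dense FP16/BF16 teraFLOPS per NPU
        "hbm_gb_per_accel": 128,      # HBM capacity per NPU (GB)
        "power_kw": 600,              # estimated total system power
        "cost_musd": 8.2,             # estimated system cost, millions USD
    },
    "Nvidia NVL72": {
        "accelerators": 72,           # GPUs per NVL72 rack
        "tflops_per_accel": None,     # not stated in the article
        "hbm_gb_per_accel": None,     # not stated in the article
        "power_kw": 120,              # estimated total system power
        "cost_musd": 3.5,             # estimated system cost, millions USD
    },
}

def summarize(name: str, spec: dict) -> None:
    """Print aggregate compute, memory, and power/cost ratios where known."""
    n = spec["accelerators"]
    print(f"{name}: {n} accelerators")
    if spec["tflops_per_accel"] is not None:
        agg_pflops = n * spec["tflops_per_accel"] / 1000  # petaFLOPS
        print(f"  aggregate dense FP16/BF16: {agg_pflops:.1f} PFLOPS")
        print(f"  PFLOPS per kW: {agg_pflops / spec['power_kw']:.3f}")
        print(f"  $M per PFLOPS: {spec['cost_musd'] / agg_pflops:.3f}")
    if spec["hbm_gb_per_accel"] is not None:
        print(f"  aggregate HBM: {n * spec['hbm_gb_per_accel'] / 1024:.1f} TB")
    print(f"  power: {spec['power_kw']} kW, cost: ${spec['cost_musd']}M")

for name, spec in SYSTEMS.items():
    summarize(name, spec)
```

On the cited figures alone, the CloudMatrix 384 trades roughly five times the power draw and more than twice the price of an NVL72 rack for a much larger scale-up domain; a complete efficiency comparison would also require the NVL72's per-GPU compute and memory numbers, which this summary does not provide.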