Core Viewpoint
- The article argues that the AI industry is entering a new phase in which system architecture and communication efficiency matter more than raw chip performance. The shift is exemplified by Huawei's CloudMatrix384 super node, which targets the communication bottlenecks of AI data centers [1][4][80].

Group 1: AI Industry Trends
- AI competition has evolved from a race over single-chip performance to a broader contest over system architecture [2][80].
- The dominant bottleneck in today's AI data centers is communication overhead during distributed training, which sharply reduces effective computing efficiency [4][80].
- The fundamental question is how to eliminate the barriers between chips and build a seamless "computing highway" for AI workloads [5][80].

Group 2: Huawei's CloudMatrix384
- The CloudMatrix384 super node integrates 384 Ascend NPUs and 192 Kunpeng CPUs into a single high-performance AI infrastructure [5][11].
- Its architecture combines fully peer-to-peer, high-bandwidth interconnect with fine-grained resource disaggregation, pursuing the vision of "everything poolable, everything equal, everything combinable" [8][80].
- A new internal network, the "Unified Bus", lets processors communicate with one another directly at high speed, significantly improving efficiency [13][15].

Group 3: Technical Innovations
- CloudMatrix-Infer, a comprehensive LLM inference solution introduced alongside CloudMatrix384, codifies best practices for deploying large-scale MoE models [21][80].
- Its peer-to-peer inference architecture decomposes the LLM serving system into three independent subsystems, prefill, decode, and caching, which improves resource allocation and efficiency (see the pipeline sketch after this summary) [23][27].
- A large-scale expert parallelism (LEP) strategy optimizes MoE execution, enabling very high degrees of expert parallelism while minimizing execution delays (see the dispatch sketch after this summary) [28][33].

Group 4: Cost and Utilization Benefits
- For most enterprises, buying and operating a CloudMatrix384 outright carries significant risks and challenges, including high initial costs and ongoing operational expenses [44][46].
- Huawei Cloud instead offers CloudMatrix384 on a rental model, giving businesses access to top-tier AI computing power without the burden of ownership [45][60].
- The cloud model maximizes resource utilization through intelligent scheduling, for example a "daytime inference, nighttime training" pattern that keeps the hardware busy around the clock (see the scheduling sketch after this summary) [47][60].

Group 5: Performance Metrics
- Huawei Cloud deployed DeepSeek-R1, a large-scale MoE model, on CloudMatrix384 and reported strong throughput in both the prefill and decode stages [62][70].
- The system reached a throughput of 6,688 tokens per second during the prefill phase and sustained a decoding throughput of 1,943 tokens per second [66][69].
- The architecture supports dynamic adjustment of the throughput/latency trade-off to meet different service requirements (see the worked example after this summary) [73][80].
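To make the prefill/decode/caching decomposition concrete, here is a minimal, self-contained sketch of a disaggregated serving pipeline. All names here (KVCachePool, prefill, decode) are hypothetical illustrations; the article does not describe CloudMatrix-Infer's actual interfaces.

```python
# Minimal sketch of a disaggregated LLM serving pipeline, assuming three
# independently scaled subsystems (prefill, decode, caching). Names are
# illustrative, not Huawei's actual CloudMatrix-Infer API.

from dataclasses import dataclass, field


@dataclass
class KVCachePool:
    """Stand-in for a shared, disaggregated KV-cache store.

    In a peer-to-peer design, any prefill worker can write a request's
    KV cache here and any decode worker can read it, so the two stages
    scale independently.
    """
    entries: dict = field(default_factory=dict)

    def put(self, request_id: str, kv_cache: list) -> None:
        self.entries[request_id] = kv_cache

    def get(self, request_id: str) -> list:
        return self.entries[request_id]


def prefill(request_id: str, prompt: str, pool: KVCachePool) -> None:
    """Compute-bound stage: process the whole prompt once, emit KV cache."""
    kv_cache = [f"kv({tok})" for tok in prompt.split()]  # toy stand-in
    pool.put(request_id, kv_cache)


def decode(request_id: str, pool: KVCachePool, max_new_tokens: int) -> list:
    """Bandwidth-bound stage: generate tokens one at a time, reading and
    extending the KV cache that some prefill worker produced earlier."""
    kv_cache = pool.get(request_id)
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"  # a real system would run the model here
        kv_cache.append(f"kv({token})")
        output.append(token)
    return output


pool = KVCachePool()
prefill("req-1", "explain superposition in two sentences", pool)
print(decode("req-1", pool, max_new_tokens=4))
```

The point of the split is that the compute-bound prefill stage and the bandwidth-bound decode stage can be provisioned and scheduled independently, with the shared cache pool as the only coupling between them.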
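The LEP idea can be illustrated with a toy token-dispatch routine. The gating function, expert count, and top-k value below are placeholders: in an extreme expert-parallel layout each expert would sit on its own NPU, and the grouping step corresponds to the all-to-all exchange that the high-bandwidth interconnect is meant to keep cheap.

```python
# Toy sketch of expert-parallel MoE dispatch. Gating scores, expert
# functions, and counts are illustrative placeholders only.

import random
from collections import defaultdict

NUM_EXPERTS = 8    # under large-scale expert parallelism, one expert per device
TOP_K = 2          # each token is routed to its top-k experts


def gate(token: str) -> list:
    """Toy gating network: pick top-k expert ids pseudo-randomly.
    Same token maps to the same experts within a run; a real gate
    would score experts with a learned projection."""
    rng = random.Random(hash(token))
    return rng.sample(range(NUM_EXPERTS), TOP_K)


def dispatch(tokens: list) -> dict:
    """Group tokens by destination expert. In hardware this grouping is
    the all-to-all exchange whose latency the interconnect must hide;
    the higher the expert parallelism, the more this step dominates."""
    per_expert = defaultdict(list)
    for tok in tokens:
        for expert_id in gate(tok):
            per_expert[expert_id].append(tok)
    return per_expert


def expert_compute(expert_id: int, toks: list) -> list:
    """Placeholder expert FFN: tag each token with the expert that saw it."""
    return [f"{tok}@e{expert_id}" for tok in toks]


batch = ["the", "cat", "sat", "on", "the", "mat"]
for expert_id, toks in sorted(dispatch(batch).items()):
    print(expert_id, expert_compute(expert_id, toks))
```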
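As a hedged sketch, "daytime inference, nighttime training" could be expressed as a time-windowed partition of a fixed NPU pool. The window boundaries and capacity shares below are invented for illustration and are not Huawei Cloud's published scheduling policy.

```python
# Illustrative "daytime inference, nighttime training" partitioner.
# Window boundaries and shares are assumptions, not a real policy.

from datetime import datetime

TOTAL_NPUS = 384  # one CloudMatrix384 super node


def partition(now: datetime) -> dict:
    """Shift capacity toward latency-sensitive inference during business
    hours and toward throughput-oriented training overnight."""
    daytime = 8 <= now.hour < 20
    inference_share = 0.75 if daytime else 0.25  # assumed split
    inference = int(TOTAL_NPUS * inference_share)
    return {"inference": inference, "training": TOTAL_NPUS - inference}


for hour in (9, 14, 23, 3):
    now = datetime(2025, 7, 2, hour)
    print(f"{hour:02d}:00 -> {partition(now)}")
```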
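The decode-stage throughput/latency trade-off follows from simple arithmetic: aggregate decode throughput is roughly the number of concurrently decoding requests divided by the time per output token (TPOT). The batch and TPOT values below are assumptions chosen only to show that a figure around 1,900 tokens per second is consistent with, for example, a 50 ms TPOT at a concurrency near 97.

```python
# Worked example of the decode throughput/latency trade-off.
# Batch sizes and TPOT targets are hypothetical inputs.

def decode_throughput(batch_size: int, tpot_ms: float) -> float:
    """Tokens generated per second when `batch_size` requests each
    advance by one token every `tpot_ms` milliseconds."""
    return batch_size / (tpot_ms / 1000.0)


for batch_size, tpot_ms in [(97, 50.0), (48, 25.0), (200, 100.0)]:
    print(f"batch={batch_size:3d} TPOT={tpot_ms:5.1f} ms "
          f"-> {decode_throughput(batch_size, tpot_ms):7.1f} tok/s")
```

Relaxing the latency target lets the scheduler admit a larger concurrent batch and raise throughput, while a tighter TPOT target forces a smaller batch; this is the dial the architecture adjusts per service requirement.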
Huawei's CloudMatrix384 super node is powerful, but its "soul" lives in the cloud
机器之心 · 2025-07-02 11:02