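Moore Threads Unveils Next-Generation GPU Architecture "Huagang": Domestic GPUs Achieve Dual Breakthroughs in Ten-Thousand-Card Training and Inference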

Core Viewpoint
- At the first MUSA Developer Conference (MDC 2025) in Beijing, Moore Threads launched the "Huagang" architecture and a ten-thousand-card ("Wanka") training cluster, which the article presents as the start of a new era of self-reliant computing power in the high-end GPU sector [3].

Group 1: Key Achievements
- The "Huagang" architecture was unveiled. It supports computation at every precision from FP4 to FP64, with a 50% increase in computing density and a 10-fold improvement in efficiency. Planned chips based on this architecture include the "Huashan" chip for AI training and inference and the "Lushan" chip for high-performance graphics rendering [6][11].
- The ten-thousand-card intelligent computing cluster was introduced. It is reported to support training of trillion-parameter models and to reach mainstream international levels on several key precision metrics [7].
- In collaboration with SiliconFlow, a single MTT S5000 card achieved a Prefill throughput of over 4000 tokens/s and a Decode throughput of over 1000 tokens/s, setting a new benchmark for domestic inference performance [7].
- The MTT C256 super node architecture was presented. It is a high-density hardware design aimed at maximum intelligent computing performance [8].
- The MTT AIBOOK, an AI computing notebook built on the intelligent SoC chip "Changjiang," was officially launched to serve the 200,000 developers and learners of "Moore Academy" [9].

Group 2: Technological Innovations
- The "Huagang" architecture brings significant improvements in computing performance, energy efficiency, precision support, interconnect capability, and graphics technology. It is backed by a portfolio of 514 granted patents as of June 30, 2025 [11].
- The architecture integrates a new asynchronous programming model and, through the self-developed MTLink high-speed interconnect technology, supports large-scale intelligent computing clusters of more than 100,000 cards [11].
- The architecture includes an AI generative rendering framework and enhanced hardware ray tracing acceleration, and fully supports DirectX 12 Ultimate, allowing graphics rendering and intelligent computing to work closely together [11].
- The chip roadmap includes the "Huashan" chip, which targets AI training and inference and is intended to provide stable, efficient computing power for large-scale intelligent computing clusters, and the "Lushan" chip, which significantly improves graphics performance, including a 64-fold increase in AI computing performance and a 50-fold increase in ray tracing performance [14][16].