MindIE Motor

From "Stacked Building Blocks" to an "Organic Life Form": The Ascend Supernode Redefines AI Computing Architecture
Huan Qiu Wang · 2025-05-26 10:06
Core Insights

- The rapid growth of large AI models is driving a new era of computing-power demand, highlighting the limitations of traditional cluster architectures in efficiently training these models [1][2]
- Traditional architectures face significant challenges, including communication bottlenecks, inefficient resource allocation, and reliability issues, which hinder the training efficiency of large models [2][3]

Summary by Sections

Challenges in Traditional Architectures

- Communication bottlenecks have worsened exponentially: MoE models increase inter-node communication demands, leading to delays of over 2ms on traditional 400G networks [1][2]
- Resource allocation is static and unable to adapt to dynamic changes in model structure, and the resulting uneven load distribution reduces overall training efficiency by 30% [1][2]
- Reliability is compromised as the probability of node failure increases with scale, causing significant resource waste during lengthy recovery processes, with some companies losing over a million dollars per training interruption [2]

Emergence of the Ascend Supernode Architecture

- The Ascend Supernode architecture represents a fundamental restructuring of computing-power systems, characterized by a "three-dimensional integration" approach [3][5]
- A breakthrough in hardware interconnectivity allows multiple NPUs to work as a single computer, increasing inter-node communication bandwidth 15-fold and reducing latency from 2ms to 0.2ms [3][5]
- Unified global memory addressing through virtualization enables direct memory access across nodes, improving the efficiency of parameter synchronization during model training [5][6]

Innovations in Resource Management and Reliability

- Intelligent resource scheduling allows fine-grained dynamic task allocation based on the MoE model structure, improving the compute-to-communication time ratio from 1:1 to 3:1 [5][6]
- System reliability has improved significantly, with average uptime increasing from hours to days and recovery times reduced from hours to 15 minutes [5][6]

Industry Impact and Future Prospects

- The Ascend Supernode architecture has achieved a threefold increase in training performance compared to traditional nodes, establishing a new benchmark in AI computing [8]
- The introduction of MindIE Motor enhances large-scale expert-parallel capabilities, achieving four times the throughput of traditional server stacks [8]
- Huawei's commitment to architectural innovation is presented as a new form of Moore's Law, positioning the company as a leader in the AI computing landscape [9]
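To make the compute-to-communication ratio claim concrete, here is a minimal back-of-envelope sketch. The 1:1 and 3:1 ratios are taken from the article; the utilization model itself is a simplifying assumption of ours (communication is treated as fully non-overlapped with compute, which real training frameworks partially hide):

```python
# Back-of-envelope: how the compute-to-communication time ratio
# translates into compute utilization during training.
# Assumption (ours, not the article's): communication does not
# overlap with compute, so utilization = compute / (compute + comm).

def compute_utilization(compute_share: float, comm_share: float) -> float:
    """Fraction of wall-clock time spent on useful compute."""
    return compute_share / (compute_share + comm_share)

# Traditional architecture: compute : communication = 1 : 1
before = compute_utilization(1, 1)   # 50% of time computing

# Supernode scheduling: compute : communication = 3 : 1
after = compute_utilization(3, 1)    # 75% of time computing

print(f"utilization before: {before:.0%}, after: {after:.0%}")
print(f"speedup from the ratio change alone: {after / before:.2f}x")
```

Under this simplified model, the ratio improvement alone accounts for a 1.5x speedup; the article's overall threefold training gain would then also draw on the bandwidth and latency improvements (15x bandwidth, 2ms to 0.2ms) cited above.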