Compute and Chip as One! Huawei Discloses Design Details of Core Large-Model Operators for the Ascend Platform

Core Viewpoint
- Huawei's new operator technologies push compute utilization on Ascend hardware above 70% and cut inter-card latency to sub-microsecond levels [1][3].

Group 1: Operator Technologies
- Operators are described as the "atomic tools" of large-model computation, comparable to Lego bricks: they underlie every core operation, from basic arithmetic to feature extraction [2].
- Huawei released three operator-optimization technologies: AMLA, fusion operator optimization, and SMTurbo, which it presents as the ultimate form of operator optimization [2][3].

Group 2: AMLA Technology
- AMLA (Ascend MLA) reinterprets floating-point operations so that costly multiplications are replaced by additions, raising chip utilization to as much as 71% [4][6].
- For the Attention operator, the algorithm averages 55% compute utilization, ahead of previously reported results [6].

Group 3: Fusion Operator Optimization
- This technology merges multiple operators into a single composite operator and orchestrates hardware resources so that computation and communication proceed without stalls [8][9].
- It improves performance by eliminating redundant data transfers and by restructuring computation flows through mathematically equivalent rewrites [9].

Group 4: SMTurbo Technology
- SMTurbo provides ultra-low-latency memory access across 384 cards using native Load/Store semantics, bringing cross-card access into the sub-microsecond range [11][12].
- In cross-machine memory communication scenarios, it raises per-thread memory throughput by more than 20% [12].

Group 5: Future Outlook
- Future work on AMLA will focus on optimizing MLA operators for KVCache quantization and on expanding the range of application scenarios [14].
- Fusion operator optimization will be explored on more model architectures to support efficient inference of large language models on Ascend hardware [14].
- SMTurbo's Load/Store optimization will balance read and write loads and add sophisticated pipelining to deliver practical gains at large batch sizes [14].
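
The AMLA item in Group 2 speaks of reinterpreting floating-point operations so that multiplication becomes addition. The following is a minimal sketch of one well-known instance of that idea, not Huawei's implementation: in the IEEE 754 float32 layout, multiplying a value by a power of two is the same as adding an integer to its exponent bits. The function names (`float_to_bits`, `bits_to_float`, `scale_by_pow2`) are hypothetical and chosen only for illustration.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret a float32 value as its 32-bit unsigned integer pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def scale_by_pow2(x: float, n: int) -> float:
    """Compute x * 2**n with one integer addition on the exponent field.

    Assumption: x is a normal, non-zero float32 and the result is also a
    normal float32. Zero, subnormals, inf and NaN are rejected.
    """
    bits = float_to_bits(x)
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent
    if exponent == 0 or exponent == 0xFF:
        raise ValueError("zero, subnormal, inf and NaN are not handled")
    if not 0 < exponent + n < 0xFF:
        raise ValueError("result would leave the normal float32 range")
    # The exponent field starts at bit 23, so adding n << 23 adds n to it
    # and leaves the sign and mantissa bits untouched.
    return bits_to_float(bits + (n << 23))


if __name__ == "__main__":
    for value, shift in [(1.5, 3), (-3.0, -1), (0.15625, 4)]:
        print(value, shift, scale_by_pow2(value, shift), value * 2.0 ** shift)
```

The trick only covers scaling by powers of two. The summary does not say how AMLA treats general scaling factors, so that step is left out of the sketch.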