Core Viewpoint
The article discusses the rapid advances in large language models (LLMs) and the challenges they face at inference time, particularly speed and energy efficiency. It highlights Huawei's hardware-software co-design work on optimizing model inference, focusing on three key technologies that improve inference speed and energy efficiency [2][4][11].

Group 1: Key Technologies
- AMLA recasts complex multiplications as addition operations, raising chip utilization to 71% and improving attention-operator performance by more than 30% (see the first sketch below) [4][5].
- Fusion operator optimization combines multiple operators into a single composite operator, increasing parallelism and eliminating redundant data movement, which yields substantial gains in model inference (see the second sketch below) [7][9].
- SMTurbo provides ultra-low-latency shared-memory semantics across 384 cards, achieving sub-microsecond access latency and improving memory-access throughput by more than 20% in cross-machine communication scenarios (see the third sketch below) [10][9].

Group 2: Future Developments
- Future work on AMLA will focus on optimizing the MLA operator for quantization scenarios, broadening its applicability [12].
- Fusion operator optimization will be extended to more model architectures, promoting efficient inference of large language models on Huawei's Ascend hardware [12].
- Load/Store optimization will balance read and write loads, targeting practical benefits at large batch sizes in DeepSeek dispatch and combine scenarios [12].
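The article does not describe AMLA's internals, but the headline's "replacing multiplication with addition" points at a well-known floating-point identity: multiplying a normal IEEE-754 value by 2^k is equivalent to adding k to its exponent bits, which is an integer addition. Below is a minimal Python sketch of that idea, assuming normal, finite inputs and no exponent overflow; it illustrates the general trick, not Huawei's actual Ascend kernel:

```python
import struct

def multiply_by_pow2_via_add(x: float, k: int) -> float:
    """Compute x * 2**k using integer addition on the exponent bits
    instead of a floating-point multiply.

    Illustrative only: assumes x is a normal, finite double and the
    result stays within the representable exponent range.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # reinterpret float as uint64
    bits += k << 52  # the 11-bit exponent field starts at bit 52
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# 3.5 * 2**3 == 28.0, computed with an integer add rather than an FP multiply
assert multiply_by_pow2_via_add(3.5, 3) == 28.0
assert multiply_by_pow2_via_add(10.0, -1) == 5.0
```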
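Operator fusion itself is a general technique: instead of launching matmul, bias-add, and activation as separate kernels that each write a full intermediate tensor to memory, a single composite kernel keeps data local. NumPy cannot truly fuse kernels, but the in-place variant below mimics the effect by eliminating the intermediate buffers; the function names are illustrative, not Huawei's operator names:

```python
import numpy as np

def unfused(x, w, b):
    # Three separate "operators": each step materializes a full intermediate.
    t1 = x @ w                   # matmul
    t2 = t1 + b                  # bias add (new buffer)
    return np.maximum(t2, 0.0)   # ReLU (another new buffer)

def fused(x, w, b):
    # One composite operator: bias add and ReLU are applied in place,
    # so only a single output buffer ever exists.
    out = x @ w
    np.add(out, b, out=out)
    np.maximum(out, 0.0, out=out)
    return out

x, w, b = np.random.randn(4, 8), np.random.randn(8, 3), np.random.randn(3)
assert np.allclose(unfused(x, w, b), fused(x, w, b))
```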
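The article gives no API details for SMTurbo, so the following is only a toy model of one idea behind load/store aggregation: buffer fine-grained remote stores per destination card and flush them as a single bulk transfer, amortizing per-message latency across many accesses. All names here (BatchedStoreQueue, send_bulk) are hypothetical; the real mechanism operates at the hardware and driver level:

```python
from collections import defaultdict

def send_bulk(dest_card, batch):
    # Stand-in for a single bulk RDMA-style write; a real system would
    # issue one network transfer here instead of len(batch) small ones.
    print(f"card {dest_card}: flushed {len(batch)} stores in one transfer")

class BatchedStoreQueue:
    """Toy model of store aggregation across cards (illustrative only)."""

    def __init__(self, flush_threshold=64):
        self.pending = defaultdict(list)  # dest card -> [(addr, value), ...]
        self.flush_threshold = flush_threshold

    def store(self, dest_card, addr, value):
        self.pending[dest_card].append((addr, value))
        if len(self.pending[dest_card]) >= self.flush_threshold:
            self.flush(dest_card)

    def flush(self, dest_card):
        batch = self.pending.pop(dest_card, [])
        if batch:
            send_bulk(dest_card, batch)

# Eight fine-grained stores become two bulk transfers of four stores each.
q = BatchedStoreQueue(flush_threshold=4)
for i in range(8):
    q.store(dest_card=0, addr=i * 8, value=i)
```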
Replacing multiplication with addition? Huawei's mathematicians step in: high-performance design and optimization of Ascend operators, with a 30% performance boost!
机器之心 · 2025-05-23 04:17