Fusion Operator Optimization

Can Huawei's Three "Black Tech" Breakthroughs Disrupt AI Computing?
虎嗅APP · 2025-05-23 11:47
Core Viewpoint
- The article discusses the challenges faced by Chinese companies in the large-model AI sector, particularly the MoE architecture's inherent inefficiencies and high hardware costs. It highlights Huawei's approach to improving efficiency and user experience through its Ascend-based optimization of DeepSeek inference, aiming to build a sustainable, collaborative AI ecosystem [1].

Group 1: Huawei's Technological Innovations
- Huawei has introduced three hardware-affinity operator technologies: AMLA, fusion operator optimization, and SMTurbo, which aim to substantially improve the speed and energy efficiency of large-model inference [4][5].
- AMLA (Ascend MLA) redefines floating-point computation, transforming complex multiplications into simpler additions and achieving chip utilization above 70%, thereby improving computational efficiency [7][9].
- Fusion operator optimization merges multiple operators into a single composite operator, improving parallelism and eliminating redundant data transfers, which yields significant performance gains in model inference [11][12].

Group 2: Performance Enhancements
- SMTurbo enables ultra-low-latency memory access across 384 cards, improving per-thread memory throughput by more than 20% in cross-machine memory-communication scenarios [14][16].
- Together, these technologies position Huawei's Ascend-based DeepSeek deployment as a competitive alternative to existing solutions, potentially surpassing Nvidia in inference performance [20][22].

Group 3: Future Development Directions
- Future AMLA research will focus on optimizing MLA operators for KVCache-quantization and fully quantized scenarios, broadening the operators' applicability [17].
- Exploration of fusion operator optimization will continue, aiming to further improve the efficiency of large language models on Ascend hardware [17].
- Load/Store optimization will be refined to balance read and write loads, integrating the CPP concept into DeepSeek dispatch and combine scenarios for practical benefits at large batch sizes [17].
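The summary describes AMLA as turning floating-point multiplication into simpler addition. One well-known bit-level principle behind this family of tricks is that multiplying an IEEE-754 number by a power of two reduces to an integer addition on its exponent bits. The sketch below illustrates only that general principle, not Huawei's actual AMLA kernel; the function name and scope are illustrative assumptions.

```python
import struct

def scale_by_pow2(x: float, k: int) -> float:
    """Multiply x by 2**k via integer addition on the exponent bits.

    Illustrates the general idea of replacing a floating-point
    multiply with an integer add on the raw bit pattern. This is a
    conceptual sketch, NOT Huawei's AMLA implementation, and it
    assumes x is a normal number and the result does not overflow
    or underflow the exponent range.
    """
    # Reinterpret the float64 as a raw 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # The binary64 exponent field starts at bit 52, so adding
    # k << 52 bumps the unbiased exponent by k, i.e. scales by 2**k.
    bits += k << 52
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

print(scale_by_pow2(3.0, 4))   # 3.0 * 2**4 = 48.0
print(scale_by_pow2(1.0, -1))  # 1.0 * 2**-1 = 0.5
```

On real hardware, the payoff is that an integer adder is far cheaper in area and energy than a floating-point multiplier, which is consistent with the article's claim of higher chip utilization.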
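The fusion-operator idea above can be sketched in miniature: an unfused chain materializes a full intermediate buffer after every operator (standing in for a round trip through device memory), while the fused composite computes the same result in a single pass. This is a conceptual Python illustration with a hypothetical scale-add-ReLU chain, not an Ascend kernel.

```python
def unfused_scale_add_relu(x, s, b):
    # Three separate "operators": each writes a complete
    # intermediate list, modeling redundant memory transfers
    # between kernels.
    t1 = [v * s for v in x]            # scale
    t2 = [v + b for v in t1]           # bias add
    return [max(v, 0.0) for v in t2]   # ReLU

def fused_scale_add_relu(x, s, b):
    # One composite operator: a single pass over the data with
    # no intermediate buffers, the essence of operator fusion.
    return [max(v * s + b, 0.0) for v in x]

x = [1.0, -2.0, 3.0]
print(unfused_scale_add_relu(x, 2.0, 1.0))  # [3.0, 0.0, 7.0]
print(fused_scale_add_relu(x, 2.0, 1.0))    # [3.0, 0.0, 7.0]
```

In a real compiler or hand-written kernel, fusion keeps intermediates in registers or on-chip buffers, which is where the bandwidth savings the article cites come from.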