Huawei unveils its big move for "near-trillion-parameter MoE inference," open-sourcing two killer optimization techniques
机器之心 (Machine Heart) · 2025-11-28 04:11
Machine Heart report. Editor: Du Wei. With 2025 drawing to a close, large models have this year accelerated their shift from point tools for productivity gains into underlying infrastructure supporting business systems. Along the way, inference efficiency has come to determine whether large models can truly be deployed. For ultra-large-scale MoE models, the complex inference pipeline poses challenges in compute, communication, and memory access, and the industry urgently needs an efficient, controllable inference path. Huawei has now laid out a complete technology stack for near-trillion-parameter MoE inference: openPangu-Ultra-MoE-718B-V1.1 demonstrates what the MoE architecture can deliver, while Ascend-affine acceleration techniques, including the Omni Proxy scheduling feature and the AMLA technique that pushes Ascend hardware compute utilization to 86%, make production-grade deployment of ultra-large MoE models realistically feasible. Open-source implementation: https://gitcode.com/ascend-tribe/ascend-inference-cluster# If the focus of large-model competition over the past few years was training scale and capability breakthroughs, then today inference efficiency is fast becoming the key variable in whether a model can land. Model GitCode address: https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1-Int8 In terms of task characteristics, ...
Three "black tech" breakthroughs from Huawei: set to disrupt AI computing?
虎嗅APP (Huxiu) · 2025-05-23 11:47
Core Viewpoint
- The article examines the challenges facing Chinese companies in the large-model AI sector, particularly the inherent inefficiencies of the MoE architecture and high hardware costs. It highlights Huawei's approach to improving efficiency and user experience when serving DeepSeek models on its Ascend hardware, with the aim of building a sustainable collaborative ecosystem in the AI industry [1].

Group 1: Huawei's Technological Innovations
- Huawei has introduced three hardware-affinity operator technologies: AMLA, Fusion Operator Optimization, and SMTurbo, which together target major gains in the speed and energy efficiency of large-model inference [4][5].
- AMLA (Ascend MLA) reworks floating-point arithmetic, pushing chip utilization above 70% by converting costly multiplications into cheaper additions, thereby raising computational efficiency (a toy sketch of the exponent-addition idea appears after this digest) [7][9].
- Fusion Operator Optimization merges multiple operators into a single composite kernel, improving parallelism and eliminating redundant data transfers between operators, which yields significant performance gains in model inference (see the fusion sketch below) [11][12].

Group 2: Performance Enhancements
- SMTurbo enables ultra-low-latency memory access across 384 cards, improving per-thread memory-access throughput by over 20% in cross-machine memory-communication scenarios [14][16].
- Together, these technologies position DeepSeek inference on Huawei's Ascend platform as a competitive alternative to existing solutions, potentially outperforming Nvidia-based deployments in inference performance [20][22].

Group 3: Future Development Directions
- Future work on AMLA will focus on optimizing the MLA operator for KVCache-quantized and fully quantized scenarios, broadening where the operator applies (a generic int8 KV-cache sketch is included below) [17].
- Exploration of fusion operator optimization will continue, aiming to further raise the efficiency of large language models on Ascend hardware [17].
- Load/Store optimization will be refined to balance read and write loads, integrating the CPP concept into DeepSeek dispatch-and-combine scenarios for practical benefits at large batch sizes [17].
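The digest's description of AMLA, trading floating-point multiplications for additions, evokes a classic bit-level trick: multiplying a float by a power of two reduces to an integer addition on its IEEE-754 exponent field. The Python sketch below is only a hedged illustration of that general idea, not Huawei's actual AMLA kernel; the function name and the restriction to powers of two are assumptions made for the example.

```python
import struct

def mul_pow2_via_exponent_add(x: float, k: int) -> float:
    """Multiply x by 2**k using an integer addition on the exponent
    bits of x's IEEE-754 float64 representation, instead of a
    floating-point multiplication. Toy illustration only: assumes x
    is a normal, finite float64 and that the result does not
    overflow or underflow the exponent range."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits += k << 52  # the exponent field occupies bits 52..62 of a float64
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Scaling by a power of two is exact, so the comparison holds bit-for-bit.
assert mul_pow2_via_exponent_add(3.14, 5) == 3.14 * 2**5
```

The appeal of the general pattern is that integer adds are far cheaper in silicon than floating-point multiplies, which is consistent with the utilization gains the article attributes to AMLA.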
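The Fusion Operator Optimization bullet above describes merging several operators into one kernel so that intermediates never round-trip through memory. The NumPy sketch below is an analogy under that framing (the scale-bias-ReLU chain and all names are illustrative, not Huawei's operators): the "fused" version reuses one output buffer in place of three separately materialized intermediates, which is the memory-traffic saving that real kernel fusion delivers on-chip.

```python
import numpy as np

x = np.random.randn(4096, 1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)

def unfused(x, b):
    # Three separate "kernels"; each writes a full intermediate
    # array back to memory before the next kernel reads it.
    t1 = x * 0.5                  # kernel 1: scale
    t2 = t1 + b                   # kernel 2: bias add
    return np.maximum(t2, 0.0)    # kernel 3: ReLU

def fused(x, b):
    # One pass over a single buffer: intermediates never leave it,
    # mimicking a fused kernel that keeps values in registers.
    out = x * 0.5
    out += b
    np.maximum(out, 0.0, out=out)
    return out

assert np.allclose(unfused(x, b), fused(x, b))
```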
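The KVCache-quantization direction in Group 3 is stated without detail. As background, a common baseline for this is symmetric per-head int8 quantization of the cached keys and values; the sketch below shows that generic baseline only (the shapes, function names, and per-head scaling choice are assumptions, not the AMLA design).

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Symmetric per-head int8 quantization of a KV-cache slice.
    kv: float32 array of shape (seq_len, num_heads, head_dim).
    Returns int8 codes plus per-head float32 scales."""
    amax = np.abs(kv).max(axis=(0, 2), keepdims=True)   # per-head abs-max
    scale = np.where(amax > 0, amax / 127.0, 1.0)
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Usage: the int8 codes halve (vs. fp16) or quarter (vs. fp32) the
# cache footprint at the cost of a small, bounded rounding error.
kv = np.random.randn(128, 8, 64).astype(np.float32)
q, s = quantize_kv_int8(kv)
max_err = np.abs(dequantize_kv_int8(q, s) - kv).max()
```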