Operator Optimization

Compute and Chip as One! Huawei Discloses Core Operator Design Details for Large Models on the Ascend Platform
雷峰网 · 2025-05-23 10:01
Core Viewpoint
- Huawei's newly disclosed operator technologies redefine hardware performance, pushing compute utilization above 70% and reducing inter-card latency to sub-microsecond levels [1][3].

Group 1: Operator Technologies
- The operator is the "atomic tool" of AI large-model computation, akin to a Lego brick: it underlies core operations from basic arithmetic to feature extraction [2].
- Huawei released three key operator-optimization technologies, AMLA, fusion operator optimization, and SMTurbo, which it presents as the ultimate form of operator optimization [2][3].

Group 2: AMLA Technology
- AMLA (Ascend MLA) reinterprets floating-point numbers at the bit level, converting complex multiplications into additions and raising chip utilization to 71% (see the illustrative sketch after this summary) [4][6].
- The algorithm achieves an average utilization of 55% for the Attention operator, surpassing previous results [6].

Group 3: Fusion Operator Optimization
- This technology merges multiple operators into a single composite operator, orchestrating hardware resources so that computation and communication proceed seamlessly [8][9].
- It improves performance by eliminating redundant data transfers and restructuring computation flows through mathematical equivalences [9].

Group 4: SMTurbo Technology
- SMTurbo enables ultra-low-latency memory access across 384 cards, entering the sub-microsecond era with native Load/Store semantics [11][12].
- In cross-machine shared-memory communication scenarios, it improves per-thread memory-access throughput by over 20% [12].

Group 5: Future Outlook
- Future AMLA work will focus on optimizing MLA operators for KVCache quantization scenarios and expanding application scenarios [14].
- Fusion operator optimization will be explored in more model architectures to enable efficient inference of large language models on Ascend hardware [14].
- Load/Store optimization will balance read/write loads and implement finer-grained pipelining to deliver practical gains at large batch sizes [14].
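The article does not disclose how AMLA actually rewrites multiplications as additions, but a well-known IEEE 754 property gives a flavor of the idea: adding k to a float's exponent field multiplies the value by 2^k, so certain multiplications can be carried out as integer additions on the bit pattern. The Python sketch below is an illustration of that general principle only, not Huawei's algorithm; the function name and the restriction to power-of-two factors are hypothetical.

```python
import struct

def multiply_by_pow2_via_add(x: float, k: int) -> float:
    """Multiply float x by 2**k using integer addition only.

    In IEEE 754 binary64 the exponent occupies bits 52..62, so adding
    k << 52 to the raw bit pattern increments the exponent by k, i.e.
    multiplies the value by 2**k (ignoring subnormals and overflow).
    Hypothetical sketch of "addition instead of multiplication";
    not Huawei's AMLA implementation.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))  # float -> raw bits
    bits += k << 52                                      # integer add on the exponent field
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))  # raw bits -> float
    return y

assert multiply_by_pow2_via_add(3.5, 4) == 3.5 * 16    # 56.0
assert multiply_by_pow2_via_add(1.25, -2) == 0.3125
```

On real hardware the payoff would come from an adder being cheaper than a multiplier; the sketch only demonstrates the numerical equivalence.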
Addition Instead of Multiplication? Huawei's Mathematicians Step In: High-Powered Design and Optimization of Ascend Operators Boosts Performance by 30%!
机器之心 · 2025-05-23 04:17
Core Viewpoint
- The article discusses the rapid advancement of large language models (LLMs) and the challenges they face in inference, particularly speed and energy efficiency. It highlights Huawei's hardware-software co-design approach, centered on three key technologies that improve inference speed and energy efficiency [2][4][11].

Group 1: Key Technologies
- AMLA transforms complex multiplications into additions, raising chip utilization to 71% and improving attention-operator performance by over 30% [4][5].
- Fusion operator optimization merges multiple operators into a single composite operator, enhancing parallelism and reducing redundant data movement, which yields substantial gains in model inference (a generic sketch of the idea follows this summary) [7][9].
- SMTurbo enables ultra-low-latency memory sharing across 384 cards, achieving sub-microsecond delays and raising memory-access throughput by over 20% in cross-machine communication scenarios [10][9].

Group 2: Future Developments
- Future AMLA research will focus on optimizing the MLA operator for quantization scenarios, expanding its range of application [12].
- Fusion operator optimization will be explored across more model architectures, promoting efficient inference of large language models on Huawei's Ascend hardware [12].
- Load/Store optimization will balance read and write loads, targeting practical benefits at large batch sizes in DeepSeek dispatch and combine scenarios [12].
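Neither article details which operators Huawei fuses. The sketch below illustrates the generic technique the summaries describe: collapsing a chain of standalone operators, each of which materializes a full intermediate array in memory, into one composite kernel that applies the follow-on steps while each tile is still in cache. The matmul + bias + ReLU chain is an assumed, illustrative choice of operators, and NumPy stands in for an Ascend kernel.

```python
import numpy as np

def unfused(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Three standalone operators: each writes a full intermediate array."""
    t1 = x @ w                    # matmul   -> intermediate hits memory
    t2 = t1 + b                   # bias add -> another intermediate
    return np.maximum(t2, 0.0)    # ReLU     -> final output

def fused(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One composite operator: computes the matmul in row blocks and
    applies bias + ReLU on each block while it is still hot in cache,
    skipping the memory round trips for t1 and t2."""
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    block = 128
    for i in range(0, x.shape[0], block):
        tile = x[i:i + block] @ w                      # compute one tile
        out[i:i + block] = np.maximum(tile + b, 0.0)   # fused epilogue
    return out

x = np.random.rand(512, 256).astype(np.float32)
w = np.random.rand(256, 64).astype(np.float32)
b = np.random.rand(64).astype(np.float32)
assert np.allclose(unfused(x, w, b), fused(x, w, b), atol=1e-5)
```

The fused version does the same arithmetic; the gain comes purely from removing redundant data movement, which is the effect both articles attribute to Huawei's composite operators.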
Talking DeepSeek Open Source Week with a Post-00s Open-Source Contributor: Always Open-Sourcing the Strongest Models May Mean Not Wanting to Make Money, or Wanting to Push for Bigger Change | Open Source Dialogue #2
晚点LatePost · 2025-02-27 14:03
"当 AI 足够强大后,开源还是不是一个好选择?" 整理丨刘倩 程曼祺 嘉宾丨美国西北大学 MLL Lab 博士王子涵 ▲扫描上图中的二维码,可收听播客。《晚点聊 LateTalk》#102 期节目。欢迎在小宇宙、喜马拉雅、苹果 Podcast 等渠道关注、收听我们。 《晚点聊 LateTalk》是《晚点 LatePost》 推出的播客节目。"最一手的商业、科技访谈,最真实的从业者思考。" 这是《晚点 LatePost》 「开源对话」系列的第 2 篇。该系列将收录与开源相关的访谈与讨论。系列文章见文末的合集#开源对话。 上周五,DeepSeek 在官方 Twitter 上预告了下一周会连续 5 天开源 5 个代码库,进入 "open-source week"开源周。 目前 DeepSeek 已放出的 4 个库,主要涉及 DeepSeek-V3/R1 相关的训练与推理代码 。 这是比发布技术报告和开源模型权重更深度的开源。 有了训练和推理 工具,开发者才能更好地在自己的系统里,实现 DeepSeek 系列模型的高效表现。 (注:所有 4 个库和后续开源可见 DeepSeek GitHub 中的 Open-Inf ...