Large Model Inference

Large-model inference has to be cost-effective
虎嗅APP· 2025-06-06 10:10
Core Insights
- The article traces the evolution and optimization of the Mixture of Experts (MoE) model, highlighting Huawei's MoGE architecture, which addresses inefficiencies in the original MoE design and improves cost-effectiveness and ease of deployment [1][3].

Group 1: MoE Model Evolution
- Thanks to its dynamic sparse computation, the MoE model has become a key path to improving large-model inference efficiency [3].
- Huawei's Pangu Pro MoE 72B model significantly reduces computational cost and ranks first domestically on the SuperCLUE benchmark among models with under 100 billion parameters [3].
- Through system-level optimization, the Pangu Pro MoE model achieves a 6-8x improvement in inference performance and reaches a throughput of 321 tokens/s on the Ascend 300I Duo [3][30].

Group 2: Optimization Strategies
- Huawei's H2P (Hierarchical & Hybrid Parallelism) strategy improves inference efficiency by letting task-specific groups communicate among themselves rather than holding a "full team meeting" for every step [5][6].
- The TopoComm optimization reduces communication overhead and improves data-transmission efficiency, cutting synchronization operations by 35% [8][10].
- The DuoStream optimization lets communication and computation execute concurrently, significantly improving inference efficiency [11].

Group 3: Operator Fusion
- Huawei developed two specialized fusion operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains [15][17].
- MulAttention speeds up attention computation by 4.5x and reaches over 89% data-transfer efficiency [17].
- SwiftGMM accelerates GMM computation by 2.1x and cuts end-to-end inference latency by 48.7% [20].

Group 4: Inference Algorithm Acceleration
- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by over 10% while maintaining accuracy (a routing-and-pruning sketch follows this summary) [25].
- The TrimR algorithm removes 14% of unnecessary reasoning steps by monitoring and adjusting the model's reasoning process [26].
- The SpecReason algorithm uses smaller models to assist larger ones, yielding a 30% increase in throughput [27].

Group 5: Performance Breakthroughs
- The Ascend 800I A2 platform delivers a single-card throughput of 1528 tokens/s under optimal conditions [30][31].
- The Ascend 300I Duo platform offers a cost-effective option for MoE inference, reaching a maximum throughput of 321 tokens/s [32][33].
- Overall, Huawei's optimizations lay a solid foundation for high-performance, large-scale, low-cost inference [33].
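The PreMoE bullet above only names the idea of pruning experts per task. The sketch below is a minimal PyTorch illustration of top-k MoE routing combined with batch-level expert pruning, under stated assumptions; it is not Huawei's implementation, and the `route_tokens` name and `keep_ratio` knob are invented for this example.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden: torch.Tensor, router_w: torch.Tensor, top_k: int = 2,
                 keep_ratio: float = 0.75):
    """Top-k MoE routing with a simple batch-level pruning step.

    hidden:     [num_tokens, d_model] token representations
    router_w:   [d_model, num_experts] router projection
    keep_ratio: fraction of experts kept after pruning (a hypothetical knob,
                standing in for PreMoE's task-specific expert selection).
    """
    logits = hidden @ router_w                      # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)

    # Prune experts that receive little probability mass over this batch,
    # so only the "relevant" experts stay resident.
    expert_load = probs.mean(dim=0)                 # [num_experts]
    num_keep = max(top_k, int(keep_ratio * probs.shape[-1]))
    kept = torch.topk(expert_load, num_keep).indices

    # Re-normalize over the surviving experts and pick top-k per token.
    pruned = probs[:, kept]
    weights, local_idx = torch.topk(pruned, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    expert_idx = kept[local_idx]                    # map back to global expert ids
    return expert_idx, weights

# Example: 8 tokens, 16 experts, keep 12 of them, route each token to 2.
tokens = torch.randn(8, 512)
router = torch.randn(512, 16)
idx, w = route_tokens(tokens, router, top_k=2, keep_ratio=0.75)
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```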
A "royal flush" combo for MoE inference: Ascend × Pangu push inference performance up 6-8x
机器之心· 2025-06-06 09:36
Core Viewpoint
- The article highlights the advances in Huawei's Pangu Pro MoE 72B model, whose inference techniques and optimizations deliver substantial performance gains in AI applications [2][23].

Group 1: Model Performance and Optimization
- The Pangu Pro MoE model achieves a 6-8x improvement in inference performance through system-level optimization, including high-performance operator fusion and model-native speculative algorithms [3][23].
- Throughput reaches 321 tokens/s on the Ascend 300I Duo and climbs to 1528 tokens/s on the Ascend 800I A2, showing that the model fully exploits the hardware's potential [3][24].

Group 2: Hierarchical and Hybrid Parallelism
- Huawei introduces a Hierarchical & Hybrid Parallelism (H2P) strategy that raises efficiency by letting specialized communication and computation proceed without every component engaging at every step [6][7].
- The strategy yields a 33.1% increase in decode throughput over traditional parallelization [7].

Group 3: Communication Optimization
- The TopoComm optimization scheme reduces static overhead and improves data-transmission efficiency, achieving a 35% reduction in synchronization operations and a 21% increase in effective bandwidth [9][12].
- A mixed-precision quantized communication strategy cuts communication data volume by 25% and AllGather communication time by 39% [9].

Group 4: Operator Fusion and Efficiency
- The MulAttention and SwiftGMM fusion operators address the inefficiencies of conventional operators, markedly improving memory access and computation scheduling [15][18].
- MulAttention accelerates attention computation by 4.5x, while SwiftGMM reduces inference latency by 48.7% [16][18].

Group 5: Dynamic Pruning and Collaborative Optimization
- The PreMoE dynamic pruning algorithm raises inference throughput by over 10% by activating only the experts relevant to a given task [21].
- The TrimR and SpecReason algorithms streamline the reasoning process, cutting unnecessary computation and improving throughput by 30% (a draft-and-verify sketch follows this summary) [20][22].

Group 6: Overall System Optimization
- The end-to-end optimization of the Ascend-Pangu inference system lays a solid foundation for high-performance, large-scale, cost-effective AI model deployment [28].
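SpecReason is summarized above only as "smaller models assist larger ones". The sketch below shows the general draft-and-verify pattern that speculative approaches of this kind build on; it is a hedged illustration assuming HuggingFace-style causal LMs that return `.logits` and a batch size of 1, not the SpecReason algorithm itself, and `speculative_step` is a name invented here.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix_ids, num_draft=4):
    """One round of draft-and-verify decoding (assumes batch size 1).

    A small draft model proposes `num_draft` tokens greedily; the large target
    model scores the extended sequence once and accepts the longest prefix of
    proposals it agrees with, falling back to its own token at the first
    disagreement."""
    ids = prefix_ids.clone()
    proposals = []
    for _ in range(num_draft):                       # cheap sequential drafting
        logits = draft_model(ids).logits[:, -1, :]
        nxt = logits.argmax(dim=-1, keepdim=True)
        proposals.append(nxt)
        ids = torch.cat([ids, nxt], dim=-1)

    # One target forward pass verifies all drafted positions in parallel.
    tgt_logits = target_model(ids).logits
    accepted = prefix_ids
    for i, prop in enumerate(proposals):
        pos = prefix_ids.shape[-1] + i - 1           # logits at pos predict token pos+1
        tgt_next = tgt_logits[:, pos, :].argmax(dim=-1, keepdim=True)
        if torch.equal(tgt_next, prop):
            accepted = torch.cat([accepted, prop], dim=-1)
        else:                                        # first disagreement: take target's token and stop
            accepted = torch.cat([accepted, tgt_next], dim=-1)
            break
    return accepted
```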
China Mobile Qilu Innovation Institute releases the "XunTest" tool to boost domestic chip-selection efficiency
齐鲁晚报 · 2025-06-06 08:15
Core Insights
- Falling inference costs for large models and the rapid development of domestic chips are driving localized deployment of inference models in data-sensitive industries such as government, finance, and healthcare [1].
- The "XunTest" tool developed by the China Mobile Qilu Innovation Institute addresses the challenge of testing model inference performance efficiently and accurately, which is crucial for selecting high-performance, low-cost chips [1][3].

Group 1
- "XunTest" introduces a "configuration equals testing" model and integrates automatic data-parsing capabilities, significantly improving chip-selection efficiency [1][3].
- In practice, the tool cuts the average manual monitoring time for a single test from 8 hours to 0.5 hours and reduces data-organization time by 70%, for an overall 3x improvement in chip-selection efficiency [1][3].

Group 2
- The core strengths of "XunTest" are two technical highlights: intelligent automated testing built on vLLM, and automatic summarization and visualization of test data (a config-driven benchmarking sketch follows this summary) [3].
- The tool supports both local and remote testing modes, adapting flexibly to different chip deployments, and lets engineers launch a full-process task from a single configuration, overcoming the inefficiency of traditional manual testing [3].
- A standardized data-storage mechanism automatically computes and generates key performance indicators, ensuring that results are comparable across chip platforms and heterogeneous environments [3].
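The summary describes XunTest's "configuration equals testing" workflow on top of vLLM but gives no concrete interface. The sketch below shows one way a config-driven throughput measurement could look using vLLM's public offline API; the YAML schema and `run_benchmark` helper are assumptions for illustration, not XunTest's actual format.

```python
import time
import yaml
from vllm import LLM, SamplingParams

def run_benchmark(config_path: str) -> dict:
    """Config-driven throughput test in the spirit of 'configuration equals
    testing': one YAML file fully describes a run, and the script reports
    comparable key indicators."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    llm = LLM(model=cfg["model"], tensor_parallel_size=cfg.get("tp", 1))
    params = SamplingParams(max_tokens=cfg.get("max_tokens", 256), temperature=0.0)
    prompts = cfg["prompts"]

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)          # offline batched inference
    elapsed = time.perf_counter() - start

    gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return {
        "requests": len(prompts),
        "elapsed_s": round(elapsed, 2),
        "throughput_tok_s": round(gen_tokens / elapsed, 1),
    }

# Hypothetical config file contents:
# model: Qwen/Qwen2.5-7B-Instruct
# tp: 1
# max_tokens: 256
# prompts: ["Explain MoE routing in one sentence."]
```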
Can Huawei's three "black tech" breakthroughs upend AI computing?
虎嗅APP· 2025-05-23 11:47
Core Viewpoint
- The article examines the challenges Chinese companies face in large-model AI, particularly the MoE architecture's inherent inefficiencies and high hardware costs, and how Huawei's work on running DeepSeek on Ascend hardware aims to improve efficiency and user experience while building a sustainable collaborative ecosystem [1].

Group 1: Huawei's Technological Innovations
- Huawei has introduced three hardware-affinity operator technologies, AMLA, fusion-operator optimization, and SMTurbo, aimed at transforming the speed and energy efficiency of large-model inference [4][5].
- AMLA (Ascend MLA) reworks floating-point arithmetic so that complex multiplications become simpler additions, pushing chip utilization above 70% (a bit-level illustration follows this summary) [7][9].
- Fusion-operator optimization merges multiple operators into a single composite operator, improving parallelism and eliminating redundant data transfers, which yields significant inference performance gains [11][12].

Group 2: Performance Enhancements
- SMTurbo enables ultra-low-latency memory access across 384 cards, improving per-thread memory throughput by over 20% in cross-machine memory-communication scenarios [14][16].
- Together, these technologies position Huawei's DeepSeek deployment as a competitive alternative to existing solutions, potentially outperforming Nvidia in inference performance [20][22].

Group 3: Future Development Directions
- Future work on AMLA will focus on optimizing MLA operators for KV-cache quantization and fully quantized scenarios, broadening the operator's applicability [17].
- Exploration of fusion-operator optimization will continue, aiming to improve the efficiency of large language models on Ascend hardware [17].
- Load/Store optimization will be refined to balance read and write loads, integrating the CPP concept into DeepSeek dispatch-and-combine scenarios for practical gains at large batch sizes [17].
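AMLA's exact formulation is not given in the summary. The NumPy snippet below only illustrates the underlying arithmetic idea that certain floating-point multiplications (here, scaling by a power of two) can be carried out as an integer addition on the exponent field of the IEEE-754 encoding; it is a toy example, not the Ascend operator.

```python
import numpy as np

def mul_pow2_via_exponent_add(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply float32 values by 2**k by adding k to the IEEE-754 exponent
    field: an integer addition replaces a floating-point multiplication.
    Valid only for normal, finite inputs whose exponent does not overflow."""
    bits = x.astype(np.float32).view(np.int32)
    bits = bits + (np.int32(k) << 23)        # the exponent occupies bits 23..30
    return bits.view(np.float32)

x = np.array([1.5, -3.25, 0.875], dtype=np.float32)
print(mul_pow2_via_exponent_add(x, 3))       # [ 12.   -26.     7. ]
print(x * 8.0)                               # same result via ordinary multiply
```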
Large-model inference no longer runs on a single track
虎嗅APP· 2025-05-22 11:41
Core Viewpoint
- The article examines the challenges and innovations in deploying large models, focusing on how Huawei improves efficiency and user experience for large language models built on the Mixture of Experts (MoE) architecture [1][2].

Group 1: Challenges in Large Model Deployment
- The MoE architecture carries substantial hardware costs and efficiency problems, making it hard for Chinese companies to accelerate in the competitive AI landscape [1].
- As MoE models keep scaling, the number of experts and the total parameter count grow exponentially, creating severe storage and scheduling challenges [7].
- Traditional communication strategies such as AllReduce fall short in high-concurrency scenarios, making large-model inference inefficient [8].

Group 2: Innovations by Huawei
- Huawei's multi-stream parallel technology breaks the serial constraints of computation, allowing different data streams to be processed simultaneously and significantly reducing critical-path latency (a minimal multi-stream sketch follows this summary) [12][15].
- The AllReduce operation has been restructured to improve communication efficiency, reducing data-transmission volume by 35% and lifting inference performance by 22-26% [15][17].
- Huawei's FlashComm technology optimizes communication in large-model inference by exploiting the low-dimensional nature of the data being exchanged, improving end-to-end inference performance [21].

Group 3: Future Directions
- Huawei plans to continue innovating in areas such as multi-stream parallelism and automatic weight prefetching to further improve large-model inference systems [21].
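The multi-stream bullet above stays at the conceptual level. Here is a minimal PyTorch sketch of launching two independent operations on separate CUDA streams so the device may overlap them instead of executing them serially; it is a toy illustration assuming CUDA tensors, not Huawei's stream scheduler.

```python
import torch

def overlapped_matmuls(a, b, c, d):
    """Launch two independent matmuls on separate CUDA streams so the GPU can
    overlap them rather than running them strictly one after the other."""
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    s1.wait_stream(torch.cuda.current_stream())    # inputs produced on the default
    s2.wait_stream(torch.cuda.current_stream())    # stream must be visible first
    with torch.cuda.stream(s1):
        out1 = a @ b
    with torch.cuda.stream(s2):
        out2 = c @ d
    torch.cuda.current_stream().wait_stream(s1)    # results safe to consume
    torch.cuda.current_stream().wait_stream(s2)
    return out1, out2
```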
Speeding large models up by 80%: Huawei unveils FlashComm, its Ascend inference trump card, tackling the communication-compute bottleneck in three moves
机器之心· 2025-05-22 10:25
Core Viewpoint
- The article describes Huawei's FlashComm technology for optimizing communication in large-model inference, addressing the pressures created by rapidly growing model parameters and the need for efficient communication in distributed environments [2][6][17].

Group 1: Communication Challenges
- Growing cluster sizes and inference concurrency in large language models create heavy communication pressure, especially as Mixture of Experts (MoE) models expand and the number of experts and total parameters grows exponentially [6][18].
- Traditional communication strategies such as AllReduce hit bandwidth limits in high-concurrency scenarios, increasing end-to-end inference latency [6][8].

Group 2: FlashComm Innovations
- FlashComm1 optimizes AllReduce communication by decomposing it into ReduceScatter and AllGather operations, delivering a 26% inference performance improvement (a collective-decomposition sketch follows this summary) [7][11].
- FlashComm2 rebalances computation and communication by flattening three-dimensional tensors into two-dimensional matrices, raising overall inference speed by 33% [7][14].
- FlashComm3 uses multi-stream parallelism to improve MoE inference efficiency, increasing decode-phase throughput by 25-30% [7][15].

Group 3: Future Directions
- The Huawei team plans further work on multi-stream parallelism, automatic weight prefetching, and model parallelism to push large-model inference performance higher [17][18].
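FlashComm1's decomposition of AllReduce is only named above. The sketch below shows the mathematically equivalent ReduceScatter-then-AllGather form using `torch.distributed`, which is the general pattern such optimizations build on; FlashComm's actual operator placement and quantization details are not reproduced here.

```python
import torch
import torch.distributed as dist

def allreduce_as_rs_ag(x: torch.Tensor) -> torch.Tensor:
    """Equivalent of all_reduce(x, SUM), expressed as ReduceScatter followed by
    AllGather. Splitting the collective lets the two halves be placed at
    different points of a layer and lets the reduced shard be consumed locally
    before it is gathered. Assumes torch.distributed is already initialized and
    x.shape[0] is divisible by the world size."""
    world = dist.get_world_size()
    shard = torch.empty(x.shape[0] // world, *x.shape[1:],
                        dtype=x.dtype, device=x.device)
    dist.reduce_scatter_tensor(shard, x, op=dist.ReduceOp.SUM)  # sum + shard
    out = torch.empty_like(x)
    dist.all_gather_into_tensor(out, shard)                     # reassemble full tensor
    return out
```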
Inference performance showdown: Huawei + DeepSeek > Nvidia?
虎嗅APP· 2025-05-19 13:47
Huxiu's note: "In the world of large models, deployment is king." That statement keeps gaining weight. Since DeepSeek V3/R1 went viral overnight during the Spring Festival, large models built on ultra-large-scale MoE (Mixture of Experts) architectures have been moving from training and development toward deployed inference applications.

For MoE inference deployment, efficiency has always been the pain point. Only those who push deployment compute efficiency to the limit can turn large models into commercial success. Constrained by enormous model capacity and compute requirements, however, traditional deployment schemes usually rely on multiple data-center-grade GPUs (such as the H20). As everyone knows, Nvidia is not only expensive but also keeps being affected by geopolitical friction, repeatedly lowering its products' performance to satisfy regulatory requirements.

Recently, Huawei fully disclosed its inference deployment technology for ultra-large-scale MoE models, not only marking a further domestic breakthrough but also comprehensively surpassing the inference deployment performance of setups based on Nvidia's Hopper architecture.

How did they do it?

Mathematics compensating for physics: pushing compute efficiency to the limit

"Mathematics compensating for physics" means using mathematical theory, tools, algorithms, and modeling to offset the limitations of hardware and manufacturing processes, so that chip and system capabilities are exploited to the fullest. Huawei's rotating chairwoman Meng Wanzhou mentioned in her 2025 New Year address:

"Engineers from more than a dozen Huawei laboratories and from our partners formed a 'mixed' team; facing the severe engineering challenges of the 天成 AI cluster system and single-chip performance, they creatively applied mathematics to compensate for physics, non-Moore to compensate for Moore, and systems to compensate for ...
Under 150,000 yuan! Post-90s Tsinghua team launches the "褐蚁" all-in-one machine, already supporting Alibaba's latest Qwen3 models | TMTPost AGI
钛媒体APP · 2025-04-30 15:09
Xingyun Integrated Circuit founder and CEO Ji Yu. April 30 news: TMTPost AGI has learned that Beijing Xingyun Integrated Circuit Co., Ltd. ("Xingyun Integrated Circuit"), founded by a post-90s Tsinghua graduate, has announced a new all-in-one machine, the "褐蚁" (Brown Ant), which runs the full-capacity DeepSeek R1/V3 large models for at most 150,000 yuan, with a conversation speed of 20 tokens/s.

This afternoon, Xingyun founder and CEO Ji Yu told TMTPost AGI that the 褐蚁 machine already supports Alibaba's newly released Qwen3 series of open-source models, including the top-spec Qwen3-235B-A22B.

Specifically, the 褐蚁 machine comes in three configurations. The best-value "extra-large" model, 褐蚁 HY90, is built on a dual-socket AMD EPYC 9355 server with 24 sticks of 48 GB 6400 MT/s memory and an NV 5090D compute card; it supports both FP8 and INT4 precision, reaching a conversation speed of 21 tokens/s on the full DeepSeek model at FP8 and 28 tokens/s at INT4, supports up to 128K of context, and is priced at 149,000 yuan. Xingyun will also release the "large" 褐蚁 HY70 and "medium" 褐蚁 HY50 configurations.

| Model | 褐蚁 HY90 | 褐蚁 HY70 | 褐蚁 HY50 |
| --- | --- | --- | --- |
...
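As a sanity check on the claim that the HY90's memory configuration can hold the full FP8, 671B-parameter DeepSeek model, here is a back-of-the-envelope calculation. It assumes weights dominate memory use and ignores the compute card's own memory, activations, and framework overhead.

```python
# Rough memory estimate for running the full DeepSeek R1/V3 weights in system RAM.
params_b = 671e9             # DeepSeek R1/V3 parameter count
bytes_per_param_fp8 = 1      # FP8 stores one byte per weight
weights_gb = params_b * bytes_per_param_fp8 / 1e9   # ~671 GB of weights
system_ram_gb = 24 * 48                              # 24 DIMMs x 48 GB = 1152 GB
print(f"weights ~= {weights_gb:.0f} GB, RAM = {system_ram_gb} GB, "
      f"headroom ~= {system_ram_gb - weights_gb:.0f} GB for KV cache and activations")
```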
For 149,000 yuan, take home an all-in-one machine that runs full-capacity DeepSeek smoothly, built by a post-90s Tsinghua startup
量子位· 2025-04-29 04:18
Jin Lei, from Aofeisi. QbitAI | WeChat official account QbitAI. A full-capacity DeepSeek all-in-one machine whose price has been driven down to the 100,000-yuan level! And it is not a quantized version, but the original 671B-parameter, highest-quality FP8 model.

[Image] Left: the all-in-one machine; right: the official DeepSeek website. The video makes it easy to see that, besides accurate answers, the machine is visibly faster than the official DeepSeek site; a rough estimate puts it close to 22 tokens/s.

So where does this machine come from?

Some readers may ask: can its speed on DeepSeek-R1/V3 rival the official service?

It can, and it is even faster. For example, ask it a question and get a feel for it:

"A Chinese character has a left-right structure: the left part is 木 and the right part is 乞. What is this character? Answer with the character only."

No suspense: it is the 褐蚁 HY90, the latest product from Beijing Xingyun Integrated Circuit, priced at 149,000 yuan.

Beyond the product itself, the company carries quite a few "labels," the most eye-catching of which may be its CEO: Ji Yu, a post-90s Tsinghua PhD, a former Huawei "Genius Youth" hire, and a winner of the CCF outstanding doctoral dissertation award.

So how does the 褐蚁 HY90 perform when given a wider range of tasks? Here comes a round of hands-on tests across more dimensions.

Hands-on test of the 100,000-yuan-class Deep ...