Large Model Inference
China Mobile Qilu Innovation Institute Releases "XunTest" (迅测) Tool: Boosting the Efficiency of Domestic Chip Selection
Qi Lu Wan Bao· 2025-06-06 08:15
Core Insights
- The continuous decline in large-model inference costs and the rapid development of domestic chips are driving the localized deployment of inference models in data-sensitive industries such as government, finance, and healthcare [1]
- The "XunTest" tool developed by China Mobile Qilu Innovation Institute addresses the challenge of testing model inference performance efficiently and accurately, which is crucial for selecting high-performance, low-cost chips [1][3]

Group 1
- The "XunTest" tool innovatively adopts a "configuration equals testing" model and integrates powerful automatic data-parsing capabilities, significantly improving chip selection efficiency [1][3]
- In practice, the tool has reduced the average manual monitoring time for a single test from 8 hours to 0.5 hours and cut data-organization time by 70%, tripling overall chip selection efficiency [1][3]

Group 2
- The core competitiveness of "XunTest" lies in two technological highlights: intelligent automatic testing based on vLLM, and automatic summarization and visualization of test data (a hypothetical configuration-driven vLLM harness is sketched after this summary) [3]
- The tool supports both local and remote testing modes, flexibly adapting to different chip deployment needs, and lets engineers launch a full-process task from a single configuration, overcoming the inefficiency of traditional manual testing [3]
- A standardized data-storage mechanism automatically calculates and generates key performance indicators, ensuring that test results remain comparable across different chip platforms and heterogeneous environments [3]
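XunTest itself is not public, but the "configuration equals testing" idea maps naturally onto vLLM's offline API. The sketch below is purely illustrative: the config schema, placeholder model id, and metric field names are our own assumptions, not XunTest's.

```python
# Hypothetical "configuration equals testing" harness in the spirit of
# XunTest (the real tool is not public). The config schema, model id,
# and metric fields below are our own illustration, not XunTest's.
import time
from vllm import LLM, SamplingParams

config = {                                    # one config == one full test run
    "model": "Qwen/Qwen2.5-7B-Instruct",      # placeholder model id
    "prompts": ["Introduce yourself."] * 32,
    "max_tokens": 128,
    "temperature": 0.0,
}

llm = LLM(model=config["model"])
params = SamplingParams(temperature=config["temperature"],
                        max_tokens=config["max_tokens"])

start = time.perf_counter()
outputs = llm.generate(config["prompts"], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
# A standardized record with the same fields for every chip/platform
# keeps results comparable across heterogeneous environments.
record = {
    "model": config["model"],
    "requests": len(config["prompts"]),
    "output_tokens": generated,
    "throughput_tok_per_s": round(generated / elapsed, 1),
}
print(record)
```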
Huawei's Three "Black Technologies": Out to Upend AI Computing?
虎嗅APP· 2025-05-23 11:47
Core Viewpoint
- The article discusses the challenges Chinese companies face in the large-model AI sector, particularly the MoE architecture's inherent inefficiencies and high hardware costs, and highlights Huawei's innovative approach to improving efficiency and user experience when serving DeepSeek models, aiming to build a sustainable, collaborative AI ecosystem [1]

Group 1: Huawei's Technological Innovations
- Huawei has introduced three significant hardware-affinity operator technologies: AMLA, fusion operator optimization, and SMTurbo, which aim to transform the speed and energy efficiency of large-model inference [4][5]
- AMLA (Ascend MLA) reworks floating-point arithmetic, achieving chip utilization above 70% by converting complex multiplications into simpler additions (the bit-level identity behind this idea is sketched after this summary) [7][9]
- Fusion operator optimization merges multiple operators into a single composite operator, improving parallelism and eliminating redundant data transfers, which yields significant performance gains in model inference [11][12]

Group 2: Performance Enhancements
- SMTurbo enables ultra-low-latency memory access across 384 cards, improving per-thread memory throughput by over 20% in cross-machine memory communication scenarios [14][16]
- Together, these technologies position Huawei's DeepSeek deployment as a competitive alternative to existing solutions, potentially outperforming Nvidia in inference performance [20][22]

Group 3: Future Development Directions
- Future AMLA research will focus on optimizing MLA operators for KV-cache quantization and fully quantized scenarios, broadening operator applicability [17]
- Exploration of fusion operator optimization will continue, aiming to improve the efficiency of large language models on Ascend hardware [17]
- Load/Store optimization will be refined to balance read and write loads, integrating the CPP concept into DeepSeek dispatch-and-combine scenarios for practical benefits at large batch sizes [17]
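Huawei's Ascend kernels are not public, but the general principle AMLA is described as exploiting, turning floating-point multiplications into integer additions on the binary representation, can be shown in a few lines. The sketch below demonstrates only the textbook IEEE-754 identity (adding k to the exponent field multiplies by 2^k); it is not Huawei's algorithm.

```python
# Illustration of the bit-level identity behind "multiplication as
# addition" on IEEE-754 floats: adding k to the exponent field computes
# x * 2**k with a single integer addition. This is only the textbook
# principle; Huawei's actual Ascend AMLA kernels are not public.
import numpy as np

def mul_pow2_by_int_add(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply a float32 array by 2**k via integer addition on exponent bits."""
    bits = x.view(np.int32)
    # The exponent occupies bits 23..30, so adding k << 23 increments it
    # by k. (Assumes normal, finite inputs; no overflow handling here.)
    return (bits + (np.int32(k) << 23)).view(np.float32)

x = np.array([1.5, -3.25, 0.875], dtype=np.float32)
print(mul_pow2_by_int_add(x, 4))  # [ 24.  -52.   14.] == x * 16
print(x * 2.0**4)                 # same result via floating-point multiply
```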
Large Model Inference Is No Longer "One-Track-Minded"
虎嗅APP· 2025-05-22 11:41
Core Viewpoint
- The article discusses the challenges and innovations in deploying large models, focusing on Huawei's approach to improving efficiency and user experience with large language models and the Mixture of Experts (MoE) architecture [1][2]

Group 1: Challenges in Large Model Deployment
- The MoE architecture carries significant hardware costs and efficiency issues, making it difficult for Chinese companies to accelerate in the competitive AI landscape [1]
- As MoE models continue to scale, the number of experts and total parameters grows sharply, creating severe storage and scheduling challenges [7]
- Traditional communication strategies such as AllReduce are inadequate in high-concurrency scenarios, leading to inefficient large-model inference [8]

Group 2: Innovations by Huawei
- Huawei's multi-stream parallel technology breaks the serial constraints of computation, allowing different data streams to be processed simultaneously and significantly reducing critical-path latency (a generic multi-stream sketch follows this summary) [12][15]
- The AllReduce operation has been innovatively restructured to improve communication efficiency, cutting data transmission volume by 35% and improving inference performance by 22-26% [15][17]
- Huawei's FlashComm technology optimizes communication in large-model inference by exploiting low-dimensional data characteristics, improving end-to-end inference performance [21]

Group 3: Future Directions
- Huawei plans to continue innovating in areas such as multi-stream parallelism and automatic weight prefetching to further improve the performance of large-model inference systems [21]
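As a rough illustration of the multi-stream idea (not Huawei's Ascend implementation, which overlaps computation with communication on NPUs), here is a minimal PyTorch sketch that runs two independent branches on separate CUDA streams so their kernels can overlap:

```python
# Minimal sketch of multi-stream parallelism in PyTorch: two independent
# branches run on separate CUDA streams so their kernels can overlap
# instead of serializing on the default stream. Illustrates the general
# idea only; Huawei's version targets Ascend NPUs.
import torch

assert torch.cuda.is_available()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
main = torch.cuda.current_stream()
s1.wait_stream(main)  # make a/b's initialization visible to both streams
s2.wait_stream(main)

with torch.cuda.stream(s1):   # branch 1, e.g. an attention projection
    y1 = a @ a
with torch.cuda.stream(s2):   # branch 2, independent of branch 1
    y2 = b @ b

torch.cuda.synchronize()      # join both streams before consuming results
out = y1 + y2
print(out.shape)
```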
Speeding Up Large Models by 80%: Huawei Unveils FlashComm, Its Ascend Inference Trump Card, Clearing the Communication-Compute Bottleneck in Three Moves
机器之心· 2025-05-22 10:25
Core Viewpoint
- The article discusses optimizing large-model inference communication through Huawei's FlashComm technology, addressing the challenges posed by rapid growth in model parameters and the need for efficient communication strategies in distributed computing environments [2][6][17]

Group 1: Communication Challenges
- The growing scale of clusters and inference concurrency in large language models creates significant communication pressure, particularly as Mixture of Experts (MoE) models expand and their expert counts and total parameters grow sharply [6][18]
- Traditional communication strategies such as AllReduce hit limits in high-concurrency scenarios, where bandwidth constraints increase end-to-end inference latency [6][8]

Group 2: FlashComm Innovations
- FlashComm1 optimizes AllReduce communication by decomposing it into ReduceScatter and AllGather operations, yielding a 26% inference performance improvement (a generic sketch of this decomposition follows this summary) [7][11]
- FlashComm2 rebalances computation and communication by flattening three-dimensional tensors into two-dimensional matrices, achieving a 33% increase in overall inference speed [7][14]
- FlashComm3 exploits multi-stream parallelism to improve MoE inference efficiency, increasing decoding-phase throughput by 25-30% [7][15]

Group 3: Future Directions
- The Huawei team aims to continue innovating in areas such as multi-stream parallelism, automatic weight prefetching, and model parallelism to further improve the performance of large-model inference systems [17][18]
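The decomposition FlashComm1 builds on is the standard collective identity all_reduce(x) = all_gather(reduce_scatter(x)); splitting the collective exposes a window where follow-up work can run on the smaller scattered shard. Below is a generic torch.distributed sketch of that identity, not Huawei's implementation, assuming an NCCL backend launched via torchrun:

```python
# Generic sketch of the identity FlashComm1 builds on:
# all_reduce(x) == all_gather(reduce_scatter(x)).
# Splitting the collective exposes a window where follow-up work can run
# on a 1/world-size shard. Plain torch.distributed over NCCL; run with
# `torchrun --nproc_per_node=<N> this_file.py`.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

x = torch.randn(world * 1024, device="cuda")

# Phase 1: every rank receives one fully reduced shard.
shard = torch.empty(1024, device="cuda")
dist.reduce_scatter_tensor(shard, x)

# ...per-shard work (e.g. quantization or a norm) could run here on
# 1/world of the data, which is where the savings come from...

# Phase 2: reassemble the full reduced tensor on every rank.
full = torch.empty_like(x)
dist.all_gather_into_tensor(full, shard)

dist.destroy_process_group()
```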
Inference Performance Face-Off: Huawei + DeepSeek > Nvidia?
虎嗅APP· 2025-05-19 13:47
Huxiu's note: "In the large-model world, deployment is king." That line keeps gaining weight. Since DeepSeek V3/R1 exploded overnight during Spring Festival, large models built on ultra-large-scale MoE (Mixture of Experts) architectures have been moving from training and development toward deployed inference applications.

For MoE inference deployment, efficiency has always been a pain point: only whoever pushes deployment compute efficiency to the limit can truly win commercial success with large models. Constrained by enormous model capacity and compute requirements, however, traditional deployment schemes usually depend on multiple data-center-grade GPUs (such as the H20). And as everyone knows, Nvidia is not only expensive but also keeps being squeezed by geopolitical friction, repeatedly cutting its own products' performance to satisfy regulatory requirements.

Recently, Huawei fully disclosed its inference deployment technology for ultra-large-scale MoE models, achieving not just another domestic breakthrough but inference deployment performance that comprehensively surpasses Nvidia's Hopper architecture.

How did they do it?

Math compensating for physics: pushing compute efficiency to the extreme

"Math compensating for physics" means using mathematical theory, tools, algorithms, and modeling to offset the limitations of hardware and manufacturing processes, so that the capabilities of the chip and the system are exploited to the fullest. Huawei's rotating chairwoman Meng Wanzhou said in her 2025 New Year address:

"Engineers from more than ten Huawei labs, together with our partners, formed a 'mixed' team. Facing severe engineering challenges in the Tiancheng (天成) AI cluster system and in single-chip performance, they creatively applied math to compensate for physics, non-Moore to compensate for Moore, and systems to compensate ...
Under 150,000 RMB! Tsinghua Post-90s Team Launches the "褐蚁" (Brown Ant) All-in-One Machine, Already Supporting Alibaba's Latest Qwen3 Models | 钛媒体AGI
Tai Mei Ti APP · 2025-04-30 15:09
Xingyun Integrated Circuit founder and CEO Ji Yu

News, April 30: TMTPost AGI has learned that Beijing Xingyun Integrated Circuit Co., Ltd. (行云集成电路), founded by a Tsinghua post-90s graduate, has announced a new all-in-one machine, "褐蚁" (Brown Ant), which runs the full-strength DeepSeek R1/V3 models for at most 150,000 RMB at a conversation speed of 20 token/s.

This afternoon, founder and CEO Ji Yu told TMTPost AGI that the 褐蚁 machine already supports Alibaba's newly released Qwen3 family of open-source models, including the top-spec Qwen3-235B-A22B.

Specifically, the 褐蚁 machine comes in three configurations. The best-value "extra-large" model, 褐蚁 HY90, pairs a dual-socket AMD EPYC 9355 server with 24 sticks of 48 GB 6400 MT/s memory and an NV 5090D compute card, supporting both FP8 and INT4 precision: it runs full-strength DeepSeek at 21 token/s in FP8 and 28 token/s in INT4, supports up to 128K context, and is priced at 149,000 RMB (a back-of-envelope check of these speeds follows the table below). Xingyun Integrated Circuit will also release a "large" 褐蚁 HY70 and a "medium" 褐蚁 HY50.

| Model | 褐蚁 HY90 | 褐蚁 HY70 | 褐蚁 HY50 |
| --- | --- | --- | --- |
...
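A quick sanity check on the advertised speeds: if decoding is memory-bandwidth bound, throughput is capped at (memory bandwidth) / (bytes of weights read per token). Assuming DeepSeek-R1's roughly 37B active parameters per token (a figure from the public model card, not this article) and the HY90's 24 channels of DDR5-6400, the numbers line up plausibly:

```python
# Back-of-envelope decode-throughput check for the HY90 configuration.
# Assumptions (ours, not the article's): decoding is memory-bandwidth
# bound, and DeepSeek-R1 reads ~37B active parameters per token (from
# the public model card of this MoE model).
DDR5_CHANNELS = 24        # dual EPYC 9355: 12 channels per socket
BYTES_PER_BEAT = 8        # 64-bit DDR5 channel
TRANSFER_RATE = 6.4e9     # 6400 MT/s

peak_bw = DDR5_CHANNELS * BYTES_PER_BEAT * TRANSFER_RATE  # ~1.23 TB/s

ACTIVE_PARAMS = 37e9      # parameters touched per generated token

for precision, bytes_per_param in [("FP8", 1.0), ("INT4", 0.5)]:
    ceiling = peak_bw / (ACTIVE_PARAMS * bytes_per_param)
    print(f"{precision}: bandwidth ceiling ~ {ceiling:.0f} token/s")

# Prints ~33 token/s for FP8 and ~66 token/s for INT4. The advertised
# 21 and 28 token/s are ~60% and ~40% of that ceiling, a plausible
# real-world efficiency once routing and KV-cache traffic are included.
```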
149,000 RMB to Take Home an All-in-One Machine That Runs Full-Strength DeepSeek Smoothly! From a Tsinghua Post-90s Startup
量子位· 2025-04-29 04:18
Jin Lei, reporting from Aofeisi. QbitAI | WeChat official account QbitAI

A full-strength DeepSeek all-in-one machine whose price has been driven down to the 100,000-RMB level! And not a quantized version either, but the original 671B-parameter, highest-quality FP8 model.

No suspense: it is the 褐蚁 HY90, the latest product from Beijing 行云集成电路 (Xingyun Integrated Circuit), priced at 149,000 RMB.

Some readers will ask: running DeepSeek-R1/V3, can its speed match the official service? It can, and it can even be faster. For example, pose a question and get a feel for it:

"A Chinese character has a left-right structure; the left part is 木 and the right part is 乞. What is this character? Answer with the character only."

△ Left: the all-in-one machine; right: the DeepSeek official site

The video makes it clear that not only is the answer accurate, the machine is also visibly faster than the DeepSeek official site, at a rough estimate already close to 22 tokens/s.

So what is the story behind this machine? Beyond the product, the company itself carries quite a few "labels", the most eye-catching probably being its CEO: Ji Yu, a Tsinghua post-90s PhD, former Huawei "Genius Youth" hire, and winner of the China Computer Federation (CCF) Outstanding Doctoral Dissertation Award.

And how does the 褐蚁 HY90 fare on a wider range of tasks? Here comes a round of hands-on tests across more dimensions. Hands-on test of the 100,000-RMB-class Deep...
Nvidia: Blackwell Revenue Beats Expectations; Inference Boom Will Dominate GPU Demand in 2025 - 2025-03-04
Investment Rating
- The report assigns a "Buy" rating with a target price of $160, implying 33.17% upside from the current price of $120.15 [2][31]

Core Insights
- The company is expected to see significant growth driven by demand for its Blackwell products, particularly in the AI and data center sectors. Fourth-quarter revenue for fiscal year 2025 came in at $39.3 billion, up 77.9% year-over-year, beating prior guidance and market expectations [3][5][10]
- Gross margin for the latest quarter was 73.0%, slightly below expectations due to higher short-term costs from ramping Blackwell production; margins are expected to improve as production stabilizes [5][10]
- The company guided next-quarter revenue to a midpoint of $43.0 billion, implying 65.1% year-over-year growth [10][15]

Financial Performance Summary
- For the fiscal fourth quarter ended January 26, 2025, total revenue reached $39.3 billion with net profit of $22.1 billion, yielding GAAP diluted EPS of $0.89 and exceeding Bloomberg consensus estimates [3][6]
- The company generated $15.5 billion of free cash flow in the latest quarter, up from $11.5 billion in the same period last year, and returned $8.1 billion to shareholders through buybacks and dividends [6][10]
- Data center revenue reached $35.6 billion, up 93.3% year-over-year, driven by demand for large models and AI applications [15][19]

Product and Market Developments
- The Blackwell platform is the fastest-ramping product in the company's history, with Q4 revenue of $11.0 billion, exceeding expectations. The transition from Hopper to Blackwell has been more challenging, but gross margins are expected to improve as production scales [10][19]
- The company launched Project DIGITS, a personal AI computer capable of running large models, underscoring its commitment to AI innovation [26][20]
- Automotive revenue rose 102.8% year-over-year on rising demand for smart-driving chips, with a projected market of $5 billion for autonomous-driving chips in 2025 [27][26]

Future Outlook
- The company expects a 29% compound annual growth rate (CAGR) for revenue and EPS over the next three years, supported by strong capital-expenditure growth from major customers such as Microsoft and Google [32][31]
- The report emphasizes the need for continuous product development and iteration to maintain competitive advantages in the fast-evolving AI and semiconductor markets [32][19]
Best Practices for Deploying the DeepSeek-R1 Model on China Telecom e-Cloud (天翼云) CPU Instances
量子位· 2025-03-03 07:58
Source: the China Telecom e-Cloud (天翼云) website. QbitAI | WeChat official account QbitAI

This article introduces the advantages of Intel® Xeon® processors for AI inference: how to use a one-click deployment image to run AMX-accelerated inference of the DeepSeek-R1 7B distilled model in a pure-CPU environment, and the practice of deploying the full-strength DeepSeek-R1 671B model, also on CPUs alone.

Because of their massive parameter counts and complex structure, large models usually demand powerful compute resources for inference, which makes compute power the core element of large-model applications. Since the arrival of the DeepSeek-R1 model, industries everywhere have begun broad research and exploration into how to adopt large-model capabilities, and market demand for large-model inference compute has grown explosively.

In sectors such as healthcare, finance, and retail, for example, enterprises are eager to plug in the DeepSeek large model to improve decision-making efficiency and business capability, driving innovation across their industries. Against this backdrop, the supply and optimization of compute power have become key factors in bringing large models to production.

In recent years, advances in CPU process technology and architecture, together with the launch of the Intel® Advanced Matrix Extensions (AMX) accelerator, have brought rapid gains in compute power. Intel continues deep research into large-model inference and other AI domains, providing comprehensive AI software support, compatibility with mainstream AI software, and multiple software approaches to improving CPUs' AI perf...
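The article's one-click image is not reproduced here, but a minimal CPU-only sketch of AMX-accelerated inference using Intel Extension for PyTorch (IPEX) looks roughly like this; the ipex.llm.optimize call and the bf16 path follow IPEX 2.x documentation as we understand it, so treat this as a sketch rather than the article's exact recipe:

```python
# Minimal CPU-only sketch of AMX-accelerated inference with IPEX.
# Assumptions (ours, not the article's): IPEX 2.x, a 4th-gen Xeon or
# later (check `lscpu | grep amx`), and the public DeepSeek-R1 7B
# distill checkpoint from Hugging Face.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).eval()

# IPEX's LLM path fuses ops and routes bf16 GEMMs to AMX tile
# instructions on supported Xeons.
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Briefly explain what Intel AMX does.", return_tensors="pt")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```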