Large Model Inference
Huawei Releases New AI Inference Technology; China UnionPay Improves Large Model Efficiency 125-Fold
Core Viewpoint
- Huawei has launched the Unified Cache Manager (UCM), an AI inference memory data management technology aimed at optimizing inference speed, efficiency, and cost in large model inference [1][3]

Group 1: UCM Technology Overview
- UCM is a KV Cache-centered inference acceleration suite that integrates multiple caching acceleration algorithms to manage the KV Cache data generated during inference, thereby expanding the inference context window [1][3]
- The technology aims to enhance the AI inference experience, improve cost-effectiveness, and accelerate the commercial cycle of AI applications [1][4]
- UCM features a hierarchical adaptive global prefix caching technology that can reduce first-token latency by up to 90% [3][6]

Group 2: Industry Application and Impact
- In a pilot application with China UnionPay, UCM improved large model inference speed by 125 times, allowing customer queries to be identified precisely in just 10 seconds [4]
- The financial sector is the first to adopt this technology because of its digital nature and high demands for speed, efficiency, and reliability, making it an ideal testing ground for new AI technologies [4][6]

Group 3: Differentiation and Competitive Advantage
- UCM's differentiation lies in its integration of professional storage capabilities, offering full lifecycle management for KV Cache, including preheating, tiering, and eviction (a minimal tiered-cache sketch follows this summary) [6][7]
- Unlike existing solutions that focus primarily on prefix caching, UCM incorporates a broader range of algorithms, including full-pipeline sparsification and suffix retrieval algorithms, enhancing its reliability and effectiveness [6][7]
- UCM is designed to adapt to various inference scenarios, allowing smooth optimization across different input and output conditions [6][7]

Group 4: Open Source Initiative and Industry Collaboration
- Huawei plans to open source UCM in September, providing a unified interface that can adapt to various inference engines, computing hardware, and storage systems, promoting collaboration across the industry [7]
- The company aims to address efficiency and cost issues in the AI industry by fostering a collaborative ecosystem among framework vendors, storage providers, and computing power suppliers [7]
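Huawei has not published UCM's interfaces, so the following is only a minimal sketch of the kind of tiered KV-cache lifecycle described above (preheating, tiering, eviction); the class, method names, and tier sizes are all hypothetical, not UCM's actual design.

```python
from collections import OrderedDict

# Hypothetical tiers, ordered from fastest/smallest to slowest/largest.
TIERS = ["HBM", "DRAM", "SSD"]

class TieredKVCache:
    """Toy lifecycle manager for KV-cache blocks: preheat, tier, evict."""

    def __init__(self, capacity_blocks):
        # capacity_blocks: dict like {"HBM": 4, "DRAM": 16, "SSD": 64}
        self.capacity = capacity_blocks
        # One LRU-ordered map of block_id -> payload per tier.
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, block_id, kv_block, tier="HBM"):
        """Insert a KV block, demoting the least-recently-used block if full."""
        store = self.tiers[tier]
        store[block_id] = kv_block
        store.move_to_end(block_id)
        if len(store) > self.capacity[tier]:
            victim_id, victim = store.popitem(last=False)   # LRU victim
            nxt = TIERS.index(tier) + 1
            if nxt < len(TIERS):
                self.put(victim_id, victim, TIERS[nxt])      # demote (tiering)
            # else: evicted entirely; a future miss would trigger recomputation

    def get(self, block_id):
        """Look up a block; promote it toward HBM on a hit (preheat hot data)."""
        for tier in TIERS:
            if block_id in self.tiers[tier]:
                kv_block = self.tiers[tier].pop(block_id)
                self.put(block_id, kv_block, "HBM")          # promote on reuse
                return kv_block
        return None  # cache miss: caller re-runs prefill for this block

# Usage: cache prefix blocks of a long prompt, then reuse one on a follow-up turn.
cache = TieredKVCache({"HBM": 2, "DRAM": 4, "SSD": 8})
for i in range(6):
    cache.put(f"prefix-{i}", {"keys": ..., "values": ...})
hit = cache.get("prefix-0")   # promoted back to HBM if it had been demoted
```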
Surging Demand for Large Model Inference Drives Up the Share of Inference Computing Power; the STAR Semiconductor ETF (588170) Jumps 1.40% at the Open!
Mei Ri Jing Ji Xin Wen· 2025-08-13 02:33
Group 1
- The semiconductor materials and equipment theme index on the STAR Market has seen a strong increase of 1.57%, with notable gains from stocks such as Zhongchuan Special Gas (+20.01%) and Shanghai Hejing (+8.36%) [1]
- The STAR Semiconductor ETF (588170) has risen by 4.09% over the past month, with a current price of 1.09 yuan and a trading volume of 33.85 million yuan [1]
- The STAR Semiconductor ETF has experienced significant growth in scale, with an increase of 589.47 million yuan over the past week and a rise of 6 million shares [1]

Group 2
- IDC predicts that by 2027 the share of inference computing power in China's intelligent computing will rise from approximately 41% in 2023 to 72.6% [2]
- The demand for large model inference is expected to double, with infrastructure increasingly focused on inference capabilities [2]
- Domestic AI capital expenditure is anticipated to maintain rapid growth, supported by regulatory measures aimed at enhancing network and data security [2]

Group 3
- The STAR Semiconductor ETF (588170) tracks the semiconductor materials and equipment theme index, focusing on companies in semiconductor equipment (59%) and materials (25%) [3]
- The semiconductor materials and equipment industry is a key area for domestic substitution, benefiting from the expansion of semiconductor demand driven by the AI revolution [3]
Interview with Houmo Intelligence CEO Wu Qiang: In the Future, 90% of Data Processing May Happen at the Edge and on Devices
Guan Cha Zhe Wang· 2025-07-30 06:41
Core Insights
- The World Artificial Intelligence Conference (WAIC 2025) highlighted the development of domestic computing power chips, particularly the M50 chip from Houmo Intelligence, designed for large model inference in AI PCs and smart terminals [1][4]
- Houmo Intelligence's CEO, Wu Qiang, emphasized a shift in the focus of large models from training to inference, and from cloud intelligence to edge and endpoint intelligence [1][4]

Company Overview
- Houmo Intelligence was founded in 2020, focusing on high-performance AI chip development based on compute-in-memory (integrated storage and computing) technology [3]
- The M50 chip is seen as a significant achievement for Houmo Intelligence, showcasing its advances over the past two years [3]

Product Specifications
- The M50 chip delivers 160 TOPS of INT8 and 100 TFLOPS of bFP16 physical computing power, with a maximum memory of 48GB and a bandwidth of 153.6 GB/s, while maintaining a typical power consumption of only 10W [4]
- Houmo Intelligence's product matrix covers computing solutions from edge to endpoint, including the LQ50 Duo M.2 card for AI PCs and companion robots [4]

Market Positioning
- Wu Qiang stated that domestic companies should adopt differentiated technological paths rather than directly copying international giants like NVIDIA and AMD [4]
- Houmo Intelligence aims to combine compute-in-memory technology with large models to enable offline usability and data privacy [4]

Future Developments
- The release of the M50 chip is viewed as a starting point, with more chips planned to address computing power, power consumption, and bandwidth issues in edge and endpoint AI computing [5]
- Houmo Intelligence has begun research on next-generation DRAM-PIM technology, which aims to achieve 1 TB/s of on-chip bandwidth and triple the energy efficiency of current levels [9]

Target Markets
- The M50 chip is applicable in various fields, including consumer terminals, smart offices, and smart industries, with a focus on offline processing to mitigate data transmission risks [8]
- Potential customers include Lenovo's next-generation AI PC, iFlytek's smart voice devices, and China Mobile's new 5G+AI edge computing equipment [8]
Stanford's Large Model Reasoning Course Is Now Free, Taught by the Founder of Google's Reasoning Team
量子位· 2025-07-25 07:59
Core Viewpoint
- The article discusses the reasoning capabilities of large language models (LLMs) and emphasizes the importance of intermediate reasoning steps in enhancing model confidence and accuracy in problem solving [5][10][34]

Group 1: Importance of Reasoning in LLMs
- Reasoning in LLMs refers to the intermediate thought process produced before the final answer, which can significantly improve the model's ability to solve complex problems [5][11]
- Introducing a chain of thought (CoT) allows LLMs to tackle inherently serial problems without needing to expand the model size, thus bridging the gap between Transformers and Turing machines [12][13]
- The presence of reasoning steps increases the accuracy and reliability of answers, reducing the likelihood of random guessing [14][17]

Group 2: Enhancing Model Confidence
- Answers derived from a reasoning process come with greater confidence, as they rest on logical deductions rather than mere guesses [19][20]
- Denny Zhou notes that pre-trained models already possess reasoning capabilities even without fine-tuning, although greedy decoding tends not to surface these reasoning paths [21][24]

Group 3: Methods to Improve Reasoning
- The CoT-decoding method selects reasoning paths from the top-k decoding alternatives, enhancing performance on reasoning tasks and approaching the effectiveness of instruction-tuned models (a minimal sketch follows this summary) [26]
- Supervised fine-tuning (SFT) trains models on human-written step-by-step solutions, but it may generalize poorly to new scenarios [27][28]
- Reinforcement learning fine-tuning has emerged as a powerful method for eliciting reasoning, encouraging longer responses and improving model performance through iterative training [31]

Group 4: Future Directions
- Denny Zhou identifies key areas for future breakthroughs, including tasks without unique, verifiable answers and practical applications beyond benchmark testing [35][40]
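To make the CoT-decoding idea concrete, here is a minimal sketch of branching on the top-k first tokens and ranking the completed paths by a confidence margin. `step_logits` is a hypothetical stand-in for the model's next-token distribution (a callable from a token list to per-token logits), and the actual method measures confidence specifically over the answer tokens rather than the whole path.

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: v / z for t, v in exps.items()}

def cot_decode(step_logits, prompt, k=5, max_len=64, eos="<eos>"):
    """Branch on the top-k first tokens, finish each branch greedily,
    and rank branches by their average top-1/top-2 probability margin.
    `prompt` is a list of tokens."""
    first = softmax(step_logits(prompt))
    starts = sorted(first, key=first.get, reverse=True)[:k]
    best = None
    for start in starts:
        tokens, margins = [start], []
        while len(tokens) < max_len and tokens[-1] != eos:
            probs = softmax(step_logits(prompt + tokens))
            ranked = sorted(probs, key=probs.get, reverse=True)
            top1 = ranked[0]
            top2 = ranked[1] if len(ranked) > 1 else ranked[0]
            margins.append(probs[top1] - probs[top2])  # confidence gap
            tokens.append(top1)
        # Paths that contain a chain of thought tend to show a larger margin
        # on their answer tokens; here we average over the whole branch.
        score = sum(margins) / max(len(margins), 1)
        if best is None or score > best[0]:
            best = (score, tokens)
    return best  # (confidence, decoded tokens)

# Tiny toy model over a 4-token vocabulary, just to show the call shape.
def toy_step_logits(prefix):
    return {"think": 1.0, "guess": 0.8, "42": 0.5, "<eos>": 0.2 + 0.5 * len(prefix)}

print(cot_decode(toy_step_logits, prompt=["Q:", "2+2?"], k=2, max_len=6))
```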
Does AI Really Need to Think "Like a Human"? AlphaOne Reveals a "Way of Thinking" Tailored to Large Models
机器之心· 2025-06-23 07:44
Core Viewpoint
- The article discusses a new reasoning framework called AlphaOne, which has AI models adopt a "slow thinking first, fast thinking later" approach at test time, in contrast to the traditional human-like reasoning paradigm [4][5][6]

Group 1: Introduction of AlphaOne
- AlphaOne introduces a global reasoning-control hyperparameter α that allows models to switch from slow to fast reasoning without additional training, significantly improving reasoning accuracy and efficiency [6][12]
- The framework challenges the assumption that AI must think like humans, proposing a more effective reasoning strategy [6][4]

Group 2: Mechanism of AlphaOne
- The core mechanism of AlphaOne is a unified control point called the α-moment, which dictates when to transition from slow to fast thinking (see the sketch below) [16][18]
- Before the α-moment, the model uses a probability-driven strategy to sustain deep reasoning; after the α-moment, it switches to a fast thinking mode [20][24]

Group 3: Experimental Results
- In experiments across six reasoning tasks, AlphaOne achieved higher accuracy than existing models, including a notable +6.15% gain for a 1.5-billion-parameter model [28][29]
- Despite employing a slow thinking mechanism, AlphaOne reduced the average number of generated tokens by 14%, showcasing its efficiency [30]

Group 4: Scalability and Flexibility
- The α-moment allows the length of the thinking phase to be scaled, increasing or decreasing the number of slow-thinking markers according to the value of α [34]
- The framework maintains robust performance across a wide range of α values, indicating its generality [34]

Group 5: Future Directions
- The article suggests potential future research directions, including more sophisticated slow-thinking scheduling strategies and cross-modal reasoning applications [46][48]
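AlphaOne's exact scheduling is not reproduced here; this is a loose sketch assuming the α-moment is expressed as a fraction of a fixed thinking-token budget. The slow-thinking cue, injection probability, and end-of-thinking marker are illustrative placeholders, not the framework's actual values.

```python
import random

def alpha_one_schedule(thinking_budget, alpha, p_slow=0.3,
                       slow_token="Wait,", end_think="</think>"):
    """Toy schedule: before the alpha-moment, stochastically inject slow-thinking
    cues; at the alpha-moment, close the thinking phase to force fast answering."""
    alpha_moment = int(alpha * thinking_budget)   # position where control flips
    actions = []
    for step in range(thinking_budget):
        if step < alpha_moment:
            # Probability-driven slow thinking: occasionally append a reflection
            # cue so the model keeps deliberating instead of rushing to answer.
            if random.random() < p_slow:
                actions.append((step, "inject", slow_token))
        else:
            # Fast-thinking phase: emit the end-of-thinking marker and let the
            # model produce its answer without further deliberation cues.
            actions.append((step, "emit", end_think))
            break
    return actions

# Example: with a 2048-token thinking budget and alpha = 0.6, slow-thinking cues
# may appear in roughly the first 1228 positions; after that the phase is closed.
print(alpha_one_schedule(thinking_budget=2048, alpha=0.6)[:5])
```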
Half the Industry Is Coming! Full Speaker Lineup for the China AI Computing Power Conference Revealed, with Agendas Announced for the Heterogeneous Mixed-Training and Supernode Workshops
傅里叶的猫· 2025-06-17 15:30
Core Viewpoint
- The 2025 China AI Computing Power Conference will be held on June 26 in Beijing, focusing on the evolving landscape of AI computing power driven by DeepSeek technology [1][2]

Group 1: Conference Overview
- The conference will feature nearly 30 prominent speakers delivering keynotes, reports, and discussions on AI computing power [1]
- It includes a main venue for high-level forums and specialized discussions, as well as closed-door workshops for select attendees [2]

Group 2: Keynote Speakers
- Notable speakers include Li Wei from the China Academy of Information and Communications Technology, who will discuss cloud computing standards [4][8]
- Wang Hua, Vice President of Moore Threads, will present on training large models using FP8 precision [12][13]
- Yang Gongyifan, CEO of Zhonghao Xinying, will share insights on high-end chip design and development [14][16]
- Xu Lingjie, CEO of Magik Compute, will address the evolution of compilation technology in AI infrastructure [18][22]
- Chen Xianglin from Qujing Technology will discuss innovations in optimizing large model inference [28][31]

Group 3: Specialized Forums
- The conference will host specialized forums on AI inference computing power and intelligent computing centers, featuring industry leaders discussing cutting-edge technologies [2][4]
- The closed-door workshops will focus on heterogeneous mixed-training and supernode technologies, aimed at industry professionals [2][67][71]

Group 4: Ticketing and Participation
- The conference offers various ticket types, including free audience tickets and paid VIP tickets, with an application process for attendance [72]
Lossless Mathematical Reasoning with Just 10% of the KV Cache! This Open-Source Method Solves the "Memory Overload" Problem of Reasoning Models
量子位· 2025-06-16 04:49
Core Viewpoint
- R-KV offers a highly efficient compression method that turns the "rambling" of reasoning models into controllable memory entries, reducing memory usage by 90%, increasing throughput by 6.6 times, and maintaining 100% accuracy [1]

Group 1: R-KV Methodology
- R-KV follows a three-step process of redundancy identification, importance assessment, and dynamic eviction to address redundancy in large model reasoning [5]
- The method compresses key/value (KV) caches in real time during decoding, retaining only important, non-redundant tokens (a minimal scoring sketch follows this summary) [7]
- R-KV combines importance scoring with redundancy filtering to preserve critical context while eliminating noise, leading to successful task completion [15]

Group 2: Performance Metrics
- On challenging mathematical benchmarks, R-KV significantly outperformed baseline methods, reaching 34% accuracy with R1-Llama-8B and 54% with R1-Qwen-14B on the MATH-500 dataset [19]
- R-KV delivered substantial memory savings and throughput gains, with a maximum memory saving of 90% and a throughput of 2525.75 tokens per second [20][21]
- The method allows larger batch sizes without sacrificing task performance, indicating its efficiency on long reasoning tasks [21]

Group 3: Application Scenarios
- R-KV suits edge devices that require long-chain reasoning, enabling even consumer-grade GPUs and mobile NPUs to run complex models [22]
- The method can accelerate reinforcement learning sampling and is plug-and-play, requiring no training [22]
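R-KV's exact scoring functions are not given above, so the following is only a rough sketch of the general importance-versus-redundancy selection it describes: attention mass as importance, cosine similarity to already-kept keys as redundancy. The weighting, greedy loop, and array shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def r_kv_select(keys, attn_scores, budget, lam=0.5):
    """Toy R-KV-style selection: keep `budget` tokens that score high on
    importance (attention received) and low on redundancy (similarity to
    already-kept keys). keys: (n, d) array; attn_scores: (n,) array."""
    n = keys.shape[0]
    normed = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    importance = attn_scores / (attn_scores.max() + 1e-8)

    kept, candidates = [], set(range(n))
    while len(kept) < budget and candidates:
        best, best_score = None, -np.inf
        for i in candidates:
            # Redundancy = max cosine similarity to any token already kept.
            redundancy = max((float(normed[i] @ normed[j]) for j in kept),
                             default=0.0)
            score = lam * importance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)  # indices of KV entries to retain

# Example: compress a 12-token cache down to 4 entries.
rng = np.random.default_rng(0)
keys = rng.normal(size=(12, 8))
attn = rng.random(12)
print(r_kv_select(keys, attn, budget=4))
```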
Technical Highlights and Deployment Practices of the SGLang Inference Engine | AICon Beijing Preview
AI前线· 2025-06-13 06:42
Core Insights
- SGLang has gained significant traction in the open-source community, reaching nearly 15K GitHub stars and over 100,000 monthly downloads by June 2025, indicating its popularity and performance [1]
- Major industry players such as xAI, Microsoft Azure, NVIDIA, and AMD have adopted SGLang in their production environments, showcasing its reliability and effectiveness [1]
- In May 2025, SGLang released a fully open-source large-scale expert-parallel deployment solution, noted as the only one able to reproduce the performance and cost reported in the official blog [1]

Technical Advantages
- SGLang's core advantages are its high-performance implementation and easily modifiable code, which differentiate it from other open-source solutions [3]
- Key techniques such as prefill-decode (PD) separation, speculative decoding, and KV cache offloading have been developed to improve performance and resource utilization while reducing cost [4][6]

Community and Development
- The SGLang community plays a crucial role in driving technological evolution and application deployment, with industrial deployment experience at a scale of over 100,000 GPUs guiding technical direction [5]
- SGLang's open-source nature encourages broad participation and contribution, fostering a sense of community and accelerating adoption [5]

Performance Optimization Techniques
- PD separation eliminates the latency fluctuations caused by prefill requests interrupting decoding, yielding more stable and uniform decoding latency [6]
- Speculative decoding reduces decoding latency by drafting and verifying multiple tokens at once, significantly increasing decoding speed (a minimal sketch follows this summary) [6]
- KV cache offloading stores previously computed KV caches in larger storage devices, cutting recomputation and response latency in multi-turn dialogues [6]

Deployment Challenges
- Developers often overlook the tuning of numerous configuration parameters, which can significantly affect deployment efficiency even when ample computing resources are available [7]
- The complexity of parallel deployment introduces compatibility challenges and requires careful resource management and load balancing [4][7]

Future Directions
- Growing model scale requires more GPUs and more efficient parallel strategies for high-performance, low-cost deployment [7]
- The upcoming AICon event in Beijing will focus on AI technology advances and industry applications, providing a platform for further discussion of these topics [8]
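This is not SGLang's implementation, just a minimal sketch of the draft-and-verify principle behind speculative decoding in its greedy-verification form. `draft_next` and `target_next` are hypothetical stand-ins for the small draft model and the large target model; real systems verify the whole draft in a single batched forward pass and use probabilistic acceptance when sampling.

```python
def speculative_decode(draft_next, target_next, prompt,
                       gamma=4, max_new=32, eos="<eos>"):
    """Toy draft-and-verify loop: the draft model proposes `gamma` tokens, the
    target model checks them, and the longest agreeing prefix is accepted,
    amortizing one (conceptual) target pass over several emitted tokens."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new and (not out or out[-1] != eos):
        # 1. Draft: cheaply propose a short continuation.
        proposal, ctx = [], list(out)
        for _ in range(gamma):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Verify: accept draft tokens only while the target model agrees.
        accepted, ctx = [], list(out)
        for tok in proposal:
            if target_next(ctx) == tok:
                accepted.append(tok)
                ctx.append(tok)
                if tok == eos:
                    break
            else:
                break
        out.extend(accepted)
        # 3. If no end-of-sequence yet, append one target token so progress
        #    is guaranteed even when the whole draft was rejected.
        if not accepted or accepted[-1] != eos:
            out.append(target_next(out))
    return out[len(prompt):]

# Toy check: a "target" that spells the alphabet and a "draft" that slips once.
alphabet = list("abcdef") + ["<eos>"]
oracle = lambda seq: alphabet[min(len(seq), len(alphabet) - 1)]
noisy = lambda seq: "z" if len(seq) == 5 else alphabet[min(len(seq), len(alphabet) - 1)]
print(speculative_decode(noisy, oracle, prompt=[], gamma=3))
```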
Large Model Inference Must Be Cost-Effective
虎嗅APP· 2025-06-06 10:10
Core Insights
- The article discusses the evolution and optimization of the Mixture of Experts (MoE) architecture, highlighting Huawei's MoGE design, which addresses inefficiencies of the original MoE and improves cost-effectiveness and ease of deployment [1][3]

Group 1: MoE Model Evolution
- The MoE architecture has become a key path to improving large model inference efficiency thanks to its dynamic sparse computation [3]
- Huawei's Pangu Pro MoE 72B model significantly reduces computational cost and ranks first domestically in the SuperCLUE benchmark among models with over 100 billion parameters [3]
- Through system-level optimization, Pangu Pro MoE achieves a 6-8x improvement in inference performance and reaches a throughput of 321 tokens/s on the Ascend 300I Duo [3][30]

Group 2: Optimization Strategies
- Huawei's H2P (Hierarchical & Hybrid Parallelism) strategy improves inference efficiency by confining specialized communication to task-specific groups rather than holding a "full team meeting" [5][6]
- The TopoComm optimization reduces communication overhead and improves data transmission efficiency, cutting synchronization operations by 35% [8][10]
- The DuoStream optimization overlaps communication with computation, significantly improving inference efficiency [11]

Group 3: Operator Fusion
- Huawei developed two specialized fused operators, MulAttention and SwiftGMM, to optimize memory access and computation scheduling, yielding substantial performance gains [15][17]
- MulAttention speeds up attention computation by 4.5x and achieves over 89% data transfer efficiency [17]
- SwiftGMM accelerates GMM computation by 2.1x and reduces end-to-end inference latency by 48.7% [20]

Group 4: Inference Algorithm Acceleration
- The PreMoE algorithm dynamically prunes experts in the MoE model, improving throughput by over 10% while maintaining accuracy (a minimal routing-with-pruning sketch follows this summary) [25]
- The TrimR algorithm trims unnecessary reasoning steps by 14% by monitoring and adjusting the model's reasoning process [26]
- The SpecReason algorithm uses smaller models to assist larger ones, yielding a 30% increase in throughput [27]

Group 5: Performance Breakthroughs
- The Ascend 800I A2 platform delivers exceptional performance, with single-card throughput of 1528 tokens/s under optimal conditions [30][31]
- The Ascend 300I Duo platform offers a cost-effective option for MoE inference, reaching a maximum throughput of 321 tokens/s [32][33]
- Overall, Huawei's optimizations establish a solid foundation for high-performance, large-scale, low-cost inference [33]
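PreMoE's actual pruning criterion is not described above; the sketch below only illustrates the general idea of restricting MoE top-k routing to a task-relevant subset of experts so that pruned experts are never loaded or computed. The expert count, the "relevant" subset, and the gating are all hypothetical.

```python
import numpy as np

def route_with_pruning(router_logits, active_experts, k=2):
    """Toy dynamic-pruning router: mask out experts not relevant to the current
    task, then pick the top-k among the survivors and renormalize their gates.
    router_logits: (num_experts,) scores; active_experts: iterable of indices."""
    mask = np.full_like(router_logits, -np.inf)
    idx = np.fromiter(active_experts, dtype=int)
    mask[idx] = router_logits[idx]                  # keep only relevant experts
    topk = np.argsort(mask)[-k:][::-1]              # top-k among survivors
    weights = np.exp(mask[topk] - mask[topk].max())
    weights /= weights.sum()                        # renormalized gate weights
    return list(zip(topk.tolist(), weights.tolist()))

# Example: 16 experts in total, but only 6 are profiled as relevant for,
# say, math-heavy prompts; the router never touches the other 10.
rng = np.random.default_rng(1)
logits = rng.normal(size=16)
print(route_with_pruning(logits, active_experts=[0, 3, 5, 7, 11, 13], k=2))
```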
A "Killer" Combination for MoE Inference: Ascend × Pangu Boosts Inference Performance 6-8x
机器之心· 2025-06-06 09:36
Core Viewpoint
- The article highlights the advances in Huawei's Pangu Pro MoE 72B model, whose efficiency in large model inference comes from innovative techniques and system-level optimizations that deliver substantial performance improvements for AI applications [2][23]

Group 1: Model Performance and Optimization
- Pangu Pro MoE achieves a 6-8x improvement in inference performance through system-level optimization, including high-performance operator fusion and model-native speculative algorithms [3][23]
- Throughput reaches 321 tokens/s on the Ascend 300I Duo and can soar to 1528 tokens/s on the Ascend 800I A2, showing how fully the model exploits the hardware [3][24]

Group 2: Hierarchical and Hybrid Parallelism
- Huawei introduces a Hierarchical & Hybrid Parallelism (H2P) strategy that improves efficiency by giving each part of the model its own communication and computation groups instead of engaging all devices simultaneously (a minimal group-assignment sketch follows this summary) [6][7]
- This strategy yields a 33.1% increase in decode throughput over conventional parallelization [7]

Group 3: Communication Optimization
- The TopoComm optimization scheme reduces static overhead and improves data transmission efficiency, achieving a 35% reduction in synchronization operations and a 21% increase in effective bandwidth [9][12]
- A mixed-precision quantized communication strategy reduces communication volume by 25% and cuts AllGather communication time by 39% [9]

Group 4: Operator Fusion and Efficiency
- The MulAttention and SwiftGMM fused operators address the inefficiencies of standard operators, significantly improving memory access and computation scheduling [15][18]
- MulAttention accelerates attention computation by 4.5x, while SwiftGMM reduces inference latency by 48.7% [16][18]

Group 5: Dynamic Pruning and Collaborative Optimization
- The PreMoE dynamic pruning algorithm raises inference throughput by over 10% by activating only the experts relevant to a given task [21]
- The TrimR and SpecReason algorithms streamline the reasoning process, reducing unnecessary computation and improving throughput by 30% [20][22]

Group 6: Overall System Optimization
- The end-to-end optimization of the Ascend-Pangu inference system establishes a robust foundation for high-performance, large-scale, cost-effective AI model deployment [28]
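Huawei's H2P layout is not specified in detail above; this toy sketch only illustrates the underlying idea of hierarchical, module-specific communication groups, where expert collectives stay inside small groups and attention collectives span only the ranks that share the same expert slot. The group sizes and the two-level split are assumptions, not the actual Ascend deployment topology.

```python
def build_hybrid_groups(world_size, experts_per_group=4):
    """Toy H2P-style layout: instead of one global communication group, give the
    expert module its own expert-parallel groups and the attention module small
    cross-cutting groups, so each collective involves only the ranks it needs."""
    assert world_size % experts_per_group == 0
    # Expert-parallel groups: contiguous blocks of ranks (e.g. one per node).
    expert_groups = [list(range(g, g + experts_per_group))
                     for g in range(0, world_size, experts_per_group)]
    # Attention groups: one rank taken from each expert group (same slot index).
    attn_groups = [[grp[i] for grp in expert_groups]
                   for i in range(experts_per_group)]
    return {"expert_parallel": expert_groups, "attention_parallel": attn_groups}

# Example with 8 devices: expert all-to-all stays inside groups of 4, while each
# attention collective only spans the 2 ranks holding the same expert slot.
groups = build_hybrid_groups(world_size=8, experts_per_group=4)
for name, gs in groups.items():
    print(name, gs)
```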