Large Model Inference
Are affordable digital products about to say goodbye to us?
虎嗅APP· 2025-12-15 10:26
Core Viewpoint
- The duration and intensity of memory price increases are likely underestimated, with global DRAM shortages expected to persist until the end of 2028 according to an internal SK Hynix report [2][3]

Group 1: Market Dynamics
- The memory price surge is driven by a significant imbalance between supply and demand, primarily due to the explosive growth in AI chip demand and the transition of large models into the inference stage [4][5]
- The price of DDR5 16GB (5600MHz) memory modules has risen from around 300 yuan at the beginning of the year to at least 899 yuan, an increase of roughly 200% in under a year [4]
- LPDDR5X prices are also rising, with a reported 83% increase in Q4 compared to the beginning of the year [4]

Group 2: Supply Chain Adjustments
- Major memory manufacturers are reallocating production capacity toward higher-margin HBM chips, reducing the supply of DDR memory for consumer electronics and traditional servers [5][6]
- AI chip manufacturers are increasingly opting for DDR memory over HBM in certain applications, as DDR can meet the requirements of context-inference tasks at a lower cost [6][7]

Group 3: Consumer Impact
- Consumers can expect widespread price increases across digital products, including smartphones and computers, with the magnitude hard to predict given the ongoing memory price cycle [10][11]
- The cost gap between HBM and DDR memory is significant: HBM3e costs approximately $10-17 per GB versus $1.5-2.2 per GB for DDR5, shaping manufacturers' production priorities [10][11]
- The trend points to fewer relatively affordable electronic products on the market, as manufacturers may optimize product lines rather than simply raise prices [11][12]
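The HBM/DDR cost gap quoted above can be made concrete with a quick back-of-the-envelope calculation. This is a sketch using only the per-GB price ranges from the article; the 96 GB module size is a hypothetical example, not a figure from the source:

```python
# Back-of-the-envelope cost comparison using the article's per-GB price ranges.
HBM3E_USD_PER_GB = (10.0, 17.0)   # quoted range for HBM3e
DDR5_USD_PER_GB = (1.5, 2.2)      # quoted range for DDR5

def module_cost(gb, usd_per_gb_range):
    """Return the (low, high) cost in USD for a module of `gb` gigabytes."""
    lo, hi = usd_per_gb_range
    return gb * lo, gb * hi

# Hypothetical 96 GB capacity, purely illustrative.
hbm_lo, hbm_hi = module_cost(96, HBM3E_USD_PER_GB)
ddr_lo, ddr_hi = module_cost(96, DDR5_USD_PER_GB)
print(f"96 GB as HBM3e: ${hbm_lo:.0f}-${hbm_hi:.0f}")
print(f"96 GB as DDR5:  ${ddr_lo:.0f}-${ddr_hi:.0f}")
print(f"HBM premium (midpoints): {((hbm_lo + hbm_hi) / 2) / ((ddr_lo + ddr_hi) / 2):.1f}x")
```

Even at the low end of the HBM range and the high end of the DDR range, the same capacity costs several times more as HBM, which is why manufacturers prioritize it and why DDR remains attractive for cost-sensitive inference workloads.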
Building a production-grade cloud-native LLM inference platform with SGLang RBG + Mooncake
AI前线· 2025-12-12 00:40
Core Insights
- The article emphasizes the rapid evolution of large language model (LLM) inference services into core enterprise infrastructure, focusing on balancing performance, stability, and cost when building high-performance inference systems [2]
- It discusses the transition from monolithic to distributed architectures in LLM inference, highlighting the need for external KVCache to alleviate memory pressure and enhance performance in high-demand scenarios [2][4]

Distributed KVCache and Mooncake
- Mooncake is introduced as a leading distributed KVCache storage engine designed to provide high throughput and low latency for inference frameworks like SGLang [3]
- The article outlines the challenges of managing distributed KVCache systems in production environments, which motivated the development of RoleBasedGroup (RBG) for unified management of caching and inference nodes [4]

RoleBasedGroup (RBG) Design and Challenges
- RBG is presented as a Kubernetes-native API aimed at AI inference, facilitating multi-role orchestration to ensure stable and high-performance operations [4][12]
- The article identifies five fundamental challenges in deploying large-model inference services, including the need for strong state management and performance optimization [12][15]

SCOPE Framework
- The SCOPE framework is introduced, focusing on five core capabilities essential for managing LLM inference services: Stability, Coordination, Orchestration, Performance, and Extensibility [16][18]
- RBG's design allows for rapid architecture iteration and performance-sensitive operations, addressing the complexities of multi-role dependencies and operational efficiency [15][24]

Benchmark Testing and Performance Metrics
- Benchmark tests demonstrate significant improvements in KVCache hit rates and inference performance, with the L3 Mooncake cache achieving a 64.67% hit rate and reducing average TTFT to 2.58 seconds [32][48]
- The article highlights the importance of a multi-tier caching architecture in enhancing performance for applications like multi-turn dialogue and AI agents [44]

Conclusion and Future Outlook
- The integration of RBG and Mooncake is positioned as a transformative approach to building production-grade LLM inference services, emphasizing the need to deeply integrate high-performance design with cloud-native operational capabilities [43][44]
- The article concludes with a call for community collaboration to advance this paradigm and lay the foundation for the next generation of AI infrastructure [43]
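The hit-rate and TTFT gains above come from reusing KV cache across requests that share a prompt prefix (e.g. a system prompt repeated across turns). The idea can be sketched with a toy prefix cache; the block size and hashing scheme here are invented for the sketch and are not Mooncake's actual implementation:

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical granularity)

class PrefixKVCache:
    """Toy prefix cache: stores KV blocks keyed by a hash of the token prefix."""
    def __init__(self):
        self.store = {}
        self.hits = 0

    def _key(self, tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def lookup(self, tokens):
        """Return the number of leading tokens whose KV blocks are cached."""
        matched = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tuple(tokens[:end])) in self.store:
                self.hits += 1
                matched = end       # prefix match extends through this block
            else:
                break               # prefix diverged; prefill the rest
        return matched

    def insert(self, tokens):
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.store[self._key(tuple(tokens[:end]))] = f"kv[:{end}]"

cache = PrefixKVCache()
system_prompt = list(range(8))                   # shared prefix across turns
cache.insert(system_prompt + [101, 102, 103, 104])
reused = cache.lookup(system_prompt + [201, 202, 203, 204])  # same first 8 tokens
print(f"reused {reused} of 12 prefill tokens from cache")
```

Every reused token skips prefill compute, which is why a deeper cache tier with a higher hit rate translates directly into lower TTFT for multi-turn workloads.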
When compute can't keep up with intelligence: the AI industry's gap and breakout in 2026 (with an 86-page deck)
材料汇· 2025-12-10 15:51
Core Insights
- The rapid evolution of AI is outpacing the development of computing infrastructure, leading to a significant computing-power gap expected to widen by 2026. The gap will manifest in two key areas: growing demand for core computing capabilities across chips, storage, packaging, and cooling, and a shift toward edge computing to reduce cloud latency and costs, resulting in an explosion of applications from AI smartphones to integrated robots [1]

Industry Overview
- The electronic sector reached a record high in Q3 2025, driven by AI, with the electronic index rising 44.5% year-to-date and outperforming the CSI 300 index by 26.6% [12][13]
- The semiconductor sector has shown significant growth, with sub-sectors posting substantial year-to-date gains: PCB (+114%), consumer electronics (+51%), and semiconductors (+40%) [12][13]
- The overall electronic industry reported revenue growth of 19% and net-profit growth of 35% in Q1-Q3 2025, with all major segments showing positive growth [18][24]

Performance Metrics
- Inventory levels in the electronic sector have risen, particularly in consumer electronics and PCBs, indicating strong demand and recovery in terminal markets [22][25]
- Monthly semiconductor sales growth has rebounded since June 2023, with a notable increase in demand for digital, storage, and equipment segments [34][41]

AI Impact on Semiconductor Cycle
- The semiconductor market is entering an upward cycle, with significant growth in capital expenditures from both domestic and international cloud service providers, driven by AI demand [41][42]
- Major cloud providers are expected to increase capital expenditures significantly, with projections indicating 50%-60% growth in 2026 [43]

Consumer Electronics Trends
- Global smartphone sales are projected to recover, with a forecast of 1.29 billion units in 2024, a 6.1% year-on-year increase [26][27]
- The PC market is also expected to grow, with global sales reaching 263 million units in 2024, a 1.0% year-on-year increase [27][29]

Automotive Sector Insights
- The automotive market is experiencing a weak recovery, with global sales expected to reach 92.23 million units in 2025, a 1.8% year-on-year increase [39]
- The penetration rate of electric vehicles is projected to rise, with global penetration expected to reach 20% in 2025 [39]

AI Narrative Acceleration
- Competition among AI model developers has intensified, with significant advances in model capabilities and applications across various sectors [47][50]
- AI-related spending is expected to reach $3-4 trillion by 2030, driven by the need for greater computing power and more applications [58]

Edge Computing and Hardware Development
- The shift toward edge computing is becoming crucial, with predictions that the global edge AI market will grow to ¥1.2 trillion by 2029 at a CAGR of 39.6% [69]
- Major AI companies are actively entering the edge hardware market to enhance user experience and profitability [69]
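The edge-AI forecast above (¥1.2 trillion by 2029 at a 39.6% CAGR) can be sanity-checked with the compound-growth formula. The 2024 base year used here is an assumption for illustration, since the article does not state the starting year of the forecast:

```python
# Compound annual growth: value_end = value_start * (1 + cagr) ** years
cagr = 0.396
years = 5                 # assumed horizon: 2024 -> 2029
target_2029 = 1.2         # trillion yuan, from the article

growth_factor = (1 + cagr) ** years
implied_base = target_2029 / growth_factor
print(f"growth factor over {years} years: {growth_factor:.2f}x")
print(f"implied base-year market size: ~{implied_base:.2f} trillion yuan")
```

A 39.6% CAGR compounds to roughly a 5.3x expansion over five years, which is what makes the gap between AI demand and infrastructure buildout so pronounced.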
NeurIPS 2025 | DynaAct: exploring another path for large-model reasoning beyond DeepSeek R1
机器之心· 2025-11-29 09:33
Core Insights
- The article discusses the emergence of a new paradigm in large-model reasoning, shifting from train-time scaling to test-time scaling (TTS) and emphasizing the need for efficient inference rather than merely longer reasoning chains [3][10]
- The research team from Ant Group and the University of Hong Kong introduces DynaAct, a novel approach that focuses on dynamic action-space optimization to improve reasoning efficiency [4][7]

Group 1: DynaAct Overview
- DynaAct is based on the principle of action-space optimization: it dynamically constructs a set of selectable actions at each reasoning step, enabling more structured and efficient inference [7][11]
- The core idea is to recast action-space learning as a set-selection problem, using submodular optimization to obtain linear-complexity algorithms [14]

Group 2: Methodology and Implementation
- DynaAct employs a submodular function with utility and diversity components, measuring how well candidate actions fit the current state and how redundant the actions within the set are [14]
- The implementation is backed by a high-performance Monte Carlo Tree Search (MCTS) framework, which significantly improves the efficiency of node expansion, rollout, and reward calculation [19]

Group 3: Performance and Results
- DynaAct outperforms traditional methods such as CoT, RAP, and rStar across six reasoning benchmarks, demonstrating the effectiveness of dynamic action spaces [21]
- Evaluation results show DynaAct scoring 70.22 on the MMLU benchmark, surpassing other models, with a stable test-time scaling trend as MCTS rollout iterations increase [22][25]

Group 4: Future Directions
- The research team plans to extend dynamic action spaces to multi-agent planning scenarios and to combine submodular optimization with reinforcement learning for adaptive reasoning strategies [26]
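The set-selection idea behind DynaAct — pick a small action set that scores high on utility to the current state while penalizing redundancy — can be sketched as greedy maximization of a utility-plus-diversity objective. The objective, action names, and scores below are a toy construction; DynaAct's actual submodular function and features are defined in the paper:

```python
def greedy_select(candidates, utility, similarity, k, lam=0.5):
    """Greedily pick k actions maximizing utility minus redundancy.

    utility[a]       : relevance of action a to the current state
    similarity[a][b] : pairwise redundancy between actions
    lam              : trade-off between utility and diversity
    """
    selected = []
    while len(selected) < k:
        best, best_gain = None, float("-inf")
        for a in candidates:
            if a in selected:
                continue
            # Marginal gain: own utility minus overlap with what we already picked.
            redundancy = max((similarity[a][s] for s in selected), default=0.0)
            gain = utility[a] - lam * redundancy
            if gain > best_gain:
                best, best_gain = a, gain
        selected.append(best)
    return selected

actions = ["decompose", "verify", "recall_fact", "recompute", "paraphrase"]
utility = {"decompose": 0.9, "verify": 0.8, "recall_fact": 0.6,
           "recompute": 0.75, "paraphrase": 0.3}
# "verify" and "recompute" are nearly interchangeable in this toy setup.
sim = {a: {b: 0.0 for b in actions} for a in actions}
sim["verify"]["recompute"] = sim["recompute"]["verify"] = 0.9

print(greedy_select(actions, utility, sim, k=3))
```

Note how "recompute" is skipped despite its high utility because it is redundant with the already-selected "verify"; greedy selection over a monotone submodular objective is what gives the linear-complexity guarantee mentioned in the summary.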
Huawei unveils its "near-trillion-parameter MoE inference" play, open-sourcing two killer optimization techniques
机器之心· 2025-11-28 04:11
Report from 机器之心; editor: Du Wei (杜伟)

As 2025 draws to a close, large models have accelerated this year from single-point efficiency tools into foundational infrastructure underpinning business systems. In that process, inference efficiency determines whether large models can truly land in production. For ultra-large-scale MoE models, complex inference pipelines bring challenges in computation, communication, and memory access, and the industry urgently needs an efficient, controllable inference path.

Huawei has unveiled a complete technology stack for near-trillion-parameter MoE inference: openPangu-Ultra-MoE-718B-V1.1, which demonstrates the model potential of the MoE architecture, together with Ascend-affinity acceleration techniques including the Omni Proxy scheduling feature and the AMLA technique that pushes Ascend hardware compute utilization to 86%. Together these make production-grade deployment of ultra-large-scale MoE models realistically feasible. Open-source implementation: https://gitcode.com/ascend-tribe/ascend-inference-cluster#

If the focus of large-model competition in recent years was training scale and capability breakthroughs, inference efficiency is now rapidly becoming the key variable determining whether a model can be deployed.

Model GitCode address: https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1-Int8

From the perspective of task characteristics, ...
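The efficiency argument for MoE inference rests on sparse activation: only a few experts run per token, so a 718B-parameter model activates only a fraction of its weights per step. A minimal top-k gating sketch illustrates the mechanism (generic MoE routing, not openPangu's specific router; the logits are made up):

```python
import math

def topk_gate(logits, k=2):
    """Select the k highest-scoring experts and renormalize their softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

num_experts = 8
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]   # router scores for one token
routing = topk_gate(logits, k=2)
print(routing)                                         # experts 1 and 4 carry the token
print(f"only {2 / num_experts:.0%} of experts computed for this token")
```

The flip side of this sparsity is that tokens scatter across experts on different devices, which is exactly where the communication and memory-access challenges mentioned above come from.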
芯片ETF(512760)连续5日净流入超4亿元,英伟达公布芯片出货预期
Mei Ri Jing Ji Xin Wen· 2025-11-05 07:05
Group 1
- Nvidia announced Blackwell and Rubin architecture cabinet solutions at GTC 2025: the first-generation Rubin NVL144 delivers roughly 3.3x the performance of the GB300 NVL72, and the second-generation Rubin Ultra 576 is expected to improve performance by about 14x, launching in the second half of 2027 [1]
- The Vera Rubin Superchip was showcased, featuring an 88-core Arm CPU, dual Rubin GPUs, and 2TB of memory, delivering 100 PFLOPS of compute; the CPX computing board was also presented, supporting context acceleration for over one million tokens and aimed at large-model inference scenarios [1]
- Nvidia anticipates GPU sales exceeding $500 billion over the next five quarters, with projected lifecycle shipments of Blackwell and Rubin reaching 20 million units, far above Hopper's 4 million [1]

Group 2
- Nvidia will collaborate with Oracle and the U.S. Department of Energy to build the Solstice and Equinox supercomputing systems, deploying 100,000 and 10,000 Blackwell GPUs respectively for a total of approximately 2200 EFLOPS, expected to be operational by the first half of 2026 [1]
- The chip ETF (512760) tracks the China Semiconductor Chip Index (990001), which selects listed companies involved in semiconductor materials, equipment, design, manufacturing, packaging, and testing from the Shanghai and Shenzhen markets to reflect the overall performance of the semiconductor industry [1]
Jinqiu Fund joins Micro Nano Core's 100-million-yuan-plus financing; the first 3D computing-in-memory 3D-CIM™ chips open a new chapter in large-model inference | Jinqiu Spotlight
锦秋集· 2025-10-30 13:34
Core Insights
- Jinqiu Fund has completed an investment in Micro Nano Core, a leading AI chip company specializing in 3D integrated computing-in-memory (CIM) technology [2][4][12]
- The investment highlights the market's strong consensus on the disruptive potential of 3D-CIM™ technology in AI computing applications, aiming to enhance edge AI capabilities [4][6]
- The global market for edge AI chips is projected to grow from $2 billion in 2024 to $16.7 billion by 2028, driven by rising demand for high-performance, low-power, cost-effective solutions [6][7]

Investment Overview
- Micro Nano Core has completed over 100 million yuan in Series B strategic financing, led by BlueRun Ventures with participation from top-tier institutions including Jinqiu Fund [4]
- The funding will support development of what the company calls the world's fastest mass-producible 3D edge AI chips, expected to unlock new AI application scenarios [4][9]

Technology and Innovation
- Micro Nano Core's 3D-CIM™ chip technology combines 3D near-memory computing and computing-in-memory (CIM) to deliver significant gains in performance, power efficiency, and cost [8][12]
- The company claims over 4x the computing density and more than 10x lower power consumption than traditional von Neumann architectures [8][9]
- The RV-CIM™ full-stack technology addresses chip usability, ensuring compatibility with the RISC-V open-source ecosystem [8][10]

Market Trends
- The evolution of AI agents from execution tools to decision-making partners is expected to drive transformation across industries, with China targeting 70% penetration of intelligent terminals by 2027 [6]
- Demand for high-performance, low-power, low-cost chips is critical to widespread AI-agent adoption, with industry consensus leaning toward 3D stacking as the solution [7][8]

Team and Ecosystem
- Micro Nano Core boasts a world-class innovation team with a strong track record in chip design, having published numerous record-breaking results in international competitions [10][13]
- The company is leading development of global RISC-V CIM standards and collaborating with multiple industry leaders to build a robust ecosystem around its technology [11][13]
- Strategic partnerships with major storage and terminal manufacturers aim to create a self-sustaining ecosystem that boosts adoption of 3D-CIM™ chips across applications [11][12]
Exclusive | A conversation with Tensormesh's three co-founders: from academia to the front lines of the large-model inference industry
Z Potentials· 2025-10-24 08:18
Core Insights
- Tensormesh, a company providing cache-accelerated inference optimization for enterprises, has officially launched with $4.5 million in seed funding led by Laude Ventures [2]
- The founding team of Junchen Jiang, Yihua Cheng, and Kuntai Du aims to bridge the gap between AI inference engines and storage services, leveraging their academic backgrounds to build a commercially viable product [3][4]

Company Overview
- Tensormesh is the first commercial platform to productize large-scale AI inference caching, inspired by the open-source project LMCache and combining advanced technology with enterprise-level usability, security, and manageability [2][4]
- The product lets enterprises deploy large-model services easily, cutting operational costs to about one-tenth of public API usage while improving performance by up to ten times over mainstream solutions [4][29]

Funding and Growth
- The funding process was unconventional, relying on personal connections rather than business plans or roadshows, and resulted in a swift investment agreement [5][48]
- The seed funding will primarily go to product refinement and team expansion, with a strategic focus on a strong open-source engine as the entry point for commercial value [5][40]

Market Position and Challenges
- The inference industry is still emerging; with usage growing, inference cost now exceeds training cost, underscoring the need for efficient solutions [30][32]
- Tensormesh addresses three main challenges in deploying large models: privacy concerns, complex cluster management, and high operational costs [26][28]

Product Features and Innovations
- The product offers one-click deployment of in-house large-model services, preserving data privacy while significantly lowering costs and improving performance [29][30]
- Tensormesh aims to fill a market gap by providing a comprehensive solution that integrates inference engines, storage, scheduling, and routing, which the industry currently lacks [38]

Future Aspirations
- The company aspires to become the go-to solution for large-model inference, much as Databricks is for big data [44][45]
- The long-term vision is to evolve with AI advancements so that Tensormesh stays relevant as the industry shifts from single models to more complex systems [51][52]
KTransformers accepted at a top computer-systems conference and partnering with mainstream frameworks: Turing Technology (趋境) & Tsinghua make "heterogeneous" a new inference paradigm
量子位· 2025-10-22 09:12
Core Insights
- KTransformers, an open-source project developed by Turing Technology and Tsinghua University's KVCache.AI team, focuses on system innovation in the inference phase of large models, enabling efficient operation on diverse hardware architectures with lower computational power [2][4]

Group 1: KTransformers Overview
- KTransformers is a high-performance heterogeneous inference framework that makes optimal use of diverse computing resources such as GPUs, CPUs, and memory [2]
- The project's paper was accepted at the prestigious SOSP 2025 conference, underscoring its significance in the field of computer systems [2][4]

Group 2: Technical Innovations
- The framework introduces an "Expert Deferral" mechanism for efficient scheduling of experts in Mixture of Experts (MoE) models, reducing computational load without sacrificing model performance [7][13]
- KTransformers achieves nearly 4x speedup on a single Intel Xeon processor compared with a standard PyTorch implementation, significantly boosting CPU performance in expert computation [12]
- The system dynamically overlaps CPU and GPU load, increasing model throughput by roughly 1.45x with minimal impact on model accuracy [15][16]

Group 3: Collaboration and Ecosystem
- KTransformers has partnered with SGLang, a mainstream inference framework, to integrate full-GPU inference with heterogeneous inference, strengthening the overall architecture for large-model deployment [5][19]
- The collaboration gives developers seamless access to both full-GPU and heterogeneous inference capabilities, which is particularly valuable when GPU resources are limited [21]

Group 4: Market Position and Future Directions
- KTransformers has gained significant traction in the developer community, with over 15.2K stars on GitHub, indicating wide adoption as a foundational framework for large-model inference [24]
- The project aims to democratize AI capabilities, making them accessible beyond elite computational paths, and is actively collaborating with various domestic CPU and GPU platforms to promote cost-effective solutions [28][29]
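KTransformers' heterogeneous idea, keeping the hot path on GPU while overlapping some expert computation on CPU, can be illustrated with a toy placement policy. The weight-ordered split below is invented for illustration; the paper's Expert Deferral mechanism and its scheduling logic are more sophisticated:

```python
def place_experts(routing, gpu_budget):
    """Assign activated experts to GPU (largest weights first) or defer to CPU.

    routing    : list of (expert_id, weight) pairs for one token
    gpu_budget : max number of experts computed on GPU this step
    """
    by_weight = sorted(routing, key=lambda p: p[1], reverse=True)
    gpu = by_weight[:gpu_budget]     # heavy hitters stay on GPU
    cpu = by_weight[gpu_budget:]     # low-weight experts run on CPU, overlapped
    return gpu, cpu

# Hypothetical router output for one token: four activated experts.
routing = [(3, 0.45), (7, 0.30), (1, 0.15), (5, 0.10)]
gpu, cpu = place_experts(routing, gpu_budget=2)
print("GPU:", gpu)
print("CPU:", cpu)
```

Because the CPU-side experts carry small routing weights, running them concurrently with the GPU batch raises throughput with little accuracy impact, which matches the ~1.45x figure reported above.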
Tsinghua and Kuaishou propose AttnRL: letting large models explore with "attention"
机器之心· 2025-10-21 09:32
Core Insights
- The article discusses advances in reinforcement learning (RL), focusing on Process-Supervised RL (PSRL) and introducing a new framework, AttnRL, that improves exploration efficiency and performance in reasoning models [3][4][9]

Group 1: Challenges in Traditional Methods
- Traditional PSRL methods assign equal reward signals to all tokens, neglecting fine-grained quality during the reasoning process [7]
- Existing PSRL approaches face significant bottlenecks in exploration efficiency and training cost, leading to high computational expense [4][10]

Group 2: Introduction of AttnRL
- AttnRL introduces an innovative exploration method that uses attention mechanisms to guide reasoning, letting the model branch from high-attention steps [9][12]
- The framework employs Attention-based Tree Branching (ATB), which analyzes the reasoning sequence and computes Forward Context Influence (FCI) scores to identify the most impactful steps for branching [13][16]

Group 3: Adaptive Sampling Mechanisms
- AttnRL incorporates two adaptive sampling mechanisms, difficulty-aware exploration and dynamic batch adjustment, optimizing learning by focusing on challenging problems while reducing compute spent on simpler ones [20][22]
- Training is streamlined to a one-step off-policy approach, significantly reducing sampling cost compared with previous PSRL methods [23]

Group 4: Experimental Results
- AttnRL demonstrates superior performance across mathematical reasoning benchmarks, achieving average accuracy of 57.2% for 1.5B models and 68.7% for 7B models, outperforming baselines such as GRPO and TreeRL [28]
- The framework samples more efficiently, with a higher effective ratio and better performance in fewer training steps than traditional methods [29][31]

Group 5: Future Outlook
- Introducing attention scores into PSRL exploration decisions opens new avenues for model interpretability and RL research, suggesting that efficiency and intelligence can coexist through more effective exploration strategies [34]
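The Forward Context Influence score described above — how much attention later tokens pay back to an earlier reasoning step — can be sketched over a toy attention matrix. The aggregation below (mean attention mass from subsequent tokens onto each step) is an illustrative simplification, not the paper's exact FCI definition:

```python
def fci_scores(attn, step_ends):
    """Score each reasoning step by the attention later tokens pay to it.

    attn      : lower-triangular attention matrix, attn[q][k] for query q, key k
    step_ends : exclusive index of the last token of each step
    """
    scores = []
    start = 0
    for end in step_ends:
        later = range(end, len(attn))           # queries after this step
        if len(later) == 0:
            scores.append(0.0)                  # last step has no forward context
        else:
            mass = sum(attn[q][k] for q in later for k in range(start, end))
            scores.append(mass / len(later))    # mean forward attention per token
        start = end
    return scores

# 6 tokens, two steps: tokens 0-2 (step 1) and tokens 3-5 (step 2).
attn = [
    [1.0, 0,   0,   0,   0,    0],
    [0.4, 0.6, 0,   0,   0,    0],
    [0.2, 0.3, 0.5, 0,   0,    0],
    [0.5, 0.2, 0.1, 0.2, 0,    0],    # step-2 tokens attend heavily to step 1
    [0.4, 0.3, 0.1, 0.1, 0.1,  0],
    [0.3, 0.3, 0.2, 0.1, 0.05, 0.05],
]
scores = fci_scores(attn, step_ends=[3, 6])
print(scores)  # step 1 receives the most forward attention -> branch there
```

A high FCI score means subsequent reasoning keeps referring back to that step, so branching the search tree there explores the decisions the model itself treats as pivotal.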