Four Approaches to Inference Chips, by David Patterson
半导体行业观察 (Semiconductor Industry Observation) · 2026-01-19 01:54
Core Insights
- The article discusses the challenges of and research directions for large language model (LLM) inference hardware, arguing that the binding constraints are memory and interconnect rather than raw computational power, and that innovative solutions are needed on those fronts [1][3].

Group 1: Challenges in LLM Inference
- LLM inference is fundamentally different from training because of the autoregressive decoding phase, whose bottlenecks lie in memory and interconnect rather than computational capacity [3][5] (a back-of-envelope arithmetic-intensity sketch follows this summary).
- The rapid growth in LLM usage has driven up the cost of serving state-of-the-art models, making the economic feasibility of inference a first-order concern [5][6].
- The rise of mixture-of-experts (MoE) models, which route each token to a small subset of expert subnetworks, further increases memory and communication demands during inference [5][6] (see the MoE sizing sketch below).

Group 2: Current Limitations of LLM Inference Hardware
- Existing GPU/TPU inference systems are often scaled-down versions of training systems, leading to inefficiencies, particularly in the decoding phase [10][11].
- Memory bandwidth has not kept pace with the growth in floating-point operations per second (FLOPS): NVIDIA 64-bit GPU performance grew 80× from 2012 to 2022, while memory bandwidth grew only 17× [12][14] (the resulting compute-to-bandwidth gap is worked out numerically below).
- The cost of high-bandwidth memory (HBM) has risen sharply, with prices up roughly 1.35× from 2023 to 2025 owing to manufacturing complexity [16][18].

Group 3: Research Directions for LLM Inference Hardware
- Four promising research directions are proposed to address the challenges of LLM inference:
1. High-Bandwidth Flash (HBF), which can provide roughly 10× the memory capacity [28] (a KV-cache capacity sketch follows this summary).
2. Processing-Near-Memory (PNM) technologies that raise effective memory bandwidth [33].
3. 3D memory-logic stacking, which achieves high bandwidth at lower power [37].
4. Low-latency interconnects that improve communication efficiency [38][40].

Group 4: Performance and Cost Metrics
- New performance-per-cost metrics are emphasized, centered on total cost of ownership (TCO), average power consumption, and carbon emissions, which give system designers new optimization targets [25][26] (a minimal perf/TCO calculation appears at the end).
- Efficiently scaling memory bandwidth and capacity, together with faster interconnects, is highlighted as critical for LLM decoding performance [26][42].

Group 5: Future Implications
- Advances in LLM inference hardware are expected to foster collaboration across the industry, driving the innovations essential for cost-effective AI inference [43].
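To see why decoding stresses memory rather than compute, consider the arithmetic intensity of a single decode step. The sketch below is illustrative only: the 70B-parameter model, 16-bit weights, and accelerator specs are assumptions chosen for round numbers, not figures from the article.

```python
# Back-of-envelope: arithmetic intensity of one autoregressive decode step.
# All model/hardware numbers below are illustrative assumptions, not from the article.

params = 70e9          # hypothetical dense model: 70B parameters
bytes_per_param = 2    # 16-bit weights
batch = 1              # latency-sensitive decode, batch size 1

# Each decode step multiplies activations against (roughly) every weight once:
flops_per_step = 2 * params * batch          # ~2 FLOPs per parameter (multiply + add)
bytes_per_step = params * bytes_per_param    # every weight must be read from memory

intensity = flops_per_step / bytes_per_step  # FLOPs per byte moved
print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")  # -> 1.0 FLOPs/byte

# A hypothetical accelerator with 1000 TFLOPS and 3 TB/s of HBM bandwidth
# needs ~333 FLOPs/byte to stay compute-bound, so batch-1 decode leaves the
# compute units idle ~99.7% of the time; bandwidth, not FLOPS, is the limit.
peak_flops = 1000e12
peak_bw = 3e12
print(f"compute-bound threshold: {peak_flops / peak_bw:.0f} FLOPs/byte")
```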
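The MoE point can be made concrete with a sizing sketch. The expert counts and sizes below are hypothetical assumptions, not from the article; what matters is the scaling: capacity demand grows with the total number of experts while compute grows only with the number activated per token.

```python
# Illustrative MoE sizing (assumed numbers, not from the article): an MoE
# layer activates only a few experts per token, so FLOPs stay modest while
# the full parameter set must still be resident in (fast) memory.

n_experts = 64            # experts per MoE layer (assumption)
top_k = 2                 # experts activated per token (assumption)
expert_params = 0.5e9     # parameters per expert (assumption)
bytes_per_param = 2       # 16-bit weights

resident = n_experts * expert_params * bytes_per_param   # must be stored
touched  = top_k * expert_params * bytes_per_param       # used per token

print(f"resident expert weights per layer: {resident/1e9:.0f} GB")  # -> 64 GB
print(f"weights touched per token:         {touched/1e9:.1f} GB")   # -> 2.0 GB

# Capacity demand scales with n_experts while compute scales with top_k,
# which is why MoE inference stresses memory capacity and the interconnect
# (routing tokens to the devices holding the chosen experts) more than FLOPS.
```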
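The 80× and 17× figures cited in Group 2 directly imply how much the compute-to-bandwidth balance worsened over that decade; the one-line calculation:

```python
# Using the growth factors quoted in the summary (2012-2022): every byte
# fetched from memory must now feed ~4.7x more FLOPs to keep the chip busy.

flops_growth = 80   # GPU floating-point throughput growth (from the article)
bw_growth    = 17   # memory bandwidth growth over the same period (from the article)

gap = flops_growth / bw_growth
print(f"compute:byte ratio grew {gap:.1f}x")  # -> 4.7x
```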
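To gauge why roughly 10× more capacity (as HBF promises) matters, consider the KV cache at long context lengths. The model shape below is a hypothetical assumption, not a configuration from the article.

```python
# Why ~10x memory capacity matters: the KV cache for long contexts can
# dwarf HBM capacity. All model numbers below are illustrative assumptions.

layers   = 80        # transformer layers (assumption)
kv_heads = 8         # KV heads after grouped-query attention (assumption)
head_dim = 128       # dimension per head (assumption)
bytes_el = 2         # 16-bit KV cache entries

# Per token: 2 tensors (K and V) x layers x kv_heads x head_dim x bytes
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_el

context = 1_000_000  # a 1M-token context (assumption)
total = kv_bytes_per_token * context
print(f"KV cache per token: {kv_bytes_per_token/1024:.0f} KiB")       # -> 320 KiB
print(f"KV cache at {context:,} tokens: {total/1e9:.0f} GB")          # -> 328 GB

# ~328 GB exceeds a single accelerator's HBM, but comes within reach if a
# flash tier adds ~10x capacity behind the HBM.
```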
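Finally, a minimal sketch of a performance-per-TCO metric, under a deliberately simple TCO model (capex amortization plus electricity). Every number below is an assumption for illustration; the article's actual TCO model is not reproduced here.

```python
# Toy perf-per-TCO metric: tokens served per lifetime dollar.
# All inputs are assumptions chosen for illustration, not from the article.

capex        = 30_000     # accelerator + share of host/network, USD (assumption)
lifetime_yr  = 4          # amortization period (assumption)
avg_power_kw = 1.0        # average draw incl. cooling overhead (assumption)
usd_per_kwh  = 0.10       # electricity price (assumption)
tokens_per_s = 5_000      # sustained decode throughput (assumption)

hours = lifetime_yr * 365 * 24
tco = capex + avg_power_kw * hours * usd_per_kwh   # USD over the lifetime

tokens_total = tokens_per_s * hours * 3600
print(f"TCO over {lifetime_yr} years: ${tco:,.0f}")                      # -> $33,504
print(f"cost per million tokens: ${tco / (tokens_total / 1e6):.4f}")     # -> $0.0531

# Designs are then compared on tokens/s per TCO dollar (and per watt, per
# gram of CO2), rather than on peak FLOPS alone.
```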