Core Viewpoint
- High Bandwidth Flash (HBF) aims to address the memory bottleneck in artificial intelligence by stacking NAND flash to deliver HBM-class bandwidth with roughly 16 times the capacity. In practice, however, HBF faces significant challenges that may keep it from living up to its initial promise [2][30].

Group 1: Background and Challenges
- The bottleneck in AI workloads is no longer compute but memory: an accelerator such as NVIDIA's H100 (989 TFLOPS) must be fed data fast enough to keep its compute units busy. HBM3 provides the needed bandwidth at 819 GB/s per stack, but its critical weakness is capacity, which tops out at 192 GB per GPU [5][6].
- The key-value (KV) cache for large models such as Llama 3.1 405B requires substantial memory: a precomputed cache needs roughly 540 GB for 1 million tokens and 5.4 TB for 10 million tokens, far beyond what HBM can hold [6][11].
- HBF's advantage is capacity: about 3 TB at the same 8 TB/s of bandwidth, with NAND costing roughly one-fifth as much as HBM, suggesting significant economic benefits [6][8].

Group 2: H³ Architecture and Assumptions
- The H³ architecture combines HBM and HBF, acknowledging that HBF alone is insufficient. HBM connects directly to the GPU for maximum bandwidth, while HBF is attached through a daisy chain [8][9].
- H³ rests on three core assumptions: most LLM inference data is read-only, access patterns are deterministic, and a 40 MB SRAM buffer can effectively hide HBF's latency [9][10].
- Simulation results indicate that, under ideal conditions, H³ achieves 1.25 times the throughput of HBM alone at 1 million tokens and 6.14 times at 10 million tokens, with up to a 2.69-fold improvement in power efficiency [10][11].
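The KV-cache figures in Group 1 can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 3.1 405B configuration (126 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage; the article's slightly higher 540 GB figure may reflect a different precision or per-token overheads.

```python
# Rough KV-cache sizing for Llama 3.1 405B.
# Assumed configuration (from the published model card, not from the
# article itself): 126 layers, 8 KV heads (GQA), head dim 128, fp16.

def kv_cache_bytes(tokens: int,
                   layers: int = 126,
                   kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer/head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

print(f"per token: {kv_cache_bytes(1) / 1e3:.0f} KB")
print(f"1M tokens: {kv_cache_bytes(1_000_000) / 1e9:.0f} GB")
print(f"10M tokens: {kv_cache_bytes(10_000_000) / 1e12:.2f} TB")
```

Under these assumptions the cache is about 516 KB per token, i.e. roughly 0.5 TB at 1 million tokens and over 5 TB at 10 million, consistent with the article's claim that HBM alone cannot hold it.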
Group 3: Limitations of the Assumptions
- The assumption that model weights and shared KV caches are read-only holds only partially in practice: real LLM services involve frequent updates and model version control [11][12].
- The physical limits of NAND flash, with access latencies far higher than DRAM's, are a fundamental challenge that architectural design alone cannot overcome [13][30].
- HBF's cost structure is complicated by the additional components it requires, such as SRAM and DRAM, which raise overall system cost even though NAND chips themselves are cheap [15][16].

Group 4: Alternative Solutions and Market Dynamics
- HBF is slated for sample testing in 2026-2027, while alternatives such as HBM4 and CXL memory are maturing rapidly and offer different paths to expanding memory capacity [20][23][24].
- HBM4 is expected to deliver 1.5 TB/s of bandwidth and 32-48 GB of capacity, potentially narrowing HBF's capacity advantage [23].
- CXL memory enables scalable memory pooling across multiple servers, offering significant flexibility and better resource utilization, and major industry players have already begun production [24][26].

Group 5: Strategic Importance of HBF
- Despite the challenges, HBF represents a strategic shift in the memory industry from commodity supply to platform-based solutions, enabling closer collaboration with customers and the potential for higher profit margins [28][29].
- The collaboration between SK Hynix and SanDisk on HBF technology is a strategic move to explore the integration of storage technologies and platform solutions beyond the success of a single product [29].
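The latency gap noted in Group 3 can be made concrete with a Little's-law style estimate: to sustain a given bandwidth from a memory with a given access latency, at least bandwidth times latency bytes must be in flight (i.e. prefetched or buffered). The latency figures below are illustrative assumptions for DRAM-class and NAND-class memory, not vendor specifications.

```python
# Minimal latency-hiding estimate. To keep an accelerator fed, a
# prefetch buffer must cover at least the bandwidth-latency product.
# All latency numbers here are assumed, order-of-magnitude values.

def min_bytes_in_flight(bandwidth_bps: float, latency_s: float) -> float:
    """Little's law: bytes in flight = bandwidth * latency."""
    return bandwidth_bps * latency_s

BW = 8e12           # 8 TB/s target bandwidth (figure from the article)
DRAM_LATENCY = 1e-7  # ~100 ns, assumed DRAM-class access latency
NAND_LATENCY = 2e-5  # ~20 us, assumed NAND-class read latency

print(f"DRAM-class: {min_bytes_in_flight(BW, DRAM_LATENCY) / 1e6:.1f} MB in flight")
print(f"NAND-class: {min_bytes_in_flight(BW, NAND_LATENCY) / 1e6:.1f} MB in flight")
```

With these assumed numbers, a NAND-backed pipeline needs on the order of a hundred megabytes in flight to hide latency at 8 TB/s, which illustrates why deterministic access patterns and aggressive prefetching into on-die SRAM are central to H³'s feasibility argument.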
The next HBM: can HBF deliver?
Semiconductor Industry Observation (半导体行业观察) · 2026-02-20 03:46