Guotai Haitong (国泰海通) | Electronics: AI Phone Offline Inference Speed Hinges on Breaking the Memory Bandwidth Bottleneck

Core Viewpoint
- The current bottleneck in on-device inference speed is memory bandwidth rather than compute. NPU+DRAM stacking delivers a significant jump in memory bandwidth, and the industry trend toward it is clear [1][2].

Group 1: Inference Speed and Memory Bandwidth
- The Qualcomm Snapdragon 8 Gen 3 offers roughly 45 TOPS of NPU compute and about 67 GB/s of memory bandwidth. Running a 7B large model, compute caps decoding at roughly 3,214 tokens/s while memory bandwidth caps it at about 4.8 tokens/s; the final speed is the lower of the two, so memory bandwidth is the binding constraint [2].
- A hands-on test on a Xiaomi phone running the Qwen3-8B-MNN model decoded 222 tokens with an average response time of 32 seconds, well short of the 40-50 tokens/s needed for a fluid user-perceived experience [2].

Group 2: 3D DRAM Solution
- 3D DRAM can relieve the memory limit for edge AI. By stacking DRAM on the NPU with hybrid bonding (HB), raising memory bandwidth to 800 GB/s would lift the memory-bound ceiling to about 57 tokens/s [3].
- Key players include Chinese company Zhaoyi Innovation and its subsidiary Qingyun Technology, Taiwanese memory IDM Winbond, and mobile AP leader Qualcomm, all pursuing the 3D DRAM+NPU route, indicating a clear technological direction [3].

Group 3: Hardware and Model Development
- The industry is currently in a phase where hardware leads model development; future growth is expected to come from models capitalizing on hardware gains. Hardware solutions require extensive stability testing before large-scale commercial deployment [3].
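The report's tokens/s figures follow from a simple roofline-style estimate. A minimal sketch, assuming FP16 weights (2 bytes per parameter) and roughly 2 operations per parameter per decoded token (multiply + accumulate), with the Snapdragon 8 Gen 3 numbers quoted above; the constants are illustrative, not measured:

```python
# Roofline-style decode-speed estimate for a 7B FP16 model on a mobile SoC.
# Figures taken from the report (Snapdragon 8 Gen 3): ~45 TOPS NPU, ~67 GB/s DRAM.
PARAMS = 7e9            # model parameters (7B)
BYTES_PER_PARAM = 2     # FP16 weights
NPU_OPS = 45e12         # ~45 TOPS of NPU compute, ops/s
BANDWIDTH = 67e9        # ~67 GB/s memory bandwidth, bytes/s

# Compute bound: each token needs ~2 ops per parameter.
compute_limit = NPU_OPS / (2 * PARAMS)                  # ~3,214 tokens/s
# Memory bound: every weight must be streamed from DRAM once per token.
memory_limit = BANDWIDTH / (PARAMS * BYTES_PER_PARAM)   # ~4.8 tokens/s
# Achievable speed is the lower of the two ceilings.
actual_limit = min(compute_limit, memory_limit)

# Same model with a hypothetical 800 GB/s 3D-DRAM stack:
memory_limit_3d = 800e9 / (PARAMS * BYTES_PER_PARAM)    # ~57 tokens/s

print(f"compute-bound: {compute_limit:.0f} tok/s, "
      f"memory-bound: {memory_limit:.1f} tok/s, "
      f"with 3D DRAM: {memory_limit_3d:.0f} tok/s")
```

The gap between the two ceilings (three orders of magnitude apart in the compute case, but memory-bound in practice) is what makes bandwidth, not TOPS, the figure to watch.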
- Qualcomm must adopt strategies suited to AI large-model devices to hedge against a potential "GPU moment" in mobile AI by the end of 2025 or in 2026; companies that are ready with the right hardware and models could enjoy a roughly one-year window of opportunity [3].