Easing GPU Memory Anxiety: H3C Launches an Acceleration Solution for Large-Model Inference
Jing Ji Guan Cha Wang (Economic Observer) · 2026-02-03 03:56

Group 1
- The core storage supply chain faces a structural shortage expected to persist into 2027, driven by generative AI's demand for high-bandwidth, large-capacity GPU memory [2]
- Relying on hardware stacking to meet this demand is unsustainable: it raises the cost per token and hampers the healthy development of the AI industry [2]
- Improving the efficiency of key components such as GPUs through software-hardware co-design is essential to ease supply-chain shortages and lower total cost of ownership [2]

Group 2
- The collaboration between H3C and Pliops targets the exponential growth in compute and memory demands of large-model inference, which is constrained by the cost and energy efficiency of stacking GPU hardware [3]
- A Pliops custom ASIC offloads the KV Cache from GPU memory to dedicated storage nodes, creating a new memory tier designed for AI and relieving GPU memory pressure [3][4]
- The solution supports both single-machine deployment and external storage nodes that accelerate inference across multiple AI servers [4]

Group 3
- Benchmarks run by H3C on its high-performance AI servers show significant inference gains with the KV Cache offloading acceleration scheme [7]
- The number of supported concurrent users increased by 200%, time to first token (TTFT) dropped by 70%, and average time per output token (TPOT) fell by 30% [7]

Group 4
- The solution suits a range of enterprise GenAI applications, including interactive workloads such as chatbots and intelligent customer service, where fast loading of historical KV Cache improves user experience [8]
- It also handles long-context tasks spanning thousands of tokens, offering PB-scale KV Cache expansion to avoid the performance degradation caused by GPU memory limits [8]
- Efficient KV Cache management raises the throughput of online inference services, letting the same GPU resources serve more concurrent requests [8]

Group 5
- Continued innovation in inference acceleration is crucial for the future of AI infrastructure; H3C is committed to building scenario-specific acceleration solutions that help enterprises and developers manage the complexity of large-model applications [9]
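The KV Cache offloading described above can be pictured as a two-tier cache: hot attention blocks stay in GPU memory, cold ones spill to an external tier and are reloaded instead of being recomputed by prefill. The sketch below is purely illustrative; the article does not describe the Pliops ASIC's internals or H3C's integration, so the storage node is simulated with an in-process dict, and all names here (TieredKVCache, gpu_capacity_blocks) are invented for this sketch.

```python
# Hedged sketch of two-tier KV Cache offloading. The "storage node" is a
# plain dict standing in for the dedicated external tier; real systems move
# tensors over a fabric, which is not modeled here.
class TieredKVCache:
    """Keeps hot KV blocks in (simulated) GPU memory; spills cold ones to storage."""

    def __init__(self, gpu_capacity_blocks):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu = {}       # hot tier: block_id -> (K, V)
        self.storage = {}   # cold tier: simulated external storage node
        self.lru = []       # access order, oldest first

    def put(self, block_id, k, v):
        if len(self.gpu) >= self.gpu_capacity:
            victim = self.lru.pop(0)              # evict least-recently-used
            self.storage[victim] = self.gpu.pop(victim)
        self.gpu[block_id] = (k, v)
        self.lru.append(block_id)

    def get(self, block_id):
        if block_id in self.gpu:                  # hot hit: serve directly
            self.lru.remove(block_id)
            self.lru.append(block_id)
            return self.gpu[block_id]
        if block_id in self.storage:              # reload from storage instead
            k, v = self.storage.pop(block_id)     # of re-running prefill
            self.put(block_id, k, v)
            return k, v
        return None                               # miss: caller must recompute


cache = TieredKVCache(gpu_capacity_blocks=2)
for i in range(4):  # four blocks into a two-block hot tier: two spill out
    k = [0.0] * 1024  # flattened K tensor, illustrative size
    v = [0.0] * 1024  # flattened V tensor, illustrative size
    cache.put(i, k, v)
```

After the loop, blocks 2 and 3 sit in the hot tier while 0 and 1 have been spilled; a later `cache.get(0)` pulls block 0 back without recomputation, which is the effect the article credits with cutting TTFT for returning sessions.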
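The benchmark percentages above translate into concrete numbers as follows. Only the deltas (+200% concurrency, -70% TTFT, -30% TPOT) come from the article; the baseline figures (100 users, 1.0 s TTFT, 50 ms TPOT) are invented here purely to make the arithmetic visible.

```python
# Illustrative arithmetic for the reported benchmark deltas.
# Baselines are assumptions, not figures from the article.
baseline = {"users": 100, "ttft_s": 1.0, "tpot_ms": 50.0}

accelerated = {
    "users": baseline["users"] * (1 + 2.00),    # +200% concurrency = 3x users
    "ttft_s": baseline["ttft_s"] * (1 - 0.70),  # -70% time to first token
    "tpot_ms": baseline["tpot_ms"] * (1 - 0.30) # -30% time per output token
}

print(round(accelerated["users"]))      # 3x the assumed 100 users
print(round(accelerated["ttft_s"], 2))  # first token arrives far sooner
print(round(accelerated["tpot_ms"], 2)) # steady-state decoding also faster
```

Note that a 200% increase means tripling, not doubling: the same GPUs serve three times the concurrent users because evicted KV Cache no longer forces recomputation.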
