A new breakthrough in large-model infra! Tencent Hunyuan open-sources an LLM inference operator library, boosting inference throughput by 30%
TENCENT (HK:00700) QbitAI · 2026-01-22 11:13

Core Viewpoint
- In the competition among large models, computational efficiency has become a critical bottleneck for AI applications and development, forcing a shift from merely stacking GPUs to squeezing more efficiency out of each one [1][7].

Group 1: HPC-Ops Development
- Tencent's Hunyuan AI Infra team has open-sourced HPC-Ops, a high-performance LLM inference core operator library, to address the poor performance of mainstream operator libraries on inference GPUs such as the NVIDIA H20 [2][15].
- HPC-Ops is built from scratch on CUDA and CuTe, with deep architecture-specific adaptations and optimizations that lower the development threshold for core operators while achieving significant performance breakthroughs [4][15].

Group 2: Performance Improvements
- With HPC-Ops, inference performance improves by 30% for the Hunyuan model and by 17% for the DeepSeek model [5][27].
- Against current SOTA baselines, HPC-Ops achieves up to 2.22x the performance of FlashInfer/FlashAttention on Attention, 1.88x DeepGEMM on GroupGEMM, and 1.49x TensorRT-LLM on FusedMoE [6][47].

Group 3: Pain Points of Existing Operator Libraries
- Mainstream operator libraries are costly to adopt: their designs are complex and demand deep familiarity with the code, making adaptation difficult for ordinary AI researchers [11].
- Existing state-of-the-art (SOTA) operator libraries often fail to exploit the hardware's full performance potential, particularly on inference cards like the H20, whose characteristics differ from high-end training cards [8][13].

Group 4: Technical Innovations
- HPC-Ops provides FusedMoE, Attention, and GroupGEMM modules, with optimizations that match task characteristics to hardware capabilities, reaching over 80% of peak hardware bandwidth [20][47].
- The library employs persistent kernels to hide overhead and uses innovative data-rearrangement techniques, achieving results superior to current SOTA implementations [24][28].
Group 5: Future Development Directions
- HPC-Ops plans to develop sparse Attention operators to relieve the memory and compute bottlenecks of long-context large models, and to expand its quantization strategies to include mixed precision [50].
- The library will also explore computation-communication coordination to reduce communication overhead in distributed inference, supporting the efficient deployment of ultra-large models [51].