Investment Rating
- The investment rating for the industry is "Positive" and is maintained [6]

Core Insights
- On February 24, 2025, DeepSeek open-sourced FlashMLA, a high-efficiency decoding kernel designed for NVIDIA's Hopper-architecture GPUs. The kernel is already used in production at DeepSeek and significantly improves LLM inference efficiency through KV cache compression and GPU compute optimization [2][4]

Summary by Sections

Event Description
- The FlashMLA decoding kernel was released on February 24, 2025, and is tailored to NVIDIA's Hopper-architecture GPUs. It has been deployed in production at DeepSeek, delivering a notable improvement in LLM inference efficiency [4]

Event Commentary
- FlashMLA employs a paged KV cache with a block size of 64, addressing the performance bottleneck of growing memory usage and compute cost as sequence lengths increase in LLMs. The Multi-head Latent Attention (MLA) technique projects keys and values into a lower-dimensional latent space, shrinking the KV cache while preserving model quality and thereby accelerating inference [9]
- The paged KV cache further improves DeepSeek's inference efficiency by splitting cached data into fixed-size blocks, raising memory utilization and reducing decoding latency, which also suits deployment on edge devices. On an H800 SXM5 GPU running CUDA 12.6, FlashMLA reaches up to 3000 GB/s in memory-bound workloads and 580 TFLOPS in compute-bound workloads [9]
- FlashMLA supports BF16 (brain floating point 16) precision, which reduces memory usage and speeds up computation while retaining sufficient accuracy for most AI tasks, easing the deployment of large models in resource-constrained environments.
This support broadens DeepSeek's application scenarios and improves performance on long-sequence language inference, making it better suited to document analysis and extended dialogues [9]
- The new wave of technology-supply revolution is driving a revaluation of the domestic AI industry. DeepSeek is expected to significantly accelerate application deployment and expand AI computing demand. Key areas to focus on include: 1) the domestic inference-computing industry chain, in particular leading AI chip company Cambricon; 2) cloud service providers, especially those collaborating with DeepSeek; 3) IDC firms working with major companies such as Tencent, Alibaba, and ByteDance; 4) AI application-related targets, particularly AI+ government, AI+ finance, AI+ healthcare, and AI+ education [9]
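The MLA idea described above — caching a low-dimensional latent projection of keys and values instead of the full tensors — can be sketched in a few lines of NumPy. This is a toy illustration, not DeepSeek's implementation: the dimensions (`d_model`, `d_latent`) and weight names are illustrative assumptions, and real MLA learns these projections during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not DeepSeek's actual configuration
d_model, d_latent, seq_len = 1024, 128, 512

# Shared down-projection into the latent space, plus separate
# up-projections to reconstruct keys and values (hypothetical names)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.02
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.02

x = rng.standard_normal((seq_len, d_model))  # token hidden states

# Cache only the latent vectors: seq_len x d_latent floats ...
latent_cache = x @ W_down

# ... and reconstruct full-width keys/values on the fly at decode time
k = latent_cache @ W_up_k
v = latent_cache @ W_up_v

# A conventional cache stores full K and V (2 * d_model floats per token);
# MLA stores d_latent floats per token
full_cache_floats = seq_len * d_model * 2
mla_cache_floats = seq_len * d_latent
print(full_cache_floats / mla_cache_floats)  # 16.0x smaller in this toy setup
```

The compression ratio here (16x) is purely a function of the chosen toy dimensions; the point is that cache size scales with `d_latent` rather than `d_model`, which is what frees memory for longer sequences.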
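The paged KV cache mechanism can likewise be sketched as a block allocator: logical token positions in a sequence map to fixed-size physical blocks, so memory is claimed one block at a time rather than reserved up front for the maximum sequence length. The 64-entry block size matches FlashMLA's published configuration, but the allocator below is a minimal toy, not FlashMLA's actual data structure.

```python
BLOCK_SIZE = 64  # FlashMLA's paged KV cache uses a block size of 64

class PagedKVCache:
    """Toy paged KV cache: sequences map to fixed-size physical blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple:
        """Return the physical (block, offset) slot for token `pos`."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:      # previous block is full: grab a new one
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0, pos=p) for p in range(130)]
# 130 tokens occupy ceil(130 / 64) = 3 blocks
print(len(cache.block_tables[0]))  # 3
```

Because blocks are allocated on demand and can live anywhere in GPU memory, many sequences of different lengths share one memory pool without fragmentation, which is the source of the memory-efficiency and latency benefits described above.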
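The BF16 format mentioned above keeps float32's 8-bit exponent but truncates the mantissa to fit in 16 bits, halving memory per value. A minimal sketch of that conversion (round-to-nearest-even truncation of the low 16 bits, emulated in NumPy since NumPy has no native bfloat16 dtype) shows the precision that remains:

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16: keep float32's sign and 8-bit exponent, round the
    mantissa to 7 bits. Result is stored back in float32 for inspection."""
    bits = x.astype(np.float32).view(np.uint32)
    # Round to nearest even on the low 16 bits, then zero them out
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return (rounded & 0xFFFF0000).view(np.float32)

# pi survives with ~3 significant decimal digits at half the storage
print(to_bf16(np.array([np.pi], dtype=np.float32))[0])  # 3.140625
```

With 2 bytes per value instead of 4, weights and the KV cache shrink by half versus float32, which is why BF16 support matters for fitting large models and long contexts into constrained memory.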
Software and Services: AI Industry Express: DeepSeek Open-Sources the FlashMLA Decoding Kernel, Further Improving LLM Inference Efficiency
Changjiang Securities · 2025-02-27 01:43