Group 1: API Price Reduction
- The introduction of innovative architectures such as MLA in DeepSeek-V2 has cut input and output prices to 1 RMB and 2 RMB per million tokens, respectively, attracting significant industry attention[37]
- Major players such as ByteDance and Alibaba have followed suit, lowering flagship-model API prices to below 10 RMB per million tokens, with lightweight models dropping to under 1 RMB per million tokens[44]
- The broader trend of API price reduction is driven by innovations in model architecture, inference engines, improvements in chip cost-performance, and parameter quantization, which together substantially lower inference costs[46]

Group 2: MoE Architecture
- The Mixture of Experts (MoE) architecture scales model parameters efficiently and reduces computational cost by combining multiple expert networks with a gating network that routes each token to a small subset of experts (a minimal gating sketch follows after these groups)[38]
- DeepSeek-V2 and Snowflake's Arctic model employ fine-grained expert segmentation and shared-expert isolation, significantly improving parameter efficiency[76]
- The next step in MoE research is to develop more heterogeneous architectures that dynamically adjust computational cost to task complexity, further improving model efficiency[74]

Group 3: Attention Mechanism Optimization
- The attention mechanism is central to the success of large language models, but its computational cost grows quadratically with sequence length, prompting the industry to explore simplified variants of multi-head attention (MHA)[67]
- Recent advances such as MLA (Multi-head Latent Attention) and Mamba-2 show significant potential for optimizing or replacing attention, improving performance while reducing compute and memory demands (a simplified latent-attention sketch also follows below)[44]
- The industry is actively seeking alternatives to traditional attention to enhance model efficiency, with ongoing research into state space models such as Mamba[52]
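To make the MoE description in Group 2 concrete, the following is a minimal sketch of a top-k gated MoE layer with one always-active shared expert, written in PyTorch. All dimensions, the expert count, and the dense routing loop are illustrative assumptions for readability, not the configuration of DeepSeek-V2 or Arctic.

```python
# Minimal sketch of a Mixture-of-Experts (MoE) layer: a gating network picks
# top_k routed experts per token, and a shared expert runs for every token
# (illustrating the shared-expert-isolation idea). Sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Routed experts: only top_k of these contribute per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert: always computed, regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Gating network: one score per routed expert.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = self.gate(x)                  # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over selected experts
        out = self.shared_expert(x)            # shared path
        for k in range(self.top_k):
            # Dense loop for clarity; real systems dispatch tokens sparsely
            # so that unselected experts do no work.
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)          # tokens routed to e
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out


if __name__ == "__main__":
    layer = MoELayer()
    print(layer(torch.randn(2, 4, 256)).shape)  # torch.Size([2, 4, 256])
```

The point of the sketch is the cost structure the report describes: total parameters grow with the number of experts, but per-token compute is bounded by top_k plus the shared expert.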
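For Group 3, the following is a simplified sketch of the low-rank key/value compression idea behind MLA: each token's keys and values are reconstructed from a small cached latent, shrinking the KV cache that dominates inference memory. This is an assumption-laden illustration, not the published DeepSeek-V2 formulation (which, among other details, handles rotary position embeddings through a separate path), and all sizes are hypothetical.

```python
# Simplified latent-KV attention sketch: cache a small latent per token and
# up-project it to full keys/values at attention time. Attention scores are
# still O(seq^2); the saving shown here is in KV-cache size.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentKVAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-projection: the cached tensor scales with d_latent
        # instead of 2 * d_model per token.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-projections back to full-size keys and values.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q_proj(x)
        kv_latent = self.kv_down(x)              # this is what would be cached
        k = self.k_up(kv_latent)
        v = self.v_up(kv_latent)

        def split(t):                            # (b, s, d) -> (b, heads, s, d_head)
            return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.out(attn.transpose(1, 2).reshape(b, s, -1))


if __name__ == "__main__":
    m = LatentKVAttention()
    print(m(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```

State-space approaches such as Mamba, also mentioned in Group 3, take a different route: they replace the quadratic score matrix entirely with a recurrent state, which is not captured by this sketch.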
Falling Inference Costs for Large Models Pave the Way for AI Application Deployment (大模型推理成本降低, AI应用落地可期)
CAITONG SECURITIES·2024-06-10 10:02