Core Viewpoint
- Large language models (LLMs) have become a cornerstone of general-purpose AI systems, but growing model capability has driven sharp increases in compute and storage demands, making it challenging to achieve both high performance and high efficiency [1][2].

Group 1: Technological Advancements
- Huawei's Noah's Ark Lab has developed Pangu Ultra, a general-purpose language model with over 100 billion parameters that surpasses earlier models such as Llama 405B and Mistral Large 2 across a range of evaluations [2].
- The lab also introduced Pangu Ultra MoE, a sparse language model trained stably over long runs on more than 6,000 Ascend NPUs [2].

Group 2: Key Research Presentations
- A series of sharing sessions from May 28 to May 30 will cover breakthroughs in quantization, pruning, MoE architecture optimization, and KV cache optimization, aimed at developers and researchers working on large models [3][4].

Group 3: Specific Research Contributions
- CBQ: A post-training quantization framework that tackles the high compute and storage costs of LLMs, delivering significant performance gains under ultra-low-bit quantization [6] (see the quantization sketch below).
- SlimLLM: A structured pruning method that cuts LLM compute while preserving accuracy, showing strong results on LLaMA benchmarks [8] (see the pruning sketch below).
- KnowTrace: An iterative retrieval-augmented generation framework that strengthens multi-step reasoning by tracking knowledge triplets, outperforming existing methods on multi-hop question answering [10] (see the loop sketch below).

Group 4: Further Innovations
- Pangu Embedded: A language model that switches flexibly between fast and deep thinking, optimizing inference efficiency while maintaining high accuracy [14].
- Pangu-Light: A pruning framework that re-stabilizes and recovers model quality after aggressive structural pruning, achieving substantial compression and inference acceleration [16].
- ESA: An efficient selective-attention method that exploits the sparsity of attention matrices to cut computational overhead during inference [18] (see the attention sketch below).

Group 5: MoE Model Developments
- Pangu Pro MoE: A native MoE model with 72 billion parameters, designed to balance expert load across devices and improve inference efficiency through a range of optimization techniques [21] (see the routing sketch below).
- PreMoe: An expert-routing optimization for MoE models that dynamically loads only the experts a given task needs, improving inference efficiency by over 10% while preserving model capability [24].

Group 6: KV Optimization Techniques
- KVTuner: A hardware-friendly KV cache quantization algorithm that achieves near-lossless compression without retraining, significantly improving inference speed [26] (see the KV quantization sketch below).
- TrimR: An efficient reflection-compression algorithm that detects redundant reflections in LLM reasoning, yielding up to a 70% improvement in inference efficiency across various models [26] (see the trimming sketch below).
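To make the quantization items concrete, below is a minimal per-channel post-training quantization sketch in PyTorch. It illustrates only the generic PTQ recipe (quantize weights after training, with no retraining); it is not CBQ's actual algorithm, whose cross-block techniques this summary does not detail.

```python
# Minimal symmetric per-channel weight quantization (generic PTQ sketch,
# not Huawei's CBQ algorithm).
import torch

def quantize_per_channel(w: torch.Tensor, bits: int = 4):
    """Quantize a 2-D weight matrix with one scale per output channel."""
    qmax = 2 ** (bits - 1) - 1                                      # e.g. 7 for signed 4-bit
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(1e-8)  # per-row scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(8, 16)
q, s = quantize_per_channel(w, bits=4)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```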
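Structured pruning (SlimLLM, and Pangu-Light's more aggressive variant) removes whole output channels or heads rather than individual weights, so the pruned model stays dense and hardware-friendly. The sketch below scores channels by L2 norm as a common stand-in; the actual importance criteria of SlimLLM and Pangu-Light are not specified in this summary.

```python
# Illustrative structured pruning of a linear layer by channel importance.
import torch

def prune_linear_channels(linear: torch.nn.Linear, keep_ratio: float = 0.75) -> torch.nn.Linear:
    """Drop the output channels with the smallest L2 norm (a simple importance proxy)."""
    norms = linear.weight.norm(dim=1)                  # one score per output channel
    k = max(1, int(keep_ratio * linear.out_features))
    keep = torch.topk(norms, k).indices.sort().values  # keep channel order stable
    pruned = torch.nn.Linear(linear.in_features, k, bias=linear.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(linear.weight[keep])
        if linear.bias is not None:
            pruned.bias.copy_(linear.bias[keep])
    return pruned

layer = torch.nn.Linear(16, 8)
print(prune_linear_channels(layer, keep_ratio=0.5))  # Linear(16 -> 4)
```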
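A hedged sketch of an iterative, triplet-tracking retrieval loop in the spirit of KnowTrace. The `extract` and `answer` callables are hypothetical stand-ins for LLM-backed retrieval and answering steps; KnowTrace's actual prompting and stopping logic are not described in this summary.

```python
# Iterative RAG loop that accumulates knowledge triplets (hypothetical skeleton).
from typing import Callable, Optional

Triplet = tuple[str, str, str]  # (subject, relation, object)

def knowtrace_style_loop(question: str,
                         extract: Callable[[str, list[Triplet]], list[Triplet]],
                         answer: Callable[[str, list[Triplet]], Optional[str]],
                         max_hops: int = 4) -> str:
    """Grow a set of knowledge triplets hop by hop until the question is answerable."""
    triplets: list[Triplet] = []
    for _ in range(max_hops):
        triplets += extract(question, triplets)  # retrieve and structure new facts
        final = answer(question, triplets)       # try to answer from the traced knowledge
        if final is not None:
            return final
    return "unanswered"

# Toy two-hop demo with stubbed-out extraction and answering.
demo = knowtrace_style_loop(
    "Where was the author of 'Hamlet' born?",
    extract=lambda q, t: ([("Hamlet", "written_by", "Shakespeare")] if not t
                          else [("Shakespeare", "born_in", "Stratford-upon-Avon")]),
    answer=lambda q, t: ("Stratford-upon-Avon"
                         if any(r == "born_in" for _, r, _ in t) else None),
)
print(demo)  # Stratford-upon-Avon
```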
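ESA-style selective attention attends only to the most relevant keys per query. The sketch below naively computes full scores and then masks down to the top-k, which demonstrates the math but not the savings; a practical implementation would estimate key importance cheaply so that the full score matrix is never formed.

```python
# Top-k selective attention (illustrative; not the ESA algorithm itself).
import torch
import torch.nn.functional as F

def selective_attention(q, k, v, top_k: int = 64):
    """Attend only to the top_k highest-scoring keys for each query."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (..., q_len, k_len)
    top_k = min(top_k, scores.shape[-1])
    idx = scores.topk(top_k, dim=-1).indices               # keys each query keeps
    mask = torch.full_like(scores, float("-inf")).scatter(-1, idx, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v            # non-selected keys get 0 weight

q = torch.randn(1, 128, 64); k = torch.randn(1, 1024, 64); v = torch.randn(1, 1024, 64)
print(selective_attention(q, k, v, top_k=32).shape)  # torch.Size([1, 128, 64])
```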
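At the heart of an MoE model such as Pangu Pro MoE is a router that sends each token to a small number of experts. The minimal top-k router below shows the mechanism only: it omits the cross-device load balancing that Pangu Pro MoE is specifically designed around, and PreMoe's task-aware expert loading can be thought of as restricting the expert set such a router may choose from.

```python
# Minimal top-k MoE router (generic sketch; no load-balancing constraints).
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    """Route each token to its top_k experts; softmax over the selected logits."""
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.gate(x)                  # (tokens, n_experts)
        weights, experts = logits.topk(self.top_k, dim=-1)
        return F.softmax(weights, dim=-1), experts  # mixing weights, expert ids

router = TopKRouter(d_model=32, n_experts=8, top_k=2)
w, e = router(torch.randn(4, 32))
print(e)  # expert indices chosen for each of the 4 tokens
```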
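KV cache quantization in the spirit of KVTuner stores cached keys and values at low precision and dequantizes them on the fly, shrinking memory traffic during decoding. Below is a minimal asymmetric 8-bit, per-token-per-head sketch, not KVTuner's actual mixed-precision, hardware-aware scheme.

```python
# Illustrative 8-bit asymmetric quantization of a KV-cache tensor.
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 8):
    """Quantize a (seq, heads, dim) cache tensor with one scale/offset per token and head."""
    qmax = 2 ** bits - 1
    lo = kv.amin(dim=-1, keepdim=True)
    hi = kv.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.round((kv - lo) / scale).to(torch.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.float() * scale + lo

kv = torch.randn(128, 8, 64)  # cached keys for 128 positions, 8 heads
q, s, z = quantize_kv(kv)
print((kv - dequantize_kv(q, s, z)).abs().max().item())  # small reconstruction error
```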
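Reflection compression à la TrimR stops a reasoning chain once its reflections become redundant. The sketch below uses plain string similarity as a hypothetical stand-in for TrimR's actual redundancy detector, which this summary does not describe.

```python
# Illustrative early stopping on redundant reflection steps.
import difflib

def trim_redundant_reflections(steps: list[str], threshold: float = 0.9) -> list[str]:
    """Stop once a reflection step is nearly identical to an earlier one."""
    kept: list[str] = []
    for step in steps:
        if any(difflib.SequenceMatcher(None, step, prev).ratio() > threshold
               for prev in kept):
            break                 # the chain has started repeating itself; stop early
        kept.append(step)
    return kept

trace = [
    "Check the arithmetic in step 2.",
    "Verify the final units.",
    "Check the arithmetic in step 2 again.",  # near-duplicate -> chain is trimmed here
]
print(trim_redundant_reflections(trace, threshold=0.8))
```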
Nine Top Researchers, Three Consecutive Evenings: Unveiling the Foundational Research Behind Huawei's Pangu Models
机器之心·2025-05-26 10:59