Workflow
Breaking the MoE dilemma of "the larger the scale, the lower the efficiency": the Institute of Automation, Chinese Academy of Sciences proposes a new framework
量子位·2025-10-11 01:15

Core Viewpoint
The article discusses a research breakthrough from the Institute of Automation, Chinese Academy of Sciences, which addresses the efficiency challenges of large language models (LLMs) built on the Mixture of Experts (MoE) architecture with a dynamic "group learning" approach, significantly reducing parameter count and improving efficiency [1][12].

Summary by Sections

MoE Challenges
- MoE has been a key method for expanding the parameter count of LLMs while keeping computational cost roughly linear, but it faces three main challenges that hinder practical deployment: load imbalance, parameter redundancy, and communication overhead [2][5].
- These challenges stem from hardware limitations, and existing optimization efforts are fragmented, failing to address the underlying issues cohesively [6][8].

Research Findings
- The research team found that experts activated by semantically similar inputs exhibit structural redundancy, which provides a theoretical basis for organizing experts dynamically and structurally [10][11].
- The proposed framework achieves roughly an 80% reduction in total parameter count, a 10%-20% increase in throughput, and a significant decrease in peak memory consumption, bringing the memory footprint close to that of lightweight dense models [11][34].

Unified Framework
- The framework formalizes MoE optimization as a single mathematical problem that minimizes task loss, load imbalance, parameter redundancy, and communication cost simultaneously (an illustrative formulation is sketched after this summary) [13].
- Four core technical components realize this unified optimization: online dual similarity clustering, shared-basis and low-rank residual compression, hierarchical routing, and heterogeneous precision with dynamic memory management [13][30].

Technical Components (illustrative code sketches of each component follow this summary)
1. Online Dual Similarity Clustering: dynamically reorganizes expert groups based on structural and functional similarity, addressing load imbalance [14][16].
2. Shared Basis and Low-Rank Residual Compression: reduces redundancy by sharing a common weight matrix among similar experts and representing each expert's unique characteristics with low-rank residual matrices [19][22].
3. Hierarchical Routing: a two-stage routing strategy that first selects clusters and then experts within the selected clusters, reducing computational complexity and communication overhead [24][29].
4. Heterogeneous Precision and Dynamic Memory Management: optimizes memory usage by assigning different numerical precisions to different components and dynamically offloading inactive expert parameters from GPU memory [30][31].

Experimental Validation
- Comprehensive experiments on standard NLP benchmarks showed that the framework maintains comparable model quality while cutting total parameters by approximately 80% and peak memory consumption by nearly 50% relative to baseline models [34][36].
- Ablation studies confirmed that online clustering, low-rank compression, and hierarchical routing each contribute materially to the overall improvements [37].
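As a hedged illustration of the unified objective described under "Unified Framework", one natural reading is a weighted sum of the four competing terms. The weights λ1–λ3 and the exact form of each term are assumptions for exposition, not the paper's stated formulation.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{1}\,\mathcal{L}_{\text{load balance}}
  + \lambda_{2}\,\mathcal{L}_{\text{redundancy}}
  + \lambda_{3}\,\mathcal{L}_{\text{communication}}
```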
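The sketch below illustrates one way the "dual similarity" idea of component 1 could be realized: experts are embedded by both their flattened weights (structural view) and their activation statistics (functional view), then grouped with plain k-means. The function names, the blend weight `alpha`, and the use of k-means are assumptions for illustration, not the paper's exact online clustering rule.

```python
import numpy as np

def cluster_experts(expert_weights, activation_stats, n_clusters, alpha=0.5, n_iters=20, seed=0):
    """Group experts by a blend of structural similarity (flattened weights)
    and functional similarity (per-expert activation statistics).

    expert_weights:   (E, D_w) flattened expert parameters
    activation_stats: (E, D_a) e.g. average routing scores per input category
    alpha:            blend between the two similarity views (assumed hyperparameter)
    """
    # L2-normalize each view so Euclidean distance tracks cosine similarity.
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    features = np.concatenate(
        [alpha * normalize(expert_weights), (1 - alpha) * normalize(activation_stats)],
        axis=1,
    )

    # Plain k-means as a stand-in for the paper's online clustering procedure.
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each expert to its nearest cluster center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster becomes empty.
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels

# Example: 64 experts, weight vectors of size 512, activation stats of size 32.
labels = cluster_experts(np.random.randn(64, 512), np.random.rand(64, 32), n_clusters=8)
```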
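A minimal sketch of component 2, shared-basis plus low-rank residual compression, assuming each cluster's shared matrix is the element-wise mean of its members and each expert-specific residual is truncated by SVD; the helper names and the `rank` hyperparameter are illustrative.

```python
import numpy as np

def compress_cluster(expert_mats, rank):
    """Compress a cluster of expert weight matrices as a shared basis plus
    per-expert low-rank residuals: W_i ~ W_shared + A_i @ B_i.

    expert_mats: list of (d_out, d_in) matrices belonging to one cluster
    rank:        residual rank (assumed hyperparameter)
    """
    W_shared = np.mean(expert_mats, axis=0)           # shared component for the cluster
    factors = []
    for W in expert_mats:
        R = W - W_shared                              # expert-specific residual
        U, S, Vt = np.linalg.svd(R, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # (d_out, rank)
        B = Vt[:rank, :]                              # (rank, d_in)
        factors.append((A, B))
    return W_shared, factors

def reconstruct(W_shared, A, B):
    """Recover an approximate expert weight matrix from its compressed form."""
    return W_shared + A @ B

# Example: 4 experts of shape (256, 128), residual rank 8.
mats = [np.random.randn(256, 128) for _ in range(4)]
W_shared, factors = compress_cluster(mats, rank=8)
approx = reconstruct(W_shared, *factors[0])
```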
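A minimal sketch of component 3, two-stage hierarchical routing, assuming a top-1 cluster gate followed by a top-k expert gate restricted to the chosen cluster; the gate shapes, the top-1 cluster choice, and the renormalized combination weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_route(h, W_cluster, W_expert, cluster_of_expert, k_experts=2):
    """Two-stage routing: pick the best cluster first, then the top-k experts
    inside that cluster.

    h:                 (d,) token hidden state
    W_cluster:         (n_clusters, d) cluster-level gate
    W_expert:          (n_experts, d) expert-level gate
    cluster_of_expert: (n_experts,) cluster id of each expert
    """
    # Stage 1: choose a cluster (top-1 here for simplicity).
    cluster_scores = softmax(W_cluster @ h)
    c = int(cluster_scores.argmax())

    # Stage 2: rank only the experts that live in the chosen cluster.
    member_ids = np.where(cluster_of_expert == c)[0]
    expert_scores = softmax(W_expert[member_ids] @ h)
    order = np.argsort(expert_scores)[::-1][:k_experts]
    top = member_ids[order]
    weights = expert_scores[order]
    return c, top, weights / weights.sum()

# Example: 64 experts in 8 clusters of 8, hidden size 512.
d, E, C = 512, 64, 8
c, experts, w = hierarchical_route(
    np.random.randn(d),
    np.random.randn(C, d),
    np.random.randn(E, d),
    np.repeat(np.arange(C), E // C),
)
```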
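A minimal sketch of the memory-management half of component 4, assuming an LRU cache that keeps only recently used experts on the accelerator in half precision and parks the rest on CPU in full precision; the eviction policy, capacity, and precision split are illustrative, not the paper's exact scheme.

```python
import torch
from collections import OrderedDict

class ExpertCache:
    """Keep only recently used experts resident on the accelerator in half
    precision; park the rest on CPU in full precision (assumed policy)."""

    def __init__(self, expert_weights, capacity=8,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        # Master copies stay on CPU in full precision.
        self.cpu_store = {i: w.float().cpu() for i, w in enumerate(expert_weights)}
        self.resident = OrderedDict()   # expert id -> on-device half-precision copy
        self.capacity = capacity
        self.device = device

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)        # mark as recently used
        else:
            if len(self.resident) >= self.capacity:     # evict least recently used
                self.resident.popitem(last=False)
            self.resident[expert_id] = (
                self.cpu_store[expert_id].to(self.device, dtype=torch.float16)
            )
        return self.resident[expert_id]

# Example: 64 experts of shape (256, 128), at most 8 resident at once.
cache = ExpertCache([torch.randn(256, 128) for _ in range(64)], capacity=8)
w = cache.get(3)
```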