Core Insights
- The article introduces Mixture-of-Lookup-Experts (MoLE), a new architecture designed to make Mixture-of-Experts (MoE) models practical to deploy in resource-constrained (e.g., on-device) environments [1][28]
- MoLE addresses the high memory usage and parameter-transfer delays of traditional MoE at inference by replacing the routed experts' matrix computations with lookup tables [28]

Group 1: MoLE Architecture
- Like standard MoE, MoLE activates only a small subset of experts for each token at inference, keeping the computational load low while retaining a large overall parameter count [1]
- Because the experts' input-output mappings can be pre-computed and stored as lookup tables, expert outputs are retrieved rather than computed during inference [5][6]

Group 2: Training Phase Differences
- During training, MoLE changes the input of the routed experts from the previous layer's output to the token's shallow embedding, which is what makes the lookup tables pre-computable and storable [8] (a training-phase sketch follows below)
- MoLE activates all routed experts during training, so sparse activation is no longer needed to control the computational load [9]
- The training loss is the language-modeling loss alone, with no additional load-balancing term [10]

Group 3: Inference Phase Process
- Before inference, MoLE builds the lookup tables by passing the embedding layer's weight matrix through each routed expert, so expert outputs can be retrieved directly by token ID [15] (see the inference-phase sketch below)
- The lookup tables are kept on cheaper, lower-tier storage; during inference only the entries for the current tokens are fetched into memory and combined with the router weights [16]

Group 4: Performance and Efficiency
- MoLE's inference-time computational cost is comparable to that of dense models and traditional MoE, while the parameter-transfer overhead is reduced dramatically [17]
- Experimental results indicate that MoLE matches MoE in quality while cutting transfer costs by more than 1000x [20][28]

Group 5: Experimental Results
- Experiments on the Pile dataset show that, with the same number of training parameters and the same number of activated parameters at inference, MoLE matches MoE's performance [20]
- MoLE also shows lower inference latency than MoE, especially in batched decoding, underscoring its advantage for high-throughput workloads [28]
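The training-phase change described in Group 2 is the key enabler: routed experts read the token's shallow embedding rather than the previous layer's output, and all of them stay active, so no load-balancing loss is needed. Below is a minimal PyTorch-style sketch of such a layer based only on the article's description; all names (MoLETrainLayer, d_model, n_experts, the shared FFN path) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLETrainLayer(nn.Module):
    """Hypothetical MoLE block at training time (sketch, not the official code)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # Routed experts: small FFNs whose input is the token *embedding*.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Router scores are still computed from the hidden state.
        self.router = nn.Linear(d_model, n_experts)
        # A dense (shared) FFN path on the hidden state, as in standard MoE blocks.
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, hidden: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
        # hidden, token_emb: [batch, seq, d_model]
        gates = F.softmax(self.router(hidden), dim=-1)                        # [B, S, E]
        # All routed experts are active during training (no top-k sparsity),
        # so only the language-modeling loss is needed.
        expert_out = torch.stack([e(token_emb) for e in self.experts], dim=-2)  # [B, S, E, D]
        routed = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)               # [B, S, D]
        return hidden + self.shared(hidden) + routed
```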
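Because each routed expert's input now depends only on the token ID, its output for every vocabulary entry can be pre-computed once after training and stored as a lookup table, as Group 3 describes. The sketch below continues the hypothetical names from the previous one: the tables are meant to live on cheap storage, and at decode time only the rows for the current tokens are fetched and mixed with the router weights, replacing the expert matrix multiplications with lookups.

```python
import torch

@torch.no_grad()
def build_lookup_tables(layer: "MoLETrainLayer", embedding: torch.nn.Embedding) -> torch.Tensor:
    """Precompute expert outputs for every token ID: returns [vocab, n_experts, d_model]."""
    vocab_emb = embedding.weight                                        # [V, D]
    tables = torch.stack([e(vocab_emb) for e in layer.experts], dim=1)  # [V, E, D]
    return tables  # would be kept on lower-tier storage (e.g. flash/SSD)

@torch.no_grad()
def mole_infer_step(hidden, token_ids, tables, router, shared):
    """One inference step where expert matmuls are replaced by table lookups (sketch)."""
    gates = torch.softmax(router(hidden), dim=-1)                       # [B, S, E]
    looked_up = tables[token_ids]                                       # [B, S, E, D]  (a fetch, not a matmul)
    routed = (gates.unsqueeze(-1) * looked_up).sum(dim=-2)              # [B, S, D]
    return hidden + shared(hidden) + routed
```

Only the fetched table rows for the tokens in the current batch ever need to enter memory, which is the source of the article's claimed reduction in transfer overhead.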
ICML 2025 Spotlight | Huawei Noah's Ark Lab proposes MoLE, a new architecture for on-device large models that cuts memory-transfer cost by 1000x
机器之心·2025-05-07 00:33