Expert-as-a-Service

Unbinding MoE: New "Expert-as-a-Service" Inference Architecture Released; Ultra-Fine-Grained Scaling Slashes Costs by 37.5%
机器之心· 2025-10-13 04:21
Core Viewpoint
- The article examines the difficulties of serving large language models built on the Mixture-of-Experts (MoE) architecture and introduces Expert-as-a-Service (EaaS), an inference architecture designed to improve efficiency, scalability, and robustness [2][4][25].

Group 1: Challenges in MoE Inference
- The inference cost of large language models has risen exponentially, creating strong pressure to cut serving costs [2].
- Existing MoE frameworks scale poorly because they depend on large-scale synchronous communication, which wastes resources [2].
- MoE systems have low fault tolerance: a single node failure can force the entire service cluster to restart, interrupting service [3].
- Expert activation is dynamically sparse, so load is imbalanced: some GPU nodes are overloaded while others sit idle [4].

Group 2: Introduction of EaaS
- EaaS recasts MoE inference as a microservices-style architecture in which expert services are flexibly scheduled and scaled independently (a toy sketch of this decoupling follows this summary) [7].
- The architecture decouples the expert layers from the Attention layers, enabling asynchronous processing and better pipeline utilization [10].
- EaaS combines a dynamic batching mechanism with a custom communication library built on InfiniBand GPUDirect Async (IBGDA) to minimize communication latency and kernel-launch overhead (see the batching sketch below) [14].

Group 3: Performance and Scalability
- EaaS shows better scalability and fault tolerance than traditional MoE inference systems, sustaining throughput even while GPU nodes fail [15][20].
- Fine-grained resource allocation lets cloud service providers adjust compute dynamically to real-time load (the closing sketch illustrates the arithmetic) [18].
- EaaS saves up to 37.5% of GPU resources while matching the performance of a statically provisioned deployment [18].

Group 4: Future Potential
- EaaS shows strong potential for cloud-based large-model inference and model-as-a-service (MaaS) scenarios, matching the needs of multi-tenant environments and continuous delivery [25].
- Its modular design supports independent upgrades and maintenance, letting the system evolve with changing model scales and application demands [25].
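To make the "expert as a microservice" idea concrete, here is a minimal sketch, assuming a toy asyncio setup. Every name in it (expert_service, moe_layer, NUM_EXPERTS, TOP_K) is hypothetical, and the random gate stands in for a learned router; this illustrates the decoupling pattern, not the actual EaaS implementation.

```python
# Minimal sketch of experts deployed as independent services, assuming a
# toy asyncio setup. All names here (expert_service, moe_layer, NUM_EXPERTS,
# TOP_K) are hypothetical; the random gate stands in for a learned router.
import asyncio
import random

NUM_EXPERTS = 8  # experts deployed as separate, individually scalable services
TOP_K = 2        # experts activated per token

async def expert_service(expert_id: int, hidden: list[float]) -> list[float]:
    """Stand-in for one independently deployed expert service."""
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated RPC + compute
    return [h * (1.0 + 0.1 * expert_id) for h in hidden]  # toy FFN

async def moe_layer(hidden: list[float]) -> list[float]:
    # Pick top-k experts for this token (random here, learned in practice).
    chosen = random.sample(range(NUM_EXPERTS), TOP_K)
    # Dispatch only to the chosen services, concurrently. There is no
    # cluster-wide synchronous all-to-all: a slow or failed replica stalls
    # only the tokens routed to it.
    outputs = await asyncio.gather(*(expert_service(e, hidden) for e in chosen))
    # Combine expert outputs (uniform weights in this sketch).
    return [sum(vals) / TOP_K for vals in zip(*outputs)]

async def main():
    tokens = [[random.random() for _ in range(4)] for _ in range(3)]
    results = await asyncio.gather(*(moe_layer(t) for t in tokens))
    for r in results:
        print([round(x, 3) for x in r])

asyncio.run(main())
```

Because the Attention side awaits only the k services a token was routed to, tokens proceed independently rather than synchronizing the whole cluster at every MoE layer.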
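The dynamic-batching idea can likewise be sketched in a few lines, again under invented names (DynamicBatcher, MAX_BATCH, MAX_WAIT). The real system batches on the GPU and moves data over an IBGDA-based transport, neither of which is modeled here; only the accumulate-until-full-or-timeout policy is shown.

```python
# Sketch of expert-side dynamic batching: queue incoming requests and
# flush once the batch is full or a deadline passes, whichever comes
# first. DynamicBatcher, MAX_BATCH, and MAX_WAIT are hypothetical names.
import asyncio

MAX_BATCH = 4     # flush once this many requests have queued...
MAX_WAIT = 0.002  # ...or after 2 ms, whichever comes first

class DynamicBatcher:
    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, x: float) -> float:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = asyncio.get_running_loop().time() + MAX_WAIT
            # Keep pulling until the batch is full or the deadline passes.
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One fused launch for the whole batch (toy compute here).
            outputs = [2.0 * x for x, _ in batch]
            for (_, fut), y in zip(batch, outputs):
                fut.set_result(y)

async def main():
    batcher = DynamicBatcher()
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(float(i)) for i in range(10)))
    print(results)
    worker.cancel()

asyncio.run(main())
```

Flushing on whichever trigger fires first trades a bounded amount of queuing latency for larger, better-utilized expert batches and fewer kernel launches.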
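Finally, the arithmetic behind fine-grained savings can be illustrated with invented numbers. The per-expert loads below are chosen so the toy calculation lands on the article's 37.5% figure (10 GPUs instead of a statically provisioned 16); the actual saving was measured in the paper's experiments, not derived this way.

```python
# Toy arithmetic for fine-grained expert scaling. Loads and capacity are
# invented, picked so the result matches the article's 37.5% figure.
import math

STATIC_REPLICAS_PER_EXPERT = 2                    # peak-provisioned baseline
loads = [0.9, 0.2, 0.1, 0.8, 0.3, 0.1, 0.4, 0.2]  # hypothetical live load per expert

def replicas_needed(load: float, capacity: float = 0.5) -> int:
    # Enough replicas to cover the load, but never scale an expert to zero.
    return max(1, math.ceil(load / capacity))

static_total = STATIC_REPLICAS_PER_EXPERT * len(loads)  # 16 GPUs
dynamic_total = sum(replicas_needed(l) for l in loads)  # 10 GPUs
print(f"static: {static_total}, dynamic: {dynamic_total}, "
      f"saving: {1 - dynamic_total / static_total:.1%}")  # saving: 37.5%
```

The point of the sketch is the mechanism: because each expert is its own service, replica counts can track each expert's live load instead of provisioning every expert for the peak.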