Twice as Fast as DeepEP! Infinigence AI's FUSCO Breaks the MoE Communication Bottleneck with "In-Flight Rearrangement", Built for the Agent Boom

Core Viewpoint
- The article discusses the growing adoption of the Mixture-of-Experts (MoE) architecture in large models such as ChatGPT and Gemini, and the communication and data rearrangement costs this architecture incurs, particularly under high concurrency and long contexts [1][2].

Group 1: MoE Architecture and Challenges
- Because of their sparse structure and expert parallelism, MoE models require large amounts of global, distributed data exchange, which creates performance bottlenecks in existing communication libraries such as DeepEP [2].
- Communication and data rearrangement overhead grows with the scale of expert parallelism, making distributed data shuffling a critical performance bottleneck in both training and inference [11][14].

Group 2: Introduction of FUSCO
- FUSCO, developed by Infinigence AI in collaboration with several universities, optimizes MoE communication by fusing the communication process with data layout transformation, which eliminates redundant data rearrangement [3][4].
- Experimental results show that FUSCO improves communication performance by up to 3.84 times over NCCL and 2.01 times over DeepEP, with the advantage growing as the number of concurrent requests and the text length increase [4][44].

Group 3: FUSCO Design and Functionality
- FUSCO performs data rearrangement during the communication process itself, maximizing GPU and network bandwidth utilization while minimizing additional memory operations [16][27].
- FUSCO's communication interface is built around logical segments, so data can be read from and placed at precise locations without intermediate buffering or post-receive rearrangement [21][23].

Group 4: Performance Evaluation
- In tests on 64 GPUs, FUSCO improved communication efficiency significantly across a range of traffic configurations, reducing communication overhead and improving load balancing [44][45].
- End to end, FUSCO speeds up training and inference tasks by up to 1.39 times over NCCL and 1.19 times over DeepEP [47][48].
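
The difference between rearranging data around a transfer and rearranging it during the transfer can be illustrated with a toy, single-process sketch. This is not FUSCO's implementation or API: plain Python lists stand in for GPU buffers, and every name here (`dispatch_baseline`, `plan_segments`, `dispatch_fused`, `experts_per_rank`) is hypothetical. It assumes experts are assigned to ranks in contiguous blocks and simply counts token copies. The baseline packs tokens by destination, transfers them, then regroups them by expert; the segment-based variant computes each token's final position up front and writes it there in a single copy.

```python
# Illustrative sketch only: plain Python lists stand in for GPU buffers,
# and none of these names come from FUSCO, DeepEP, or NCCL.

def dispatch_baseline(tokens, expert_of, experts_per_rank, num_ranks):
    """Pack by destination rank, transfer, then regroup by expert on arrival."""
    copies = 0
    # Step 1: local rearrangement into one contiguous send buffer per rank.
    send = [[] for _ in range(num_ranks)]
    for tok, e in zip(tokens, expert_of):
        send[e // experts_per_rank].append((e, tok))
        copies += 1
    # Step 2: the transfer itself, modeled as one copy per token.
    recv = [list(buf) for buf in send]
    copies += sum(len(buf) for buf in send)
    # Step 3: post-receive rearrangement so each expert's tokens are contiguous.
    out = []
    for buf in recv:
        ordered = sorted(buf, key=lambda pair: pair[0])  # stable sort by expert
        out.append([tok for _, tok in ordered])
        copies += len(buf)
    return out, copies


def plan_segments(expert_of, experts_per_rank, num_ranks):
    """Compute (src_index, dst_rank, dst_offset) for every token."""
    num_experts = experts_per_rank * num_ranks
    counts = [0] * num_experts
    for e in expert_of:
        counts[e] += 1
    # Start offset of each expert inside its owning rank's receive buffer.
    starts = [0] * num_experts
    for e in range(num_experts):
        if e % experts_per_rank != 0:
            starts[e] = starts[e - 1] + counts[e - 1]
    cursor = list(starts)
    segments = []
    for i, e in enumerate(expert_of):
        segments.append((i, e // experts_per_rank, cursor[e]))
        cursor[e] += 1
    rank_sizes = [
        sum(counts[r * experts_per_rank:(r + 1) * experts_per_rank])
        for r in range(num_ranks)
    ]
    return segments, rank_sizes


def dispatch_fused(tokens, expert_of, experts_per_rank, num_ranks):
    """Copy each token once, straight from its source slot to its final slot."""
    segments, rank_sizes = plan_segments(expert_of, experts_per_rank, num_ranks)
    out = [[None] * n for n in rank_sizes]
    copies = 0
    for src, rank, offset in segments:
        out[rank][offset] = tokens[src]
        copies += 1
    return out, copies


# Six tokens routed to four experts spread over two ranks (two experts each).
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
expert_of = [3, 0, 2, 1, 0, 3]
print(dispatch_baseline(tokens, expert_of, experts_per_rank=2, num_ranks=2))
print(dispatch_fused(tokens, expert_of, experts_per_rank=2, num_ranks=2))
```

Both functions produce the same per-rank, expert-ordered layout, but the toy baseline touches each token three times while the segment-based version touches it once. Real systems differ in many ways (kernels, network transport, overlap), so the copy counts only illustrate why removing the pack and regroup passes can matter.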