Core Viewpoint
- The article reports that the DCP (Data Control Partitioning) research by Huawei's Network Technology Lab and the Hong Kong University of Science and Technology won an award at the ACM SIGCOMM 2025 conference, and it stresses the work's significance for the scalability challenges of AI cluster networks [2][4].

Group 1: Conference Overview
- ACM SIGCOMM 2025, a premier conference in computer networking, concluded in Portugal. It featured discussions of cutting-edge technology and drew participants from around the world, including major OTT providers and networking equipment manufacturers [2][4].
- Of 463 submissions, 75 papers were accepted, an acceptance rate of 16.2%. Only three papers received awards [4].

Group 2: DCP Technology
- DCP targets the scalability challenges created by the rapid growth of AI models and the rising demand for computational power, both of which require larger and more complex network configurations [6][7].
- DCP proposes a novel RDMA (Remote Direct Memory Access) transport architecture in which data may be transmitted lossily while control information is transmitted losslessly. This significantly reduces dependence on buffers and eliminates problems such as head-of-line blocking and deadlocks [8][10].

Group 3: Experimental Results
- In prototype testing, DCP improved packet recovery efficiency by 1.6× to 72× compared with a Mellanox RNIC and reduced completion time for AI workloads by 42% [17].
- In simulation, DCP reduced job completion time (JCT) by 38% and 45% in AI traffic scenarios compared with existing solutions, and reduced tail completion time by 95% in long-distance scenarios [20][22].

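The split described under Group 2, lossy data with lossless control, can be illustrated with a toy model. The Python sketch below is not the DCP design from the paper. It assumes independent random payload drops, assumes that every drop produces a small control notice that always reaches the sender, and uses hypothetical names (`transmit`, `reliable_transfer`). It only shows why reliable loss information lets a sender retransmit exactly the missing packets instead of relying on large lossless buffers.

```python
import random


def transmit(seqs, drop_prob, rng):
    """Send data packets over a lossy path (toy model, not the DCP hardware).

    Payloads may be dropped. Assumption: each drop yields a small control
    notice carrying the sequence number, and that notice is never lost.
    """
    delivered, loss_notices = [], []
    for seq in seqs:
        if rng.random() < drop_prob:
            loss_notices.append(seq)  # lossless control information
        else:
            delivered.append(seq)     # payload arrived
    return delivered, loss_notices


def reliable_transfer(num_packets, drop_prob, seed=0):
    """Deliver all packets by retransmitting only those reported as lost."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    received = set()
    pending = list(range(num_packets))
    rounds = 0
    packets_sent = 0
    while pending:
        rounds += 1
        packets_sent += len(pending)
        delivered, loss_notices = transmit(pending, drop_prob, rng)
        received.update(delivered)
        pending = loss_notices  # selective retransmission, no timeout guessing
    return received, rounds, packets_sent


if __name__ == "__main__":
    received, rounds, sent = reliable_transfer(1000, 0.05)
    print(f"received={len(received)} rounds={rounds} packets_sent={sent}")
```

In this model the overhead is limited to the retransmitted packets themselves, because the sender never has to infer loss from timeouts or resend packets that already arrived.
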
Group 4: Future Directions
- Huawei's Network Technology Lab is also researching AI-Native Transport (ANT), which incorporates features from DCP to enhance transmission capabilities for AI computing networks, focusing on high throughput, efficiency, and scalability [22].

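Award at a top networking conference! Huawei proposes an end-host and network co-designed RDMA transport architecture to solve the network scalability problem of large-scale AI clusters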
机器之心·2025-09-16 11:57