DeepSeek Thanks Tencent's Technical Team: The DeepEP Optimization Is a "Huge Speedup" Code Contribution
Xin Lang Ke Ji·2025-05-07 11:12

Core Insights
- Tencent's technical team has optimized the DeepEP communication framework, achieving significant performance gains across network environments: a 100% performance increase on RoCE networks and a 30% increase on InfiniBand (IB) networks, improving the efficiency of large AI model training [1][2]

Group 1: Technical Enhancements
- The optimization replaced IBRC with IBGDA and assigned a distinct Queue Pair (QP) to each channel so that channels transmit data in parallel, improving both the robustness and the communication performance of the normal kernels [1]
- The optimized framework reached an algorithm bandwidth of 58 GB/s in RDMA scenarios, with the physical bandwidth calculated at 43.5 GB/s [1]

Group 2: Industry Impact
- Since DeepSeek open-sourced its stack, including DeepEP, in February, the framework has delivered a 300% increase in communication efficiency, reducing the dependency of MoE-architecture large models on NVIDIA's NCCL [2]
- The optimizations have been successfully applied in Tencent's Hunyuan large-model projects, demonstrating excellent versatility in high-performance environments built on Tencent's Xingmai network and H20 servers [2]
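The ratio between the reported algorithm bandwidth (58 GB/s) and physical bandwidth (43.5 GB/s) is exactly 0.75, which is consistent with an all-to-all-style dispatch in which only a fraction of the data actually crosses the network fabric (local traffic does not). The sketch below illustrates this accounting; the 3/4 remote fraction is an assumption inferred from the two published numbers, not something stated in the article:

```python
def physical_bandwidth(algo_bw_gbps: float, remote_fraction: float) -> float:
    """Hypothetical relation between the two bandwidth figures.

    'Algorithm bandwidth' counts all data the collective logically moves
    (including rank-local copies), while 'physical bandwidth' counts only
    the bytes that actually traverse the RDMA fabric.
    """
    return algo_bw_gbps * remote_fraction

# 43.5 / 58 = 0.75, so a remote fraction of 3/4 reproduces the article's figures.
print(physical_bandwidth(58.0, 0.75))  # 43.5
```

Under this reading, the 58 GB/s figure describes the kernel's effective throughput from the application's point of view, while 43.5 GB/s is what the NICs themselves sustain.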
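The "distinct QP per channel" change means each communication channel posts its RDMA work to its own queue pair rather than sharing one, so one channel's transfers are never serialized behind another's. A minimal conceptual sketch of that mapping (all names here are illustrative stand-ins, not DeepEP's or IBGDA's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class QueuePair:
    """Stand-in for an RDMA queue pair: an ordered send queue to a peer.

    Work posted to a single QP completes in order, so sharing one QP across
    channels creates head-of-line blocking between them.
    """
    qp_id: int
    pending: list = field(default_factory=list)

    def post_send(self, payload: str) -> None:
        self.pending.append(payload)

# One QP per channel: channel i's sends never queue behind channel j's,
# so the channels can drive the fabric in parallel.
num_channels = 4
qps = [QueuePair(qp_id=i) for i in range(num_channels)]

def send_on_channel(channel: int, payload: str) -> None:
    qps[channel].post_send(payload)

send_on_channel(0, "chunk-a")
send_on_channel(1, "chunk-b")
print([qp.pending for qp in qps])  # [['chunk-a'], ['chunk-b'], [], []]
```

With IBGDA the GPU posts this work directly to the NIC, which is why removing the shared-QP bottleneck translates into the end-to-end gains the article reports.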