Core Insights
- The article discusses the rapid growth of AI computing clusters toward tens of thousands and hundreds of thousands of nodes, and stresses that high-speed interconnect networks are critical to making that computing power usable efficiently [1][7]
- It highlights the launch of scaleFabric by Zhongke Shuguang (Sugon), described as the first native lossless RDMA high-speed network in China, which addresses key industry challenges and provides a stable network foundation for large-scale clusters [1][7]

Performance and Technical Strength
- scaleFabric matches mainstream international products on bandwidth and latency and reaches a port density of 80 ports at 400G, 25% higher than comparable products, which supports the scalability of the scaleX ten-thousand-node supercluster [3][9]
- The technology uses credit-based flow control and link-layer retransmission, the same mechanisms as InfiniBand (IB), to deliver truly lossless transmission, making it better suited than RoCE networks to large-scale intelligent computing scenarios [3][9]

Ecosystem Compatibility and Expansion
- scaleFabric offers native RDMA verbs interfaces and is fully compatible with the existing IB application ecosystem, so applications such as parallel computing and large-model training can migrate without code modifications [4][10]
- It goes beyond the five-thousand-node limitation of the IB protocol, supporting more than ten thousand nodes in a single subnet and enabling million-node cluster deployments through multi-rail technology, which meets the exponential growth in demand for AI computing power [4][10]

Innovation and Cost Efficiency
- In response to the bottleneck in high-end SerDes IP, Shuguang has developed its own 112G PAM4 high-speed SerDes IP to ensure signal reliability in complex environments [6][12]
- The company has also built a millisecond-level routing recovery technology for link faults whose recovery time stays the same regardless of network scale, raising cluster availability to 99.99% [6][12]
- The networking cost of scaleFabric is approximately 30% lower than that of IB, which removes the high-cost constraint of high-end networks; its launch fills a technological gap in China's native RDMA networks and promotes the domestic replacement of IB networks [6][12]
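
To illustrate why the credit-based flow control mentioned above yields lossless transmission, here is a minimal toy sketch in Python. It is not scaleFabric's or InfiniBand's actual implementation; the class `CreditedLink`, the function `simulate`, and all parameters are hypothetical names chosen for illustration. The idea it shows is general: the sender transmits only while it holds a credit for a free receiver buffer slot, so congestion turns into waiting rather than packet drops.

```python
from collections import deque


class CreditedLink:
    """Toy model of one link with credit-based flow control (illustrative only)."""

    def __init__(self, receiver_buffer_slots: int):
        # The receiver advertises one credit per free buffer slot.
        self.credits = receiver_buffer_slots
        self.receiver_buffer = deque()
        self.max_occupancy = 0  # highest buffer fill level observed

    def try_send(self, packet) -> bool:
        """Send only if a credit is available; otherwise the sender waits."""
        if self.credits == 0:
            return False  # back-pressure: hold the packet, do not drop it
        self.credits -= 1
        self.receiver_buffer.append(packet)
        self.max_occupancy = max(self.max_occupancy, len(self.receiver_buffer))
        return True

    def receiver_drain(self, n: int = 1) -> int:
        """Receiver consumes up to n packets and returns that many credits."""
        freed = 0
        while freed < n and self.receiver_buffer:
            self.receiver_buffer.popleft()
            freed += 1
        self.credits += freed
        return freed


def simulate(num_packets: int, buffer_slots: int, drain_every: int) -> dict:
    """Push num_packets through the link; the receiver drains one packet
    every `drain_every` ticks (assumed slower than the sender)."""
    link = CreditedLink(buffer_slots)
    pending = deque(range(num_packets))
    delivered = 0
    stalls = 0
    tick = 0
    while pending or link.receiver_buffer:
        tick += 1
        if pending:
            if link.try_send(pending[0]):
                pending.popleft()
            else:
                stalls += 1  # sender waited for a credit
        if tick % drain_every == 0:
            delivered += link.receiver_drain(1)
    return {
        "delivered": delivered,
        "stalls": stalls,
        "max_occupancy": link.max_occupancy,
    }


if __name__ == "__main__":
    print(simulate(num_packets=100, buffer_slots=4, drain_every=3))
```

In this sketch a slow receiver causes the sender to stall, but the receiver buffer never exceeds its advertised size and every packet is eventually delivered. A scheme without credits would instead have to drop packets on overflow and recover them end to end.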
Feature | The "Neural Hub" of Ten-Thousand-Card Clusters