Countering Nvidia's Second "Chokehold": China Moves to Close a Critical Gap
Guan Cha Zhe Wang·2026-03-16 04:56

Core Insights

- The AI era faces a critical bottleneck: a shortage of high-speed interconnect networks, which are essential for the efficient operation of large-scale computing clusters [1][10]
- Domestic companies are making strides in developing their own computing chips, but the core technology for high-speed interconnects remains dominated by Nvidia, posing a significant risk to the industry [1][10]

Group 1: Industry Challenges

- As computing clusters scale from thousands to tens of thousands of nodes, the bottleneck is shifting from the GPU itself to the high-speed interconnect [1][4]
- Communication can account for 30-50% of total distributed-training time, meaning a significant share of the investment in computing power is spent moving data rather than computing [4][5]
- Demand for high-speed networking has grown 10 to 20 times, as servers now require multiple network cards to support GPU-centric architectures [6]

Group 2: Technological Landscape

- Two main technology routes dominate high-speed networking: RoCE and InfiniBand, with InfiniBand the preferred choice for high-performance computing due to its superior performance metrics [7][10]
- InfiniBand networks are used in approximately 60% of the world's high-performance computing systems and are almost standard in the largest AI training clusters [10]

Group 3: Domestic Developments

- In response to the dominance of foreign technology, companies such as Zhongke Shuguang (Sugon) have developed their own high-speed network solutions, including the fully self-developed scaleFabric [2][11]
- The decision to pursue a fully self-developed InfiniBand system was driven by the inability of available commercial IPs and open-source solutions to meet the performance and reliability requirements of large-scale clusters [12]
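The communication-share and network-demand figures above can be made concrete with a back-of-the-envelope sketch. The 30-50% communication share and the 10-20x growth in network demand come from the article; every other number below (GPU count per server, NIC speeds) is a hypothetical assumption chosen only to land inside the cited ranges.

```python
# Illustrative sketch of the article's two quantitative claims.
# Assumption throughout: worst case, communication does not overlap compute.

def compute_utilization(comm_share: float) -> float:
    """Fraction of wall-clock time left for computation when a given
    fraction of each training step is spent on communication."""
    return 1.0 - comm_share

# A 30-50% communication share caps GPU utilization at 50-70%:
for share in (0.30, 0.50):
    print(f"comm share {share:.0%} -> compute utilization "
          f"{compute_utilization(share):.0%}")

# GPU-centric servers pair each GPU with its own NIC, instead of the
# one or two NICs of a classic CPU server. Hypothetical numbers:
legacy_nics_per_server = 1    # assumption: traditional CPU server
legacy_nic_gbps = 100         # assumption
gpus_per_server = 8           # assumption: typical AI training server
ai_nic_gbps = 200             # assumption: one 200G NIC per GPU

legacy_bw = legacy_nics_per_server * legacy_nic_gbps   # 100 Gb/s
ai_bw = gpus_per_server * ai_nic_gbps                  # 1600 Gb/s
growth = ai_bw // legacy_bw
print(f"per-server network demand grows ~{growth}x")   # within the cited 10-20x
```

Under these assumptions the per-server bandwidth demand grows about 16x, consistent with the article's 10-20x range; different NIC speeds or GPU counts would shift the exact multiple.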