Supernode Interconnect Technology Goes Live: Domestic Ten-Thousand-Card Ultra-Cluster Makes Its First Real-Hardware Debut

Core Insights
- The article discusses the emergence of high-performance computing clusters, specifically the scaleX ultra-cluster developed by Sugon, which integrates 16 scaleX640 supernodes (16 × 640 = 10,240 cards) to deliver over 5 EFLOPS of computing power, marking a significant advance in domestic AI computing infrastructure [4][5].

Group 1: Ultra-Cluster Development
- The scaleX640, the building block of the ultra-cluster, is described as the world's first single-cabinet 640-card supernode. It uses high-density blade servers and immersion cooling, resulting in a reported 20-fold increase in computing density and a PUE value as low as 1.04 [1][4].
- The scaleX ultra-cluster represents a shift from traditional scattered server deployments to a more integrated and efficient computing unit, showing that domestic computing infrastructure has moved from conceptual designs to tangible products [1][5].

Group 2: Demand for Computing Power
- As mainstream AI models grow from hundreds of billions to trillions of parameters, demand for computing power has surged, making EFLOPS-level, ten-thousand-card high-performance clusters a standard configuration for large models [2][3].
- The supernode architecture is becoming a preferred choice for new ten-thousand-card clusters because of its density and performance advantages [3].

Group 3: Networking and Scalability
- The scaleX ultra-cluster employs the scaleFabric high-speed network, which uses the first domestic 400G-class InfiniBand RDMA network cards, achieving 400 Gb/s bandwidth and communication latency under 1 microsecond, and allowing the cluster to scale to over 100,000 cards [7].
- The architecture supports both Scale-up (vertical expansion) and Scale-out (horizontal expansion), addressing traditional communication bottlenecks and enabling the construction of large-scale intelligent computing clusters [6].

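The headline figures above can be cross-checked with some simple arithmetic. The Python sketch below is only a back-of-envelope illustration: it treats "over 5 EFLOPS" as a lower bound and "under 1 microsecond" as an upper bound, and it assumes a simple latency-plus-serialization cost model for a single uncongested link. The article does not state the numeric precision behind the EFLOPS figure, so the per-card value is an implied average rather than a chip specification. All function and constant names are illustrative.

```python
# Back-of-envelope figures derived from numbers quoted in the summary.
SUPERNODES = 16              # scaleX640 supernodes per ultra-cluster (from the text)
CARDS_PER_SUPERNODE = 640    # cards per single-cabinet supernode (from the text)
CLUSTER_FLOPS = 5e18         # "over 5 EFLOPS", treated here as a lower bound
LINK_BITS_PER_S = 400e9      # 400 Gb/s per network card (from the text)
LINK_LATENCY_S = 1e-6        # "under 1 microsecond", treated here as an upper bound


def total_cards(supernodes: int = SUPERNODES,
                cards_per_supernode: int = CARDS_PER_SUPERNODE) -> int:
    """Number of accelerator cards in the whole cluster."""
    return supernodes * cards_per_supernode


def flops_per_card(cluster_flops: float = CLUSTER_FLOPS) -> float:
    """Average throughput per card implied by the cluster-level figure."""
    return cluster_flops / total_cards()


def transfer_time_s(message_bytes: int,
                    bandwidth_bits_per_s: float = LINK_BITS_PER_S,
                    latency_s: float = LINK_LATENCY_S) -> float:
    """Latency + size/bandwidth estimate for one message on one link.

    Assumption: full line rate, no congestion, no protocol overhead.
    """
    return latency_s + (message_bytes * 8) / bandwidth_bits_per_s


if __name__ == "__main__":
    print(f"cards in cluster: {total_cards()}")
    print(f"implied average per card: {flops_per_card() / 1e12:.1f} TFLOPS")
    print(f"1 MB transfer estimate: {transfer_time_s(1_000_000) * 1e6:.1f} microseconds")
```

Under this model the fixed latency matters most for small messages, while the 400 Gb/s bandwidth term dominates once messages reach the megabyte range.
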
Group 4: Challenges and Considerations
- The deployment of supernodes introduces systemic challenges, including heat dissipation from numerous chips, stability issues from mixed optical and copper interconnects, and reliability concerns from long-term operation of multiple components [8].
- As the scale of intelligent computing clusters expands, key challenges include ensuring scalability, reliability, and energy efficiency, necessitating breakthroughs in power supply technology and advanced software management for sustainable operation [8].
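
To put the energy-efficiency challenge in concrete terms, the sketch below shows what a PUE value such as the 1.04 cited for the supernode implies. PUE is the ratio of total facility power to IT equipment power, so a PUE of 1.04 means roughly 4% of additional power goes to cooling and other overhead. The 1,000 kW IT load and the 1.5 comparison PUE are hypothetical values chosen for illustration and do not come from the article.

```python
def facility_power_kw(it_power_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_kw * pue


def overhead_power_kw(it_power_kw: float, pue: float) -> float:
    """Power spent on everything that is not IT load (cooling, conversion losses)."""
    return facility_power_kw(it_power_kw, pue) - it_power_kw


if __name__ == "__main__":
    IT_LOAD_KW = 1000.0    # hypothetical IT load, not from the article
    BASELINE_PUE = 1.5     # hypothetical comparison point, not from the article
    cases = [("reported best case", 1.04), ("hypothetical baseline", BASELINE_PUE)]
    for label, pue in cases:
        overhead = overhead_power_kw(IT_LOAD_KW, pue)
        print(f"{label}: PUE {pue:.2f} -> about {overhead:.0f} kW of overhead")
```
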