Comparison of AI Server Scale Up Solutions in China and Abroad
傅里叶的猫· 2025-08-18 15:04
Core Viewpoint
The article compares the Scale Up solutions that major domestic and international companies use in AI data centers, and highlights the role of high-performance interconnect technologies and architectures in raising computational capability.

Group 1: Scale Up Architecture
- Scale Up raises computational power by increasing the density of an individual server: more high-performance GPUs, larger memory, and faster storage are integrated to form "super nodes" [1]
- It is characterized by high bandwidth and low latency, which makes it suitable for AI inference and training tasks [1]
- Scale Up is often combined with Scale Out to balance single-machine performance against overall scalability [1]

Group 2: NVIDIA's NVLink Technology
- NVIDIA uses its self-developed NVLink high-speed interconnect in its Scale Up architecture to achieve high-bandwidth, low-latency GPU-to-GPU links [3]
- The GB200 NVL72 cabinet integrates 18 compute trays and 9 NVLink switch trays, connected by copper cables [3]
- Each compute tray contains 2 Grace CPUs and 4 Blackwell GPUs, and the NVSwitch trays are equipped with NVSwitch5 ASICs [3]

Group 3: Future Developments
- NVIDIA's future Rubin architecture will upgrade to NVLink 6.0 and 7.0, significantly increasing bandwidth density and reducing latency [5]
- These improvements aim to support the training of ultra-large AI models with billions or trillions of parameters, addressing growing computational demands [5]

Group 4: Other Companies' Solutions
- AMD's UALink aims to provide an open interconnect standard for scalable accelerator connections, supporting up to 1024 accelerators with low latency [16]
- AWS uses the NeuronLink protocol for Scale Up, and extends its interconnect capability with additional switch trays [21]
- Meta uses Broadcom's SUE solution for Scale Up, and plans to consider NVIDIA's NVLink Fusion in future architectures [24]

Group 5: Huawei's Approach
- Huawei adopts a multi-cabinet, all-optical interconnect solution in its Cloud Matrix system, deploying Ascend 910C chips across multiple racks [29]
- The Cloud Matrix 384 configuration includes 6912 optical modules, which serve both the Scale Up and Scale Out networks [29]
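
The tray and module counts quoted in the summary can be cross-checked with simple arithmetic. The sketch below is illustrative only: the `RackSpec` class and its field names are hypothetical, and the Cloud Matrix calculation assumes that "384" is the number of Ascend 910C chips, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    """Tray counts for one Scale Up cabinet (hypothetical helper type)."""
    compute_trays: int
    switch_trays: int
    cpus_per_tray: int
    gpus_per_tray: int

    @property
    def total_gpus(self) -> int:
        return self.compute_trays * self.gpus_per_tray

    @property
    def total_cpus(self) -> int:
        return self.compute_trays * self.cpus_per_tray


# Figures as quoted in the summary for the GB200 NVL72 cabinet.
gb200_nvl72 = RackSpec(compute_trays=18, switch_trays=9,
                       cpus_per_tray=2, gpus_per_tray=4)

# Cloud Matrix 384: ASSUMPTION that "384" is the Ascend 910C chip count.
cloud_matrix_chips = 384
cloud_matrix_optical_modules = 6912
modules_per_chip, remainder = divmod(cloud_matrix_optical_modules,
                                     cloud_matrix_chips)

print(gb200_nvl72.total_gpus)       # 72
print(gb200_nvl72.total_cpus)       # 36
print(modules_per_chip, remainder)  # 18 0
```

The 72 GPUs match the "NVL72" name. Under the stated assumption, the Cloud Matrix figure works out to an average of 18 optical modules per chip; because the modules serve both the Scale Up and Scale Out networks, this is an overall ratio, not a per-link count.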