SUE
Back to the Technology: Scale Up's Fragmented Ecosystem
傅里叶的猫· 2025-10-18 16:01
Core Viewpoint
- The article compares Scale Up solutions for AI servers, focusing on the UALink technology promoted by Marvell and the Scale Up approaches currently mainstream in the international market [1][3].

Comparison of Scale Up Solutions
- Scale Up refers to the high-speed communication network between GPUs within the same server or rack, letting them operate collaboratively as one large supercomputer [3].
- The Scale Up network market is projected to reach $4 billion in 2024 and, at a compound annual growth rate (CAGR) of 34%, to grow to roughly $17 billion by 2029 (sanity-checked in the sketch after this summary) [5][7].

Key Players and Technologies
- NVIDIA's NVLink is currently dominant in the Scale Up market, enabling GPU interconnection and communication within server configurations [11][12].
- AMD is developing UALink, which builds on its Infinity Fabric technology, and aims to transition to a full UALink solution once native switches become available [12][17].
- Google uses inter-chip interconnect (ICI) technology for TPU Scale Up, while Amazon employs NeuronLink for its Trainium chips [13][14].

Challenges in the Ecosystem
- The current Scale Up ecosystem is fragmented: a mix of proprietary technologies creates compatibility problems across manufacturers [10][22].
- Domestic GPU manufacturers struggle to develop their own interconnect protocols because of system complexity and resource constraints [9].

Future Trends
- As the market matures, proprietary Scale Up networks are expected to give way to open solutions such as UALink and SUE, which should gain traction around 2027-2028 [22].
- The choice between copper and optical connections for Scale Up networks is driven by cost and performance; copper is currently preferred for short distances [20][21].
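As a quick sanity check on the market projection above (my arithmetic, not the article's; it assumes simple annual compounding from the stated 2024 base), the two figures are internally consistent:

```python
# Sanity check (not from the article): does a $4B 2024 base at 34% CAGR
# reach roughly $17B by 2029? Assumes simple annual compounding.
base_2024_usd_b = 4.0     # projected 2024 market size, $B
cagr = 0.34               # stated compound annual growth rate
years = 2029 - 2024       # 5 compounding periods

projection_2029 = base_2024_usd_b * (1 + cagr) ** years
print(f"2029 projection: ${projection_2029:.1f}B")  # -> ~$17.3B, consistent with the stated $17B
```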
Kaiyuan Securities: Commercialization of Domestic Scale-up/Scale-out Hardware Is Accelerating; Focus on Investment Opportunities in the AI Connectivity Industry
智通财经网· 2025-10-15 07:35
Core Viewpoint
- Traditional computing architectures cannot meet the needs of efficient, low-energy, large-scale collaborative AI training, driving the trend toward supernodes and significantly boosting demand for Scale Up-related hardware [1][3]

Group 1: AI Hardware Capabilities
- AI hardware capability is driven by three factors: computing power (determined by GPU performance and quantity), storage capacity (high-bandwidth memory and cache located close to the GPUs), and communication capacity (spanning the Scale Up, Scale Out, and Scale Across scenarios) [1][2]

Group 2: Market Trends and Projections
- Driven by supernode demand, the market for Scale Up switching chips is expected to reach nearly $18 billion by 2030, a CAGR of roughly 28% from 2022 to 2030 (the implied 2022 base is worked out after this summary) [3]
- Building large-scale AI clusters requires extensive interconnection between nodes, increasing demand for Scale Out hardware, while power-resource limits within a single region will promote adoption of Scale Across solutions [3]

Group 3: Communication Protocols
- Scale Up and Scale Out require different communication protocols; the major vendors develop proprietary protocols while third parties and smaller firms promote public ones [4]
- Notable proprietary Scale Up protocols include NVIDIA's NVLink and AMD's Infinity Fabric; public options include Broadcom's SUE and PCIe [4]

Group 4: Domestic Hardware Development
- Domestic production of communication hardware is currently very low, leaving significant room for domestic substitution in the market [5]
- Companies such as Shudao Technology and Shengke Communication are moving toward commercialization of their products, signaling growing domestic market potential [5]

Group 5: Investment Opportunities
- PCIe hardware beneficiaries include Wantong Development and Lanke Technology, while Ethernet hardware beneficiaries include Shengke Communication and ZTE [6]
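Those two growth figures pin down the report's implied starting point. A minimal back-solve (my derivation, not stated in the report; it assumes simple annual compounding over the eight years 2022-2030):

```python
# Back-solving the implied 2022 base (not given in the summary) from the
# ~$18B 2030 projection and ~28% CAGR, assuming simple annual compounding.
target_2030_usd_b = 18.0
cagr = 0.28
years = 2030 - 2022        # 8 compounding periods

implied_base_2022 = target_2030_usd_b / (1 + cagr) ** years
print(f"Implied 2022 base: ${implied_base_2022:.1f}B")  # -> ~$2.5B
```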
Comparison of Domestic and International AI Server Scale Up Solutions
傅里叶的猫· 2025-08-18 15:04
Core Viewpoint
- The article compares the Scale Up solutions of major domestic and international companies in AI data centers, highlighting the importance of high-performance interconnect technologies and architectures for raising computational capability.

Group 1: Scale Up Architecture
- Scale Up raises computational power by increasing the density of individual servers, integrating more high-performance GPUs, larger memory, and faster storage to build "supernodes" [1]
- It is characterized by high bandwidth and low latency, making it well suited to AI inference and training workloads [1]
- Scale Up is often combined with Scale Out to balance single-machine performance against overall scalability [1]

Group 2: NVIDIA's NVLink Technology
- NVIDIA's Scale Up architecture uses its self-developed NVLink high-speed interconnect, achieving high-bandwidth, low-latency GPU-to-GPU links [3]
- The GB200 NVL72 cabinet integrates 18 compute trays and 9 NVLink switch trays, interconnected efficiently over copper cables (the tray math is tallied in the sketch after this summary) [3]
- Each compute tray holds 2 Grace CPUs and 4 Blackwell GPUs, and each NVSwitch tray carries NVSwitch5 ASICs [3]

Group 3: Future Developments
- NVIDIA's upcoming Rubin architecture will move to NVLink 6.0 and 7.0, significantly increasing bandwidth density and reducing latency [5]
- These improvements target the training of ultra-large AI models with billions or trillions of parameters, addressing growing computational demands [5]

Group 4: Other Companies' Solutions
- AMD's UALink aims to provide an open interconnect standard for scalable accelerator connections, supporting up to 1024 accelerators at low latency [16]
- AWS uses its NeuronLink protocol for Scale Up, extending interconnect capability through additional switch trays [21]
- Meta adopts Broadcom's SUE (Scale-Up Ethernet) solution for its Scale Up network, and plans to consider NVIDIA's NVLink Fusion in future architectures [24]

Group 5: Huawei's Approach
- Huawei adopts a multi-cabinet, all-optical interconnect solution in its CloudMatrix system, deploying Ascend 910C chips across multiple racks [29]
- The CloudMatrix 384 configuration includes 6912 optical modules, serving both the Scale Up and Scale Out networks [29]
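The rack-level figures above tally cleanly. A minimal sketch (the GB200 NVL72 counts come from the summary; the per-chip optical-module figure for CloudMatrix 384 is my derivation, assuming the "384" in the name is the Ascend 910C chip count, which the summary does not state):

```python
# Tallying the GB200 NVL72 rack described above (figures from the summary).
compute_trays = 18
nvlink_switch_trays = 9
gpus_per_tray = 4          # Blackwell GPUs per compute tray
cpus_per_tray = 2          # Grace CPUs per compute tray

total_gpus = compute_trays * gpus_per_tray   # 72 -> the "72" in NVL72
total_cpus = compute_trays * cpus_per_tray   # 36 Grace CPUs per rack
print(f"GB200 NVL72: {total_gpus} GPUs, {total_cpus} CPUs, {nvlink_switch_trays} switch trays")

# CloudMatrix 384: 6912 optical modules system-wide (from the summary).
# Assumption (not stated above): the "384" denotes 384 Ascend 910C chips.
optical_modules = 6912
ascend_chips = 384
print(f"CloudMatrix 384: {optical_modules / ascend_chips:.0f} optical modules per chip")  # -> 18
```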