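New High-Speed GPU Interconnect Design Cuts Costs and Boosts Efficiency for Large-Model Training: Peking University, StepFun, and Lightelligence Propose a Next-Generation High-Bandwidth Domain Architecture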
QbitAI (量子位) · 2025-05-19 04:37
Core Viewpoint
- The article discusses the limitations of existing High-Bandwidth Domain (HBD) architectures for large model training and introduces InfiniteHBD, a new architecture designed to address those limitations [1][3][4].

Group 1: Limitations of Existing HBD Architectures
- Current HBD architectures face fundamental limitations in scalability, cost, and fault tolerance: switch-centric designs are expensive and hard to scale, GPU-centric designs suffer from fault propagation, and hybrid designs such as TPUv4 remain less than ideal in cost and fault tolerance [3][10][19].
- Existing architectures fall into three categories (switch-centric, GPU-centric, and hybrid), each with its own limitations in scalability, interconnect cost, fault explosion radius, and fragmentation [7][22].

Group 2: Introduction of InfiniteHBD
- InfiniteHBD is the proposed solution. It embeds Optical Circuit Switching (OCS) technology in the optical-electrical conversion modules to achieve low-cost scalability and node-level fault isolation [4][29].
- InfiniteHBD costs only 31% as much as NVL-72, wastes almost no GPUs, and improves Model FLOPs Utilization (MFU) by up to 3.37 times compared with traditional architectures [4][48][63].

Group 3: Key Innovations of InfiniteHBD
- InfiniteHBD incorporates three key innovations: OCS-based optical-electrical conversion modules (OCSTrx), a reconfigurable K-Hop Ring topology, and an HBD-DCN orchestration algorithm [30][32][44].
- OCSTrx supports dynamic point-to-multipoint connections and keeps resource fragmentation low, which improves scalability and cost-effectiveness [29][35].

Group 4: Performance Evaluation
- The evaluation shows that InfiniteHBD meets the dual demands of computational efficiency and communication performance for large-scale language model training [65].
- The orchestration algorithm optimizes communication efficiency, significantly reducing cross-Top-of-Rack (ToR) traffic while remaining resilient to node failures [68][70].

Group 5: Cost and Energy Efficiency
- InfiniteHBD shows significant advantages in interconnect cost and energy consumption: its interconnect cost is 31% of NVL-72's and its energy consumption is 75% of NVL-72's, a low level comparable to TPUv4 [74].
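
The summary names a reconfigurable K-Hop Ring topology and node-level fault isolation but does not define either. The sketch below is one possible reading, not the authors' design: it assumes each node has switchable links to the next K nodes along a ring, so a ring can be re-formed around failed nodes as long as no run of consecutive failures is K or longer. The function names `k_hop_neighbors` and `build_ring` are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def k_hop_neighbors(node: int, n: int, k: int) -> List[int]:
    """Forward neighbors of `node` in a ring of n nodes.

    Assumption (not from the source): a node can switch its link to any
    of the next k nodes along the ring.
    """
    return [(node + d) % n for d in range(1, k + 1)]


def build_ring(n: int, k: int, failed: Set[int]) -> List[int]:
    """Return the healthy nodes in ring order, skipping failed ones.

    Raises ValueError if some gap between consecutive healthy nodes is
    wider than k hops, i.e. the failure cannot be bypassed.
    """
    healthy = [i for i in range(n) if i not in failed]
    if len(healthy) < 2:
        raise ValueError("need at least two healthy nodes to form a ring")
    # Pair each healthy node with the next healthy node, wrapping around.
    for a, b in zip(healthy, healthy[1:] + healthy[:1]):
        if b not in k_hop_neighbors(a, n, k):
            raise ValueError(f"gap between node {a} and node {b} exceeds {k} hops")
    return healthy


if __name__ == "__main__":
    # One failed node is bypassed using a 2-hop link from node 2 to node 4.
    print(build_ring(n=8, k=2, failed={3}))
```

Under this assumption a single failure removes only the failed node from the ring, while two adjacent failures with K = 2 would make `build_ring` raise, which is the trade-off a larger K is meant to relax.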