InfiniteHBD

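How does network infrastructure support large-model applications? Five research directions from Liu Guyue's group at Peking University, with related papers accepted at ACM SIGCOMM 2025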
AI前线· 2025-09-23 06:37
Core Insights
- The article discusses the urgent need for advanced network infrastructure to support large language model training and data center security, given rapid advances in intelligent computing and future networks [2][3].

Group 1: Research Achievements
- The research group led by Assistant Professor Liu Guyue at Peking University had five papers accepted at ACM SIGCOMM 2025, the most of any university research group this year [2][3].
- The acceptance rate for SIGCOMM 2025 was only 16.1%: 74 of 461 submissions were accepted [2].

Group 2: Key Research Papers
- **InfiniteHBD**: Proposes a transceiver-centric high-bandwidth domain architecture that overcomes scalability and fault tolerance issues in large model training, cutting cost to 31% of NVL-72's with nearly zero GPU waste [6][8].
- **DNSLogzip**: Introduces a novel approach for fast, high-ratio compression of DNS logs, reducing storage costs by approximately two-thirds and saving up to $163,000 per month per DNS service node [11][12].
- **BiAn**: A framework based on large language models for intelligent fault localization in production networks, reducing root cause identification time by 20.5% and improving accuracy by 9.2% [13][14].
- **MixNet**: A runtime-reconfigurable optical-electrical network fabric for distributed mixture-of-experts training, improving network cost efficiency by 1.2 to 2.3 times under various bandwidth conditions [15][18].
- **Mazu**: A high-speed encrypted traffic anomaly detection system implemented on programmable switches, which has protected over ten million servers and detects malicious traffic with approximately 90% accuracy [19][22].

Group 3: Overall Impact
- Together, the five research outcomes form a closed technological loop across architecture, data, operations, and security, driving the efficient, reliable, and intelligent development of next-generation network systems [3].
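A new high-speed GPU interconnect design that cuts cost and boosts efficiency for large-model training! Peking University, StepFun, and Lightelligence propose a next-generation high-bandwidth domain architecture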
量子位· 2025-05-19 04:37
Core Viewpoint
- The article discusses the limitations of existing High-Bandwidth Domain (HBD) architectures for large model training and introduces InfiniteHBD, a new architecture designed to address them [1][3][4].

Group 1: Limitations of Existing HBD Architectures
- Current HBD architectures face fundamental limitations in scalability, cost, and fault tolerance: switch-centric designs are expensive and hard to scale, GPU-centric designs suffer from fault propagation, and hybrid designs such as TPUv4 remain far from ideal in cost and fault tolerance [3][10][19].
- Existing architectures fall into three categories (switch-centric, GPU-centric, and hybrid), each with its own limitations in scalability, interconnect cost, fault explosion radius, and fragmentation [7][22].

Group 2: Introduction of InfiniteHBD
- InfiniteHBD embeds Optical Circuit Switching (OCS) in the optical-electrical conversion modules (transceivers) to achieve low-cost scalability and node-level fault isolation [4][29].
- InfiniteHBD costs only 31% as much as NVL-72, wastes almost no GPUs, and improves Model FLOPs Utilization (MFU) by up to 3.37 times compared to traditional architectures [4][48][63].

Group 3: Key Innovations of InfiniteHBD
- InfiniteHBD incorporates three key innovations: OCS-based optical-electrical conversion modules (OCSTrx), a reconfigurable K-Hop Ring topology, and an HBD-DCN orchestration algorithm [30][32][44].
- The OCSTrx supports dynamic point-to-multipoint connections and keeps resource fragmentation low, improving scalability and cost-effectiveness [29][35].

Group 4: Performance Evaluation
- The performance evaluation shows that InfiniteHBD meets the dual demands of computational efficiency and communication performance in large-scale language model training [65].
- The orchestration algorithm optimizes communication efficiency, significantly reducing cross-Top of Rack (ToR) traffic and remaining resilient to node failures [68][70].

Group 5: Cost and Energy Efficiency
- InfiniteHBD has significant advantages in interconnect cost and energy consumption: its interconnect cost is 31% of NVL-72's and its energy consumption is 75% of NVL-72's, a low energy level comparable to TPUv4 [74].
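
The summary names a reconfigurable K-Hop Ring topology and node-level fault isolation but does not describe how they work. The Python sketch below is one plausible reading, not the authors' implementation: it assumes each node's transceiver can be switched to any of the next `k` nodes around the ring, so a failed node is bypassed by pointing its predecessor at the nearest healthy successor. The function names `k_hop_links` and `build_ring` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def k_hop_links(n: int, k: int) -> Dict[int, List[int]]:
    """Candidate forward links (assumed): node i can reach i+1 .. i+k (mod n)."""
    if not 1 <= k < n:
        raise ValueError("require 1 <= k < n")
    return {i: [(i + hop) % n for hop in range(1, k + 1)] for i in range(n)}


def build_ring(n: int, k: int, failed: Set[int]) -> Optional[Dict[int, int]]:
    """Map each healthy node to its ring successor.

    Returns None when some healthy node has no healthy neighbor within
    k hops, i.e. a run of k or more consecutive failures breaks the ring.
    """
    links = k_hop_links(n, k)
    successor: Dict[int, int] = {}
    for node in range(n):
        if node in failed:
            continue  # failed nodes are isolated and carry no traffic
        # Pick the nearest healthy candidate in ring order.
        nxt = next((c for c in links[node] if c not in failed), None)
        if nxt is None:
            return None
        successor[node] = nxt
    return successor or None


if __name__ == "__main__":
    # 8 nodes, 2-hop reach, node 3 failed: node 2 skips ahead to node 4.
    print(build_ring(8, 2, {3}))
    # Two adjacent failures exceed a 2-hop reach, so no ring can be formed.
    print(build_ring(8, 2, {3, 4}))
```

Under this assumption, a larger `k` tolerates longer runs of adjacent failures at the price of more candidate links per node, which is the kind of trade-off a reconfigurable ring would need to balance.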