Large Model Training

New GPU High-Speed Interconnect Design Cuts Costs and Boosts Efficiency for Large Model Training! Peking University, StepFun, and Lightelligence Propose a New-Generation High-Bandwidth Domain Architecture
量子位· 2025-05-19 04:37
Core Viewpoint
- The article discusses the limitations of existing High-Bandwidth Domain (HBD) architectures for large model training and introduces InfiniteHBD, a new architecture that addresses these limitations through innovative design and technology [1][3][4].

Group 1: Limitations of Existing HBD Architectures
- Current HBD architectures face fundamental limitations in scalability, cost, and fault tolerance: switch-centric designs are expensive and hard to scale, GPU-centric designs suffer from fault propagation, and hybrid designs such as TPUv4 remain suboptimal in both cost and fault tolerance [3][10][19].
- Existing architectures fall into three categories: switch-centric, GPU-centric, and hybrid, each with its own limitations in scalability, interconnect cost, fault explosion radius, and fragmentation [7][22].

Group 2: Introduction of InfiniteHBD
- InfiniteHBD is proposed as a solution, embedding Optical Circuit Switching (OCS) in optical-electrical conversion modules to achieve low-cost scalability and node-level fault isolation [4][29].
- InfiniteHBD costs only 31% as much as NVL-72, with near-zero GPU waste, and improves Model FLOPs Utilization (MFU) by up to 3.37x over traditional architectures [4][48][63].

Group 3: Key Innovations of InfiniteHBD
- InfiniteHBD rests on three key innovations: OCS-based optical-electrical conversion modules (OCSTrx), a reconfigurable K-Hop Ring topology (a toy sketch of the ring idea follows this summary), and an HBD-DCN orchestration algorithm [30][32][44].
- The OCSTrx enables dynamic point-to-multipoint connections with low resource fragmentation, improving scalability and cost-effectiveness [29][35].

Group 4: Performance Evaluation
- The evaluation shows that InfiniteHBD meets the dual demands of computational efficiency and communication performance for large-scale language model training [65].
- The orchestration algorithm optimizes communication efficiency, significantly reducing cross-Top-of-Rack (ToR) traffic and demonstrating resilience against node failures [68][70].

Group 5: Cost and Energy Efficiency
- InfiniteHBD shows significant advantages in interconnect cost and energy consumption: interconnect cost is 31% of NVL-72's and energy consumption 75% of NVL-72's, on par with TPUv4's low energy levels [74].
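To make the reconfigurable K-Hop Ring concrete, here is a minimal Python sketch. It is a hypothetical illustration only: the function names, the hop rule, and the rebuild-around-a-failure step are assumptions made for this digest, not code or terminology from the InfiniteHBD paper.

```python
# Hypothetical K-Hop Ring sketch: every node links to all neighbors within
# k hops along a ring; a failed node is dropped by rebuilding the ring over
# the healthy nodes, so the blast radius of a failure is one node.

def k_hop_ring(nodes: list[int], k: int) -> dict[int, set[int]]:
    """Connect each node to every neighbor within k hops along the ring."""
    n = len(nodes)
    links: dict[int, set[int]] = {v: set() for v in nodes}
    for i, v in enumerate(nodes):
        for hop in range(1, k + 1):
            links[v].add(nodes[(i + hop) % n])  # clockwise neighbor
            links[v].add(nodes[(i - hop) % n])  # counterclockwise neighbor
    return links

def isolate_failure(nodes: list[int], failed: int, k: int) -> dict[int, set[int]]:
    """Node-level fault isolation: rebuild the ring without the failed node."""
    return k_hop_ring([v for v in nodes if v != failed], k)

ring = k_hop_ring(list(range(8)), k=2)
print(sorted(ring[0]))                                     # [1, 2, 6, 7]
print(sorted(isolate_failure(list(range(8)), 3, k=2)[2]))  # [0, 1, 4, 5]
```

In the architecture the summary describes, this reconfiguration would presumably be performed by the optical circuit switches inside the OCSTrx modules rather than in software; the sketch only shows the isolation property itself.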
Our Province Builds a Fully Domestically Developed Flood Forecasting and Dispatch System
Liao Ning Ri Bao· 2025-05-13 01:46
Recently, the province's water resources authorities completed a fully domestically developed flood forecasting and dispatch system, achieving full flood-forecast coverage of the province's large and medium-sized reservoirs and of rivers with drainage areas above 200 square kilometers. The average flood forecast lead time has been extended by 7 days, laying the groundwork for earlier flood-control decisions.

Illustrating the new system's advantages, an official at the water-regime center of the provincial hydrology bureau gave an example: along a single river, differing soil and water conditions in different areas produce both infiltration-excess and saturation-excess runoff, yet in the past each river had only one model, so forecasters had to combine model output with empirical data to reach a judgment. The new system's combined, integrated models resolve the forecast timeliness and accuracy problems that such within-river differences used to cause (a toy sketch of the idea follows this summary).

Next, the provincial water resources authorities will deepen the application of weather-radar nowcasting, fuse multi-source rainfall data with underlying-surface information such as basin terrain, soil, and vegetation, and develop distributed hydrological models, so that longer lead times and higher accuracy improve together. They will also build a Liaoning hydrology knowledge base, carry out large model training and intelligent-application development, and push forward the intelligent upgrading of the province's hydrology sector.

The new system also embeds forecast schemes for the province's 102 national basic hydrological stations, 37 large reservoirs, and 76 medium-sized reservoirs, and integrates forecast schemes for 286 small and medium river stations, strengthening basin-wide flood forecasting across the province. Coupled hydrological-hydrodynamic models underpin one-dimensional flood-routing models for 16 major rivers, shifting forecasting and rehearsal from single cross-sections to whole river systems for the first time and precisely linking the province's 1,278 flood-control weak points (substandard dike sections, hazardous works, vulnerable villages, and sandy dikes and foundations), providing for flood ...
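The "one model per river" limitation and its ensemble fix can be made concrete with a toy sketch. The two runoff formulas below are standard textbook simplifications of infiltration-excess and saturation-excess runoff; every name, number, and regime label is an illustrative assumption, not the Liaoning system's implementation.

```python
# Toy model-combination sketch: route each sub-basin through the runoff
# model that fits its soil and terrain, instead of one model per river.

def infiltration_excess(rain_mm: float, infil_cap_mm: float) -> float:
    """Runoff forms once rainfall exceeds the soil's infiltration capacity."""
    return max(0.0, rain_mm - infil_cap_mm)

def saturation_excess(rain_mm: float, storage_mm: float, capacity_mm: float) -> float:
    """Runoff forms once soil storage fills to capacity."""
    return max(0.0, rain_mm - (capacity_mm - storage_mm))

def basin_runoff(sub_basins: list[dict]) -> float:
    """Sum runoff over sub-basins, each using its own model."""
    total = 0.0
    for b in sub_basins:
        if b["regime"] == "arid":  # dry uplands: infiltration-excess
            total += infiltration_excess(b["rain"], b["infil_cap"])
        else:                      # humid lowlands: saturation-excess
            total += saturation_excess(b["rain"], b["storage"], b["capacity"])
    return total

print(basin_runoff([
    {"regime": "arid", "rain": 40.0, "infil_cap": 25.0},
    {"regime": "humid", "rain": 40.0, "storage": 80.0, "capacity": 100.0},
]))  # 15.0 + 20.0 = 35.0 mm of runoff
```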
Electronics Industry Weekly Tracking Report: Architecture-Level Innovation, Huawei's UB Mesh Takes Aim at the "Communication Wall" and Cost Pain Points of Large Model Training - 20250511
Soochow Securities· 2025-05-11 14:05
Investment Rating
- The report maintains an "Add" rating for the electronics industry, indicating a positive outlook for the sector over the next six months [1].

Core Insights
- Huawei's UB Mesh architecture addresses the cost and performance challenges of large model training, achieving a 2.04x cost-efficiency improvement over the Clos architecture [3].
- UB Mesh uses a 4D/5D topology to build high-bandwidth, low-cost, high-reliability AI training clusters while sharply reducing reliance on expensive network infrastructure (a toy topology sketch follows this summary) [6][7].
- The report highlights the potential for Huawei's Ascend 920 series chips to capture a significant share of domestic computing power demand, driven by their innovative capabilities and the ongoing trend toward domestic computing power substitution [7].

Summary by Sections
Industry Trends
- The UB Mesh architecture is designed for large-scale AI training clusters, providing flexible multi-dimensional aggregation that reduces transmission overhead [6].
- Its reliance on short-distance direct interconnects minimizes cost and improves system reliability, with 86.7% of the system's cabling being passive [6].

Performance Comparison
- Under the same training benchmarks, UB Mesh matched the Clos architecture's performance while significantly lowering hardware costs, cutting network infrastructure's share of cluster cost from 67% to 20% [3].
- Reduced use of high-performance switches and optical modules cut operational costs by 35% [3].

Market Opportunities
- The report identifies several supply-chain companies that may benefit from Huawei's advances, including SMIC, Huafeng Technology, and others [7].
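A small sketch makes the 4D/5D topology point concrete: in an n-dimensional mesh, every node has a direct short link to its +1/-1 neighbor along each dimension, which is how most traffic can ride cheap passive cables rather than a switch fabric. The coordinates, dimension sizes, and no-wraparound rule below are toy assumptions, not Huawei's UB Mesh implementation.

```python
# Toy n-dimensional mesh neighborhood, in the spirit of the 4D/5D topology
# the report attributes to UB Mesh. All parameters here are assumptions.

def mesh_neighbors(coord: tuple[int, ...], dims: tuple[int, ...]):
    """Yield the direct +/-1 neighbors of coord along every dimension."""
    for axis in range(len(dims)):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] += step
            if 0 <= nxt[axis] < dims[axis]:  # mesh, not torus: no wraparound
                yield tuple(nxt)

dims = (2, 2, 4, 8)  # a toy 4D mesh of 2*2*4*8 = 128 nodes
print(list(mesh_neighbors((0, 1, 2, 3), dims)))
# [(1, 1, 2, 3), (0, 0, 2, 3), (0, 1, 1, 3), (0, 1, 3, 3), (0, 1, 2, 2), (0, 1, 2, 4)]
```

In such a mesh each node needs at most 2n direct links (8 in 4D), which is consistent with the report's point that most cabling can stay short and passive.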
CVPR Oral | Professor Wu-Jun Li's Group at Nanjing University Unveils UniAP, a Distributed Training Algorithm that Accelerates Large Model Training by up to 3.8x
机器之心· 2025-04-30 04:23
Professor Wu-Jun Li is the corresponding author; master's students Hao Lin (now graduated and working at Alibaba), Ke Wu, and Jie Li are co-first authors, and doctoral student Jun Li is a contributing author.

High training cost has become one of the main obstacles to the sustainable development of large models and artificial intelligence.

Large models are typically trained with multi-machine, multi-GPU distributed training, and distributed training at this scale is enormously challenging: even with sufficient hardware, someone unfamiliar with distributed training will, with high probability (64%-87% in the paper's experiments), fail to get training running at all because of unreasonable hyperparameter settings, such as how the model is partitioned and placed and how the data is partitioned and placed (a toy illustration of this search space follows this summary).

Moreover, when large model training is slow, those unfamiliar with distributed training tend to think only of scale-out remedies such as adding GPU hardware, overlooking the scale-up gains that a better distributed training algorithm can deliver.

The paper was accepted to CVPR 2025 as an Oral (0.7% of all submissions, 3.3% of accepted papers).

Method Overview

In fact, the distributed training algorithm greatly affects how well the hardware's compute is utilized: a high-efficiency distributed training algorithm achieves high compute utilization. Training the same model on the same hardware, a high-efficiency distributed training algorithm can be several times, or even tens of times, faster than a low-efficiency one.

In other words, training the same model, a high-efficiency distributed training algorithm costs less than a low-efficiency one, potentially saving several times or even tens of ...
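To see why naive hyperparameter choices so often fail, here is a brute-force toy of the search space that an automatic method like UniAP navigates: the ways to split a model across 8 GPUs into pipeline, tensor, and data parallel degrees, which of those configurations even fit in memory, and which is cheapest. The memory figures and the cost formula are invented assumptions for illustration; UniAP's actual joint optimization is not this loop.

```python
# Toy parallelism-strategy search: all (pipeline, tensor, data) splits of a
# model over 8 GPUs, filtered by a made-up memory check and ranked by a
# made-up cost model. Illustrative only; not UniAP's algorithm.

from itertools import product

GPUS = 8
MODEL_MEM_GB, GPU_MEM_GB = 40.0, 24.0

def feasible_cost(pp: int, tp: int, dp: int) -> float | None:
    """Return a toy step cost for a feasible split, else None."""
    if pp * tp * dp != GPUS:
        return None
    if MODEL_MEM_GB / (pp * tp) > GPU_MEM_GB:  # weights sharded by pp*tp
        return None                            # config fails outright (OOM)
    # toy cost: pipeline bubbles grow with pp, tensor-parallel comms with tp
    return 1.0 + 0.15 * (pp - 1) + 0.25 * (tp - 1)

splits = [(pp, tp, dp)
          for pp, tp, dp in product([1, 2, 4, 8], repeat=3)
          if feasible_cost(pp, tp, dp) is not None]
print(f"{len(splits)} feasible of {4 ** 3} candidate splits")
best = min(splits, key=lambda s: feasible_cost(*s))
print(best)  # (2, 1, 4): 2-stage pipeline, no tensor split, 4-way data parallel
```

Even in this toy, one of the ten product-valid splits fails outright and the step costs of the rest span more than a 2x range, which gives a flavor of why automating the strategy search pays off.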