Large Model Training

Huawei's AI Muscle! No GPUs Needed, and the Large Model Digests an Advanced-Math Problem Every 2 Seconds!
Yicai· 2025-05-30 09:32
Core Viewpoint
- Huawei has made significant advances in large model training with its "Ascend + Pangu Ultra MoE" combination, enabling a fully controllable training process without GPUs and demonstrating industry-leading cluster training performance [2][3].

Group 1: Technical Innovations
- Huawei's training system substantially improves training efficiency, reaching a pre-training Model FLOPs Utilization (MFU) of 41% and a post-training throughput of 35K tokens/s on the CloudMatrix 384 super node [3][34] (a sketch of the MFU calculation follows this summary).
- The company has introduced a series of innovative solutions to the challenges of MoE pre-training and reinforcement learning (RL) post-training, including intelligent parallel strategy selection and global dynamic load balancing [11][17].
- The training system uses a hierarchical All-to-All communication architecture that reduces expert-parallel communication overhead to nearly zero [14][15].

Group 2: Training Process Optimization
- Cluster utilization has been optimized through a simulation-driven intelligent parallel optimization framework that automates the selection of optimal deployment configurations [12][13].
- The team has implemented a memory optimization framework that saves over 70% of activation memory, keeping long-running training reliable even under increased memory pressure [25].
- RL Fusion technology enables flexible deployment modes, significantly improving resource scheduling during the inference phase and doubling utilization in RL post-training [27][28].

Group 3: Model Specifications
- The Pangu Ultra MoE model has 718 billion parameters in a 61-layer Transformer architecture designed for high sparsity and performance [32].
- Training used a cluster of 6K-10K Ascend 800T A2 cards, achieving a high MFU during the pre-training phase [32].
- The architecture scales efficiently to larger parameter counts and clusters, with an MFU above 50% expected in future iterations [32].
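For readers unfamiliar with the MFU metric cited above, here is a minimal sketch of how it is computed, using the common 6 × active-parameters × tokens-per-second approximation for training FLOPs. The per-card peak and the cluster throughput below are assumptions chosen only to illustrate the 41% headline figure, not published specifications.

```python
# Minimal MFU (Model FLOPs Utilization) estimate for MoE training.
# Uses the standard 6 * N_active * tokens/s approximation for training
# FLOPs (forward + backward). The per-card peak and the cluster
# throughput are illustrative assumptions, not published specs.

def model_flops_utilization(active_params: float,
                            tokens_per_sec: float,
                            num_cards: int,
                            peak_flops_per_card: float) -> float:
    """MFU = achieved training FLOPs per second / aggregate peak FLOPs."""
    achieved = 6.0 * active_params * tokens_per_sec
    return achieved / (num_cards * peak_flops_per_card)

mfu = model_flops_utilization(
    active_params=39e9,          # Pangu Ultra MoE's reported activated params
    tokens_per_sec=3.7e6,        # assumed cluster-wide training throughput
    num_cards=6000,              # low end of the reported 6K-10K card range
    peak_flops_per_card=3.5e14,  # assumed per-card peak (not a published spec)
)
print(f"MFU = {mfu:.0%}")        # ~41% with these assumed inputs
```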
Pangu Ultra Near-Trillion-Parameter MoE Model: Industry-Leading, Built on Ascend-Native Long-Term Stable Training
Yicai· 2025-05-29 10:50
Core Viewpoint
- The article discusses advances in Pangu Ultra MoE, a near-trillion-parameter MoE model trained on Ascend NPUs, covering its architecture, training methods, and performance improvements [1][3].

Group 1: Model Architecture and Training Innovations
- Pangu Ultra MoE has 718 billion total parameters and 39 billion activated parameters, with each token activating 8 of 256 routing experts [5][6].
- The model employs Depth-Scaled Sandwich-Norm (DSSN) and TinyInit to improve training stability, reducing gradient spikes by 51% [7][11] (a sketch of the sandwich-norm idea follows this summary).
- The training process uses a dropless training strategy, enabling long-term stable training on over 10 trillion tokens [1][7].

Group 2: Performance and Efficiency
- The architecture is co-designed for the Ascend NPU platform by jointly optimizing computation, communication, and memory metrics, yielding superior training and inference throughput [3][5].
- Pangu Ultra MoE performs robustly across authoritative open-source evaluation sets, outperforming several mainstream models on multiple benchmarks [4][6].

Group 3: Load Balancing and Expert Specialization
- The EP group loss is introduced to maintain load balance among experts while still allowing expert specialization, improving overall training efficiency [12][15].
- The design permits flexible routing choices that promote expert specialization by data domain, evidenced by significant differences in expert selection across languages [16][17].

Group 4: Multi-Token Prediction and Reinforcement Learning
- The Multi-Token Prediction (MTP) strategy improves inference efficiency by proposing multiple candidate tokens for the main model to verify, increasing acceptance length by 38% [20][22].
- The reinforcement learning system in Pangu Ultra MoE addresses training stability and inference performance by iteratively mining hard examples and employing a multi-capability reward system [24][27].
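As a rough illustration of the sandwich-norm structure behind DSSN, the sketch below wraps each sublayer in both a pre-norm and a post-norm and initializes the post-norm gain depth-dependently, so deeper stacks start with smaller residual updates. The 1/sqrt(2·num_layers) initialization is an illustrative stand-in, not the exact depth-scaling rule from the Pangu report, and `SandwichBlock` is a hypothetical name.

```python
# Sketch of a depth-scaled sandwich-norm residual block (needs
# PyTorch >= 2.4 for nn.RMSNorm). The post-norm gain init below is an
# assumed stand-in for the paper's depth-scaling rule.
import math
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Residual block of the form x + post_norm(sublayer(pre_norm(x))).

    Compared with plain pre-norm, the extra output norm bounds the scale
    of each residual update; initializing its gain to shrink with depth
    keeps early updates small in deep stacks, which is the kind of
    mechanism credited with damping gradient spikes."""

    def __init__(self, sublayer: nn.Module, hidden: int, num_layers: int):
        super().__init__()
        self.pre_norm = nn.RMSNorm(hidden)
        self.post_norm = nn.RMSNorm(hidden)
        self.sublayer = sublayer
        with torch.no_grad():
            # Illustrative depth-dependent init, not the paper's exact rule.
            self.post_norm.weight.fill_(1.0 / math.sqrt(2 * num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```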
Training Large Models: You Can Finally "Have It All"
Huxiu APP· 2025-05-29 10:34
Core Insights
- The article discusses advances in the MoE (Mixture of Experts) architecture, focusing on Huawei's Pangu Ultra MoE, which aims to balance model performance and efficiency while addressing the challenges of training large-scale models [1][6][33].

Group 1: MoE Model Innovations
- Huawei's Pangu Ultra MoE has 718 billion parameters and is designed to optimize both the performance and the efficiency of large-scale MoE architectures [6][9].
- The model incorporates advanced components such as MLA (Multi-head Latent Attention) and MTP (Multi-Token Prediction), strengthening its training and inference capabilities [6][7].
- Depth-Scaled Sandwich-Norm (DSSN) and TinyInit improve training stability, reducing gradient spikes by 51% and enabling long-term stable training on over 10 trillion tokens [11][12][14].

Group 2: Load Balancing and Efficiency
- The EP (Expert Parallelism) group load balancing method ensures efficient token distribution among experts, improving training efficiency without sacrificing expert specialization [19][20].
- The EP-Group load balancing loss permits flexible routing choices, promoting expert specialization while maintaining computational efficiency [20][21] (a sketch of this loss follows this summary).

Group 3: Training Techniques and Performance
- The pre-training phase uses dropless training and reaches a 128K-token long-sequence capability, improving learning efficiency on target data [8][14].
- MTP enables speculative inference, increasing acceptance length by 38% over single-token prediction [24][27].
- The post-training reinforcement learning system centers on iterative hard-example mining and multi-capability collaboration, ensuring well-rounded performance across tasks [28][31].

Group 4: Future Implications
- The advances in Pangu Ultra MoE chart a viable path for deploying sparse large models at scale, pushing the performance limits and engineering applicability of MoE architectures [33].
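To make the EP-Group idea concrete, here is a minimal sketch of a Switch-style auxiliary load-balancing loss whose balancing statistics are averaged over an expert-parallel group, rather than per micro-batch or over the whole cluster; the intermediate scope is what leaves routing freedom for specialization. Only the EP-group scoping comes from the coverage above; the loss form, function name, and arguments are assumptions.

```python
# Sketch of an auxiliary load-balancing loss scoped to an expert-parallel
# (EP) group. Uses the common Switch-Transformer form N * sum_i f_i * p_i;
# the exact formulation in Pangu Ultra MoE may differ.
import torch
import torch.distributed as dist

def ep_group_balance_loss(router_probs: torch.Tensor,
                          topk_ids: torch.Tensor,
                          num_experts: int,
                          ep_group=None) -> torch.Tensor:
    """router_probs: [tokens, num_experts] softmax router outputs.
    topk_ids:     [tokens, k] int64 indices of the experts each token uses.
    """
    # f_i: fraction of dispatch slots routed to expert i (local shard).
    one_hot = torch.zeros_like(router_probs).scatter_(1, topk_ids, 1.0)
    f = one_hot.mean(dim=0)
    # p_i: mean router probability assigned to expert i (local shard).
    p = router_probs.mean(dim=0)
    if ep_group is not None:
        # Average the statistics across the EP group only, so balance is
        # enforced at group scope rather than per micro-batch or globally.
        dist.all_reduce(f, op=dist.ReduceOp.AVG, group=ep_group)
        dist.all_reduce(p, op=dist.ReduceOp.AVG, group=ep_group)
    return num_experts * torch.sum(f * p)
```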
Guangzhou's Nansha Goes All Out to Build a New Hub for the AI Industry
Zhong Guo Zheng Quan Bao· 2025-05-28 20:35
Group 1 - The "Bay Area Artificial Intelligence Industry Innovation Alliance" was officially established in Nansha District, Guangzhou, aiming to create a new high ground for the AI industry in the Guangdong-Hong Kong-Macao Greater Bay Area and globally [1][2] - The alliance is initiated by Hong Kong University of Science and Technology (Guangzhou) and Huawei, focusing on integrating various resources from international, Hong Kong, Macao, and mainland research institutions to empower Nansha and promote it as a leading area for AI innovation [1][2] - Nansha aims to upgrade its AI industry ecosystem by focusing on three core tasks: technological innovation, industrial aggregation, and ecological construction, with a goal to form a trillion-level industrial cluster [2] Group 2 - Nansha's AI-related industry scale is projected to reach approximately 10 billion yuan in 2024, with a year-on-year growth of 12%, establishing itself as a significant application demonstration area for AI in China [2] - Over 100 AI-related companies have gathered in Nansha, including CloudWalk Technology and Pony.ai, covering various fields such as AI chips, basic software algorithms, biometric recognition, and natural language processing [2] - The establishment of the alliance is expected to enhance the support for AI companies, with financial backing of up to 10 million yuan for key elements like computing power, data, and algorithms [2] Group 3 - Pony.ai, which settled in Nansha in 2017, has launched China's first autonomous taxi service and reported a revenue of 12.3 million yuan for its autonomous taxi business in Q1 2025, marking a 200% year-on-year increase [3] - The company has formed a global strategic partnership with Uber, planning to integrate its autonomous taxi services into Uber's platform starting in the Middle East [3] - The Guangdong Province has introduced policies to promote the integration of AI and robotics across various sectors, including education, healthcare, and finance [4] Group 4 - CloudWalk Technology, established in 2015 and listed on the Sci-Tech Innovation Board in 2022, focuses on AI technology and its application in key industries, aiming to deepen technology research and scene implementation [4] - Nansha's fully automated terminal has seen a 41.42% year-on-year increase in container throughput in Q1, showcasing the successful integration of advanced technologies like Beidou navigation and AI [4][5] - The terminal is recognized as the world's first fully automated terminal for multimodal transport, capable of continuous operation with a large fleet of autonomous guided vehicles [5]
Bay Area Artificial Intelligence Industry Innovation Alliance Established
Zhong Guo Jing Ji Wang· 2025-05-27 03:32
Group 1
- The establishment of the Bay Area Artificial Intelligence Industry Innovation Alliance aims to promote collaborative innovation in the Guangdong-Hong Kong-Macao Greater Bay Area, drawing over 400 representatives from government, academia, and international expert communities [1][2].
- The alliance focuses on three core tasks: technological breakthroughs in key areas such as large model training and intelligent chips, the formation of a trillion-yuan-level industrial cluster, and the construction of a comprehensive industrial service system [2][3].
- The alliance is positioned to make Nansha a leading hub for AI innovation, a national benchmark for "AI+" industry development, and a global gathering place for AI talent [2].

Group 2
- Establishing the alliance is a significant step toward implementing the national "New Generation Artificial Intelligence Development Plan" and advancing the Greater Bay Area as a globally influential international science and technology innovation center [3].
- Multiple AI projects were signed during the event, including a collaboration between Huawei and the Hong Kong University of Science and Technology (Guangzhou) to launch a "Science and Education Innovation Incubation Center" [3].
- Huawei is also working with China Railway Tunnel Group to plan a "Tunnel Intelligence" large-model system for the tunnel engineering industry, focusing on digital talent cultivation and the digital transformation of the entire tunnel engineering process [3].
A New High-Speed GPU Interconnect Design That Cuts Cost and Boosts Efficiency for Large Model Training! Peking University, StepFun, and Lightelligence Propose a Next-Generation High-Bandwidth Domain Architecture
QbitAI· 2025-05-19 04:37
Core Viewpoint
- The article discusses the limitations of existing High-Bandwidth Domain (HBD) architectures for large model training and introduces InfiniteHBD, a new architecture that addresses those limitations through innovative design and technology [1][3][4].

Group 1: Limitations of Existing HBD Architectures
- Current HBD architectures face fundamental limits in scalability, cost, and fault tolerance: switch-centric designs are expensive and hard to scale, GPU-centric designs suffer from fault propagation, and hybrid designs such as TPUv4 remain suboptimal in both cost and fault tolerance [3][10][19].
- Existing architectures fall into three categories - switch-centric, GPU-centric, and hybrid - each with its own limitations in scalability, interconnect cost, fault blast radius, and fragmentation [7][22].

Group 2: Introduction of InfiniteHBD
- InfiniteHBD embeds Optical Circuit Switching (OCS) capability into optical-electrical conversion modules to achieve low-cost scalability and node-level fault isolation [4][29].
- InfiniteHBD's interconnect cost is only 31% of NVL-72's, with near-zero GPU waste, and it improves Model FLOPs Utilization (MFU) by up to 3.37x over traditional architectures [4][48][63].

Group 3: Key Innovations of InfiniteHBD
- InfiniteHBD rests on three key innovations: OCS-based optical-electrical conversion modules (OCSTrx), a reconfigurable K-Hop Ring topology, and an HBD-DCN orchestration algorithm [30][32][44] (a sketch of the K-Hop Ring idea follows this summary).
- OCSTrx enables dynamic point-to-multipoint connections with low resource fragmentation, improving scalability and cost-effectiveness [29][35].

Group 4: Performance Evaluation
- Evaluation shows InfiniteHBD meets the dual demands of computational efficiency and communication performance for large-scale language model training [65].
- The orchestration algorithm optimizes communication efficiency, significantly reducing cross-Top-of-Rack (ToR) traffic and demonstrating resilience to node failures [68][70].

Group 5: Cost and Energy Efficiency
- InfiniteHBD shows significant advantages in interconnect cost and energy use: its interconnect cost is 31% of NVL-72's and its energy consumption 75% of NVL-72's, comparable to TPUv4's low energy levels [74].
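As a rough picture of the K-Hop Ring idea, the sketch below builds the ring-with-k-hop-links adjacency and shows why a single failed node costs only itself: any sequence of healthy nodes whose ring gaps are at most k can still be stitched into one HBD ring. This is an illustrative reconstruction from the description above, not the authors' code; the function names and the joinability rule are assumptions.

```python
# Illustrative K-Hop Ring reconstruction (not the authors' code).
# Each node holds reconfigurable (OCS) links to every node within k
# positions along a physical ring; after a failure, the healthy nodes
# can be re-stitched into a ring as long as every gap is at most k hops.

def k_hop_ring(num_nodes: int, k: int) -> dict[int, set[int]]:
    """Adjacency of a K-Hop Ring topology."""
    adj: dict[int, set[int]] = {i: set() for i in range(num_nodes)}
    for i in range(num_nodes):
        for hop in range(1, k + 1):
            adj[i].add((i + hop) % num_nodes)
            adj[i].add((i - hop) % num_nodes)
    return adj

def ring_avoiding(failed: set[int], num_nodes: int, k: int):
    """Return a ring over the healthy nodes if every consecutive pair of
    healthy nodes is within k ring positions (i.e., the OCS can bridge
    the gap left by the failed nodes); otherwise None."""
    healthy = [i for i in range(num_nodes) if i not in failed]
    gaps_ok = all((b - a) % num_nodes <= k
                  for a, b in zip(healthy, healthy[1:] + healthy[:1]))
    return healthy if gaps_ok else None

# One failed node out of eight, k = 2: the remaining seven still form
# a full HBD ring, so only the failed node's capacity is lost.
print(ring_avoiding({3}, num_nodes=8, k=2))  # [0, 1, 2, 4, 5, 6, 7]
```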
Electronics Industry Weekly Tracker: Architecture-Level Innovation - Huawei's UB Mesh Takes Direct Aim at the "Communication Wall" and Cost Pain Points of Large Model Training - 20250511
Soochow Securities· 2025-05-11 14:05
Investment Rating
- The report maintains an "Add" rating for the electronics industry, indicating a positive outlook for the sector over the next six months [1].

Core Insights
- Huawei's UB Mesh architecture addresses the cost and performance challenges of large model training, achieving a 2.04x improvement in cost efficiency over the Clos architecture [3].
- UB Mesh uses a 4D/5D topology to build high-bandwidth, low-cost, high-reliability AI training clusters, significantly reducing reliance on expensive network infrastructure [6][7].
- The report sees potential for Huawei's Ascend 920 series chips to capture a significant share of domestic computing power demand, driven by their innovative capabilities and the ongoing trend of domestic computing power substitution [7].

Summary by Sections
Industry Trends
- The UB Mesh architecture targets large-scale AI training clusters, providing flexible multi-dimensional aggregation that reduces transmission overhead [6].
- Its reliance on short-distance direct interconnects minimizes cost and improves system reliability, with 86.7% of the system's cabling being passive [6].

Performance Comparison
- Under the same training benchmarks, UB Mesh matched the Clos architecture's performance while cutting hardware costs significantly, shrinking network infrastructure's share of total cost from 67% to 20% [3] (a back-of-envelope version of this comparison follows this summary).
- Reduced use of high-performance switches and optical modules cut the related operating costs by 35% [3].

Market Opportunities
- The report identifies supply-chain companies that may benefit from Huawei's advances, including SMIC, Huafeng Technology, and others [7].
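To see how the 67%-to-20% shift in network cost share implies a large total-cost gap, consider the back-of-envelope model below. The normalized spend figures are assumptions chosen only to reproduce the reported shares; note that the report's 2.04x figure is cost *efficiency*, which also folds in measured performance, so the raw cost ratio here differs.

```python
# Back-of-envelope cluster cost-share model. Only the 67% and 20%
# network cost shares come from the report; the normalized spend
# numbers are assumptions chosen to reproduce those shares.

def network_share(accel_cost: float, net_cost: float) -> float:
    """Fraction of total cluster spend going to the network fabric."""
    return net_cost / (accel_cost + net_cost)

accel = 33.0                     # normalized accelerator spend (assumed)
clos_net = 67.0                  # switch tiers + optical modules (assumed)
mesh_net = 8.25                  # mostly passive short-reach cabling (assumed)

print(f"Clos network share:    {network_share(accel, clos_net):.0%}")  # 67%
print(f"UB Mesh network share: {network_share(accel, mesh_net):.0%}")  # 20%

ratio = (accel + clos_net) / (accel + mesh_net)
print(f"raw hardware-cost ratio: {ratio:.2f}x")
# ~2.42x under these assumptions; the report's 2.04x is cost *efficiency*,
# which also accounts for the measured training performance of each fabric.
```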
CVPR Oral | Professor Li Wu-Jun's Group at Nanjing University Unveils the Distributed Training Algorithm UniAP, Accelerating Large Model Training by Up to 3.8x
Jiqizhixin· 2025-04-30 04:23
Professor Li Wu-Jun is the corresponding author; master's students Lin Hao (now graduated and working at Alibaba), Wu Ke, and Li Jie are co-first authors, and doctoral student Li Jun is a contributing author.

High training cost has become one of the main obstacles to the sustainable development of large models and of artificial intelligence more broadly.

Large models are typically trained with multi-machine, multi-card distributed training, which is enormously challenging: even with sufficient hardware, practitioners unfamiliar with distributed training will most likely fail to get the training process running at all (a 64%-87% probability, as verified experimentally) because of unreasonable hyperparameter settings, such as how the model is partitioned and placed and how the data is partitioned and placed.

Moreover, when large model training is slow, those unfamiliar with distributed training tend to reach only for scale-out remedies such as adding GPU hardware, overlooking the scale-up leverage of the distributed training algorithm itself.

The paper has been accepted by CVPR 2025 as an Oral (0.7% of all submissions, 3.3% of all accepted papers).

Method Overview

In practice, the distributed training algorithm greatly affects how well the hardware's compute is utilized. A high-efficiency distributed training algorithm achieves high compute utilization: training the same model on the same hardware, it can be several times or even tens of times faster than a low-efficiency one. Put differently, training the same model with a high-efficiency algorithm costs less than with a low-efficiency one, potentially saving several times or even tens of times the expense.
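To illustrate the kind of search space an automatic-parallelism planner like UniAP explores, the toy sketch below enumerates (data, tensor, pipeline) parallel degrees that factor the device count and scores each with a crude analytical cost model. UniAP itself jointly optimizes inter- and intra-layer parallelism with a principled solver over measured costs; the closed-form cost model, function names, and constants here are illustrative assumptions only.

```python
# Toy illustration of what an automatic-parallelism planner automates:
# enumerate (dp, tp, pp) degrees that factor the device count, score
# each candidate with a cost model, and keep the cheapest. The cost
# model below is an assumed stand-in, not UniAP's measured one.
from itertools import product

def candidate_plans(num_devices: int):
    """All (data, tensor, pipeline) degree triples whose product
    equals the device count."""
    for dp, tp, pp in product(range(1, num_devices + 1), repeat=3):
        if dp * tp * pp == num_devices:
            yield dp, tp, pp

def step_time(dp: int, tp: int, pp: int,
              compute: float = 1000.0, comm: float = 1.0,
              bubble: float = 0.05) -> float:
    """Assumed cost model: compute shards across all devices, tensor
    parallelism adds per-layer collectives, data parallelism adds a
    (mostly overlapped) gradient all-reduce, pipelining adds bubbles."""
    t_compute = compute / (dp * tp * pp)
    t_tp_comm = comm * (tp - 1)
    t_dp_comm = comm * (dp - 1) * 0.1
    pipeline_factor = 1.0 + bubble * (pp - 1)
    return (t_compute + t_tp_comm + t_dp_comm) * pipeline_factor

best = min(candidate_plans(64), key=lambda plan: step_time(*plan))
print("best (dp, tp, pp):", best)  # (16, 2, 2) with these toy constants
```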