Large Model Training and Inference
Strengthening Compute Through the Network to Break the Trillion-Parameter Model Training and Inference Bottleneck: H3C Super Nodes Forge a New Paradigm for AI Infrastructure
Huan Qiu Wang · 2025-12-19 08:29
Core Insights
- The article discusses the launch of the H3C UniPoD S80000 super node by Unisplendour Corporation's subsidiary H3C, aimed at addressing the communication wall and low compute utilization that constrain large model training and inference [1][5].

Group 1: Product Features and Innovations
- The H3C UniPoD S80000 super node is built on a "computing power × connectivity" concept, achieving full GPU interconnection through a Scale-up architecture; inter-card bandwidth is 8x that of a traditional 8-card server, and single-card inference efficiency improves by 80% [1][5].
- The super node supports liquid cooling for high-density deployment and is compatible with GPUs from multiple vendors, meeting the long-term stability requirements of large model training through coordinated software-hardware optimization [1][5][7].

Group 2: Market Context and Demand
- As demand for high-performance computing surges, driven by the growing prevalence of high-parameter MoE models such as DeepSeek, the ability to train and serve large models efficiently becomes critical to staying competitive in the rapidly evolving AI landscape [2][3].
- The article highlights the importance of building robust and efficient AI computing infrastructure, with super node products emerging as a key focus of the current computing power sector [2][3].

Group 3: Technical Advantages
- Traditional cross-node communication introduces significant overhead that depresses compute utilization; Scale-up technology enables direct high-speed communication between GPUs, markedly raising GPU utilization and reducing idle time (a toy utilization model follows this summary) [3][4].
- In the inference phase, the super node's support for independent scaling of compute and storage resources allows efficient resource allocation, particularly in scenarios with frequent KV Cache access, minimizing resource waste while keeping latency low [4][5].

Group 4: Stability and Reliability
- The H3C UniPoD S80000 is designed for stability and maintainability, which is crucial for preventing training interruptions that waste resources and degrade model performance; coordinated software-hardware optimization keeps long-running training uninterrupted [7][8].
- The company is actively investing in optical interconnect technology to capture its advantages in speed, latency, and energy consumption while addressing the reliability issues of optical components [7][9].

Group 5: Future Outlook
- H3C aims to continue developing super node products that support large-scale deployments of 1,024 cards and above, growing the scale and efficiency of intelligent computing clusters [7][8].
- The company is committed to building strong, diverse, and continuously evolving computing infrastructure to support the AI industry's growth and transformation [8][9].
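To make the communication-overhead argument in Group 3 concrete, here is a minimal utilization model. It is a back-of-the-envelope sketch, not H3C's methodology: the per-step compute time, data volume, and baseline bandwidth are invented for illustration, and communication is assumed not to overlap with compute.

```python
# Toy model of GPU utilization vs. interconnect bandwidth.
# All numbers are illustrative assumptions, not H3C measurements.

def utilization(compute_s: float, comm_bytes: float, bandwidth_bps: float) -> float:
    """Fraction of wall-clock time spent computing, assuming communication
    is fully exposed (not overlapped with compute)."""
    comm_s = comm_bytes / bandwidth_bps
    return compute_s / (compute_s + comm_s)

COMPUTE_S = 0.10   # assumed compute time per step (seconds)
COMM_BYTES = 8e9   # assumed data exchanged per step (8 GB)
BASE_BW = 50e9     # assumed effective inter-card bandwidth of an 8-card server (50 GB/s)

for name, bw in [("traditional 8-card server", BASE_BW),
                 ("Scale-up super node (8x bandwidth)", 8 * BASE_BW)]:
    print(f"{name}: utilization = {utilization(COMPUTE_S, COMM_BYTES, bw):.0%}")

# With these assumptions, utilization rises from about 38% to about 83%:
# the same GPUs spend far more of each second doing useful work.
```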
Huawei Cloud: CloudMatrix384 Breaks Through the Large Model Training and Inference Bottleneck, Accelerating the Leap to Industry-Wide Intelligence
Sou Hu Cai Jing · 2025-06-24 11:58
Core Insights
- The Huawei Developer Conference 2025 featured a summit on the "CloudMatrix384 Ascend AI Cloud Service," highlighting its role in accelerating AI innovation across industries by overcoming compute, networking, and storage bottlenecks [1][8].

Group 1: AI Infrastructure Standards
- The rapid evolution of large AI models strains compute, networking, and storage capabilities, challenges referred to as the "computational wall," "communication wall," and "storage wall" [2].
- The CloudMatrix384 Ascend AI Cloud Service is positioned as a new standard for AI infrastructure that addresses these challenges effectively [2][6].

Group 2: Technical Features of CloudMatrix384
- The service combines "hardware reconstruction + software intelligence" to create high-density, high-speed, and efficient AI-native infrastructure [6].
- High density: 384 Ascend NPUs and 192 Kunpeng CPUs are connected through the MatrixLink high-speed network, forming a "super AI server" that supports scaling to 160,000 nodes [6].
- High speed: the MatrixLink architecture delivers 2.8 Tb/s of bandwidth and cuts communication latency to the nanosecond range (a back-of-the-envelope transfer-time estimate follows this summary) [6].
- Efficiency: intelligent scheduling raises the effective utilization of compute resources by more than 50% [7].

Group 3: Industry Applications and Collaborations
- The CloudMatrix384 service has been validated across industries, with companies like Silicon Flow demonstrating significant performance improvements in AI model training and inference [12][15].
- Other companies, including Sina and iFlytek, have reported enhanced efficiency and performance in their AI applications using the CloudMatrix384 service [22].
- The service is expected to integrate deeply into sectors such as e-commerce, social media, entertainment, finance, and automotive, thereby lowering the barriers to AI innovation [22].

Group 4: Future Outlook
- The summit served as a platform for showcasing technological achievements and fostering collaboration among industry players, marking the entry of AI infrastructure into the "super node era" [22].
- Huawei Cloud aims to partner with clients and stakeholders to drive industry-wide intelligent transformation [22].
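As a rough sense of scale for the quoted 2.8 Tb/s MatrixLink bandwidth, the short calculation below estimates how long various payloads would take to move across the fabric. The payload sizes are assumptions for illustration, not Huawei workload figures, and real transfers would see protocol overhead this ignores.

```python
# Back-of-the-envelope transfer times at the quoted MatrixLink bandwidth.
# Payload sizes are illustrative assumptions, not Huawei-published workloads.

LINK_BW_BPS = 2.8e12  # 2.8 Tb/s, per the article (bits per second)

def transfer_ms(payload_gib: float, bandwidth_bps: float = LINK_BW_BPS) -> float:
    """Milliseconds to move `payload_gib` GiB over a link of `bandwidth_bps` bits/s."""
    bits = payload_gib * 1024**3 * 8
    return bits / bandwidth_bps * 1e3

for payload in (1, 16, 64):  # GiB: e.g. an activation batch, a KV cache, a weight shard
    print(f"{payload:>3} GiB -> {transfer_ms(payload):6.1f} ms at 2.8 Tb/s")

# 1 GiB moves in about 3 ms; even a 64 GiB shard takes under 0.2 s,
# which is why tightly coupled NPUs can behave like one "super AI server".
```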
Huawei's "Digital Wind Tunnel" Rehearses 10,000-Card Cluster Plans in Hours, as Ascend Helps Large Models Run "Fast and Stable"
雷峰网 · 2025-06-11 11:00
Core Viewpoint
- The article discusses the launch of the Ascend modeling and simulation platform, which aims to optimize the interaction among workload, optimization strategy, and system architecture to improve infrastructure performance [1].

Group 1: Challenges in AI Model Training
- Over 60% of computing power is wasted through hardware resource mismatches and system coupling, highlighting the inefficiency of traditional optimization methods [2].
- Training a large model is likened to "slamming the gas pedal": the MoE model requires precise balancing of computation and memory to avoid efficiency drops [4].
- Dynamic real-time inference systems struggle to meet high-throughput and low-latency requirements simultaneously across varying task types [4].

Group 2: Solutions and Innovations
- The "digital wind tunnel" pre-simulates complex AI workloads in a virtual environment, identifying bottlenecks and optimization strategies before real-world deployment (a simplified configuration-search sketch follows this summary) [6].
- The Sim2Train framework raises the efficiency of large-scale training clusters through automatic deployment-space optimization and dynamic performance awareness, achieving a 41% improvement in resource utilization [7].
- The Sim2Infer framework focuses on real-time optimization of inference systems, delivering over 30% performance improvement through adaptive mixed-precision inference and global load balancing [8].

Group 3: High Availability and Reliability
- The Sim2Availability framework keeps the Ascend computing system highly available, achieving 98% uptime and rapid recovery from failures through advanced optimization techniques [11].
- The system employs comprehensive monitoring to track hardware state and optimize software fault management, enhancing overall reliability [13].

Group 4: Future Outlook
- As new applications evolve, demand for innovative system architectures will grow, requiring continued advances in modeling and simulation methods to support the development of computing infrastructure [16].
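The "digital wind tunnel" idea, scoring candidate cluster configurations in simulation before committing real hardware, can be sketched with a deliberately simplified analytic cost model. Everything below (the cost formula, the constants, the memory constraint) is hypothetical and stands in for the much richer models a platform like Sim2Train would use.

```python
# Minimal "digital wind tunnel" sketch: rank parallelism configurations
# with an analytic cost model instead of runs on real hardware.
# The cost model and all constants are hypothetical illustrations.

from itertools import product

TOTAL_CARDS = 1024   # assumed cluster size
STEP_FLOPS = 4e15    # assumed FLOPs per training step
CARD_FLOPS = 300e12  # assumed sustained FLOP/s per card
COMM_COST_S = 0.002  # assumed cost per active parallelism dimension

def predicted_step_time(tensor_par: int, pipeline_par: int, data_par: int) -> float:
    """Toy cost: ideal compute time plus a penalty per parallelism dimension > 1,
    scaled by the widest intra-model split (a stand-in for comm volume)."""
    compute_s = STEP_FLOPS / (CARD_FLOPS * TOTAL_CARDS)
    dims_active = sum(p > 1 for p in (tensor_par, pipeline_par, data_par))
    return compute_s + dims_active * COMM_COST_S * max(tensor_par, pipeline_par)

best = None
for tp, pp in product([1, 2, 4, 8], [1, 2, 4, 8]):
    if TOTAL_CARDS % (tp * pp):
        continue
    if tp * pp < 4:
        continue  # assumed memory constraint: weights must shard across >= 4 cards
    dp = TOTAL_CARDS // (tp * pp)
    t = predicted_step_time(tp, pp, dp)
    if best is None or t < best[0]:
        best = (t, tp, pp, dp)

t, tp, pp, dp = best
print(f"best config: TP={tp} PP={pp} DP={dp}, predicted step time {t*1e3:.1f} ms")
```

Sweeping the whole configuration space against such a model is cheap (milliseconds here), which is what lets a simulator rehearse a 10,000-card deployment plan in hours rather than burning cluster time on trial runs.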
Viewed Through the DeepSeek Deployment: How Does Huawei Enable the MoE Architecture to "Welcome" a Massive Number of "Experts"?
AI前线 · 2025-05-22 04:30
Core Viewpoint
- Model development has shifted from early algorithm optimization to deep innovation at the system engineering level, moving from a digital era of bit traffic to a Token economy; daily Token consumption in China has risen from hundreds of billions to tens of trillions [1].

Group 1: Model Optimization
- Huawei has made significant optimizations for DeepSeek, focusing on three main areas to enhance compatibility and support for enterprise applications [3].
- On the pre-training side, Huawei implemented DualPipe technology and improved it with the DualPipe-V variant to minimize static memory usage [6].
- At the operator level, Huawei improved execution efficiency with the MRN PO fusion operator and optimized low-latency communication [7].

Group 2: System Architecture
- Huawei has developed a new "super node" inference architecture that interconnects multiple GPUs to reduce communication latency and improve training throughput [14].
- The Atlas 900 A3 SuperCluster is designed to raise cluster computing efficiency and reliability, improving training efficiency 2.7-fold [15].
- The OmniPlacement algorithm optimizes resource utilization by dynamically adapting to expert activation data, improving throughput by 10% (a generic placement sketch follows this summary) [19].

Group 3: Load Balancing and Efficiency
- Huawei has adopted a large-scale expert parallelism (large EP) strategy, raising inference efficiency nearly 20-fold over the past two months [17].
- The company has developed dynamic priority adjustment and communication optimization strategies to address load-balancing challenges in expert parallelism [20].
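The OmniPlacement algorithm itself is not described in the summary beyond "dynamically adapting to expert activation data," so the sketch below illustrates only that general idea with a generic greedy balancer (the longest-processing-time heuristic): redistribute MoE experts across devices according to observed activation counts. The activation data and the heuristic are illustrative assumptions, not Huawei's algorithm.

```python
# Greedy expert placement by observed activation load (illustrative only;
# this is a generic balancing heuristic, not Huawei's OmniPlacement).
import heapq

def place_experts(activations: dict[int, int], num_devices: int) -> list[list[int]]:
    """Assign each expert to the currently least-loaded device,
    visiting experts from hottest to coldest (LPT heuristic)."""
    heap = [(0, d) for d in range(num_devices)]  # (accumulated load, device id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_devices)]
    for expert, load in sorted(activations.items(), key=lambda kv: -kv[1]):
        dev_load, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (dev_load + load, dev))
    return placement

# Hypothetical activation counts for 8 experts, heavily skewed
# (as MoE routing often is in practice).
activations = {0: 900, 1: 850, 2: 120, 3: 110, 4: 100, 5: 95, 6: 90, 7: 80}
for dev, experts in enumerate(place_experts(activations, num_devices=4)):
    load = sum(activations[e] for e in experts)
    print(f"device {dev}: experts {experts}, load {load}")
```

Without such rebalancing, the two hot experts would pile onto whichever devices static placement gave them, leaving the rest idle; evening out the per-device load is what converts expert parallelism into actual throughput gains.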