昇思MindSpore
The Secret to Huawei's Sanctions Breakthrough Lies in the "384 Super Node"
虎嗅APP· 2025-06-17 10:55
Core Viewpoint
- The article discusses the challenges and strategies in achieving breakthroughs in artificial intelligence (AI) technology, particularly through Huawei's "CloudMatrix 384 Super Node" computing cluster solution, which aims to overcome the limits of single-point technology through system-engineering innovation [1][3].

Group 1: Huawei's Technological Advancements
- Huawei's "CloudMatrix 384 Super Node" is built on 384 Ascend chips and delivers up to 300 PFLOPS of dense BF16 computing power, surpassing NVIDIA's GB200 NVL72 platform [3][4].
- The development of the "Super Node" reflects Huawei's foresight in addressing the diminishing returns of Moore's Law and the rising costs of semiconductor advances [4][9].
- The "Super Node" architecture features a fully interconnected high-speed bus, raising communication bandwidth 15-fold and significantly reducing latency [8][9].

Group 2: System Engineering Innovations
- Huawei's approach involves a comprehensive system-level redesign to address the challenges of large-scale model training, focusing on resource allocation and communication efficiency [5][10].
- Globally unified memory addressing allows direct memory access across nodes, improving the efficiency of parameter synchronization during model training [8][9].
- Resource scheduling has been upgraded to distribute tasks dynamically based on model structure, optimizing computation and communication time [8][10].

Group 3: Collaborative Ecosystem Development
- Huawei has mobilized a large team across departments to strengthen collaboration and innovation in AI infrastructure, showcasing a distinctive multi-industry cluster advantage [10][12].
- The company emphasizes ecosystem compatibility, ensuring that its Ascend architecture supports popular deep learning frameworks such as PyTorch and TensorFlow [12][13].
- Huawei's commitment to improving the usability of its AI frameworks, such as MindSpore, aims to ease the transition for developers accustomed to existing platforms [12][13].

Group 4: Future Prospects and Industry Impact
- The advances in Huawei's computing capabilities are positioned as a significant step for China's AI industry, potentially overcoming technological constraints and fostering innovation [12][13].
- Maturing the Ascend ecosystem is expected to take time, but work is under way to improve compatibility and developer support [12][13].
- Huawei's recent achievements in large-model training, including the Pangu Ultra MoE model, demonstrate that its domestic computing platform can produce world-class AI models [10][12].
From Open-Source Co-Construction to a Flourishing Ecosystem: 昇思MindSpore Supports Day 0 Migration and One-Click Deployment
财联社· 2025-06-12 10:59
Core Viewpoint
- The article emphasizes the rapid development of large models and the need for efficient migration and deployment solutions in the AI ecosystem, particularly through MindSpore, which aims to provide seamless integration and performance optimization for developers [1][2].

Group 1: Migration Challenges
- The first challenge is fast migration: enabling zero-cost migration of third-party framework models while keeping model accuracy fully aligned. MindSpore achieves this through a threefold compatibility approach, allowing zero-code migration of mainstream models and improving training performance by 5% while preserving distributed parallel strategies [4].
- The second challenge is rapid deployment: automating the entire training-to-inference pipeline so that deploying a large model is as simple as executing a single command [2].

Group 2: Training and Inference Solutions
- MindSpore supports Day 0 migration for training, providing imperceptible "intelligent translation" across frameworks. It works with tools such as MindSpeed/Megatron for seamless PyTorch model migration, achieving near-zero migration loss for popular models [4].
- For inference deployment, the vLLM-MindSpore plugin allows HuggingFace models to be deployed in under 30 minutes, with an 80% reduction in weight-loading time for large models [5][6] (see the sketch after this summary).

Group 3: Open Source and Community Engagement
- Since going open source on March 28, 2020, MindSpore has fostered a vibrant developer community, with over 1.2 million downloads and contributions from more than 46,000 developers across 2,400 cities [7].
- The company promotes a collaborative ecosystem through community governance, providing free computing resources and knowledge sharing across more than 20 technical special interest groups (SIGs) [8].
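To make the deployment claim above concrete, here is a minimal, hedged sketch of serving a HuggingFace model through vLLM with the vLLM-MindSpore plugin. The plugin import name (`vllm_mindspore`) and the model ID are assumptions for illustration; the article only states that deployment completes in under 30 minutes, not this exact API.

```python
# Minimal sketch: serving a HuggingFace model via vLLM on MindSpore/Ascend.
# Assumption: the vLLM-MindSpore plugin is imported before vLLM so it can
# register the MindSpore backend (import name assumed for illustration).
import vllm_mindspore  # noqa: F401  (assumed plugin import; must precede vLLM)

from vllm import LLM, SamplingParams

# Hugging Face weights are loaded directly, with no offline format conversion,
# matching the "direct loading" behavior described in the summaries.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model ID

params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Summarize Day 0 model migration in one sentence."], params):
    print(out.outputs[0].text)
```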
Day 0 Migration and One-Click Deployment: 昇思MindSpore Crafts the "Coffee Mate" for Ascend
21世纪经济报道· 2025-06-12 10:17
Core Viewpoint
- The article emphasizes the rapid development of large models and the need for efficient migration and deployment solutions in the AI ecosystem, highlighting MindSpore's capabilities in facilitating these processes for developers [1][2].

Group 1: Migration Capabilities
- MindSpore supports Day 0 migration for training, enabling seamless cross-framework "intelligent translation" that allows zero-code migration of mainstream models with a performance improvement of over 5% in distributed training scenarios [3][4].
- The framework uses dynamic-graph compilation optimization to raise single-card training efficiency by 40%, and applies distributed intelligent tuning to remove training bottlenecks, achieving a linearity breakthrough of 96% [4] (see the JIT sketch after this summary).

Group 2: Deployment Efficiency
- MindSpore enables one-click deployment for inference, allowing model services to launch within minutes, with support for loading Hugging Face weights directly without format conversion [5].
- The deployment pipeline is optimized to cut weight-loading time by 80% for models with hundreds of billions of parameters and to bring graph-compilation delays down to the millisecond level [5].

Group 3: Open Source Ecosystem
- Since its inception in March 2020, MindSpore has built a robust open-source ecosystem, supporting over 50 mainstream large models, recording 12 million downloads, and engaging more than 46,000 developers across 130 countries [7][8].
- The company promotes community governance through a dual-driven model of a council and SIG groups, providing free computing resources and knowledge-sharing opportunities for developers [8].
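The "dynamic graph compilation optimization" cited above is internal MindSpore engineering, but the public mechanism it builds on can be illustrated with MindSpore 2.x's `ms.jit` decorator, which compiles an eager-mode (PyNative) function into a graph. A minimal sketch, with the network and shapes chosen arbitrarily:

```python
# Minimal sketch of PyNative (dynamic-graph) execution with JIT graph
# compilation applied to a hot function. Shapes and layers are illustrative.
import numpy as np
import mindspore as ms
from mindspore import Tensor, nn, ops

ms.set_context(mode=ms.PYNATIVE_MODE)  # eager-style execution by default

dense = nn.Dense(128, 64)

@ms.jit  # compile this function into a static graph for faster repeat calls
def forward(x):
    h = dense(x)
    return ops.relu(h)

x = Tensor(np.random.randn(32, 128).astype(np.float32))
y = forward(x)  # first call triggers compilation; later calls reuse the graph
print(y.shape)  # (32, 64)
```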
Day 0 Migration and One-Click Deployment: Huawei's Open-Source 昇思MindSpore Becomes the "Master Key" for Large Model Development
量子位· 2025-06-12 08:17
Core Viewpoint
- The consensus of the AI large-model era is that no single large model can dominate the market, which leaves developers struggling to navigate the many mainstream models and AI technologies [1][2].

Group 1: MindSpore Overview
- Huawei's open-source MindSpore lets developers try out and migrate mainstream state-of-the-art (SOTA) large models with minimal code changes, keeping precision and performance intact [3][4].
- The training-to-inference pipeline is fully automated, allowing over 20 mainstream large models to be deployed out of the box, with loading times for models with billions of parameters under 30 seconds [5][19].

Group 2: Migration and Deployment Features
- MindSpore's "translation tool", MSAdapter, enables seamless migration of code from other frameworks to MindSpore, with nearly zero loss during the transition [8][10] (see the sketch after this summary).
- The tool automatically converts over 95% of interfaces, keeping the developer experience close to that of the original framework [10].

Group 3: Technical Enhancements
- MindSpore applies several distinctive techniques to accelerate training and debugging, including multi-stage operator pipelining, JIT compilation for efficient code execution, and automatic strategy optimization, which improved performance by 9.5% in specific training scenarios [11][13][16].
- The code changes required to launch distributed tasks are minimal: Python script modifications amount to less than 1% and are automated through patch tools [14].

Group 4: Inference Deployment
- The vLLM-MindSpore plugin enables rapid deployment of HuggingFace models, reaching service readiness in under 30 minutes [18][23].
- For large models, MindSpore has restructured the inference pipeline, achieving a throughput of 1,020 tokens per second with latency below 100 ms for specific models [19].

Group 5: Performance Improvements
- Model weight loading time has been cut by 80%, with billion-parameter models loading in under 30 seconds and graph-compilation delays minimized to the millisecond range [23].
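To make the MSAdapter idea concrete, here is a minimal sketch of its import-swap usage pattern, in which PyTorch-style code is redirected onto MindSpore. The module paths (`msadapter.pytorch`) follow MSAdapter's public examples but should be treated as an assumption here, since the article itself shows no code.

```python
# Minimal sketch: PyTorch-style code running on MindSpore via MSAdapter.
# Assumption: MSAdapter exposes a PyTorch-compatible namespace under
# msadapter.pytorch, so migration is (mostly) an import swap.
import msadapter.pytorch as torch      # instead of: import torch
import msadapter.pytorch.nn as nn      # instead of: import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

net = TinyNet()
x = torch.randn(4, 128)    # same tensor API as PyTorch
print(net(x).shape)        # executes on MindSpore/Ascend underneath
```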
Day 0 Migration and One-Click Deployment: Huawei's Open-Source 昇思MindSpore Becomes the "Master Key" for Large Model Development
量子位· 2025-06-12 08:16
Core Viewpoint
- The consensus of the AI large-model era is that no single large model can dominate the market, which leaves developers struggling to navigate the many mainstream models and AI technologies [1][2].

Group 1: MindSpore Overview
- Huawei's open-source MindSpore gives developers a unified framework in which to work with mainstream large models [3].
- MindSpore enables "Day 0 migration" of large models with minimal code changes while preserving accuracy and performance [4].

Group 2: Migration and Deployment Features
- The inference pipeline is automated for one-click deployment, allowing over 20 mainstream large models to be used out of the box, with loading times for models with billions of parameters under 30 seconds [5][23].
- MindSpore's "translation tool", MSAdapter, enables seamless migration of code from other frameworks, with nearly zero loss in performance [8][10].

Group 3: Technical Enhancements
- MindSpore applies several distinctive techniques to accelerate training and debugging, including multi-stage pipelining, JIT compilation, and automatic strategy optimization, yielding performance gains of up to 9.5% [11][13][16].
- The code changes required to launch distributed tasks are minimal, with Python script modifications under 1% [14].

Group 4: Inference Deployment
- The vLLM-MindSpore plugin allows HuggingFace models to be deployed within half an hour, with significant reductions in loading time and latency [18][23].
- For large models such as Pangu Pro MoE 72B, deployment can reach a throughput of 1,020 tokens per second with latency under 100 ms [19].
Exclusive Playbook: How 昇思MindSpore Makes SOTA Models Migrate Fast and Align Accurately
雷峰网· 2025-06-12 08:16
Core Viewpoint
- The article highlights MindSpore's support for large-model training and deployment, focusing on seamless migration and efficient inference for developers in the AI ecosystem [2][3].

Group 1: Migration and Training Efficiency
- MindSpore enables "zero-cost" migration of third-party framework models, keeping model accuracy aligned while raising training performance by 5% under distributed parallel strategies [8] (a minimal distributed setup is sketched after this summary).
- The framework supports zero-code migration of PyTorch models, allowing training scripts to run directly and achieving near-zero migration loss for mainstream models such as DeepSeek and Pangu [8][9].
- MindSpore's technology architecture enables rapid migration and training-efficiency gains, addressing the challenges posed by fast-evolving model architectures [5][9].

Group 2: Inference Deployment
- MindSpore offers one-click model deployment, with HuggingFace models deployable in under 30 minutes using the vLLM-MindSpore plugin [11].
- The framework loads HuggingFace weights directly without format conversion, cutting weight-loading time by 80% for models with hundreds of billions of parameters and shortening service launch times [12].
- The deployment pipeline is designed to be agile, allowing model services to start within minutes [11][12].

Group 3: Open Source Ecosystem
- Since going open source on March 28, 2020, MindSpore has fostered a vibrant developer community, with over 1.2 million downloads and contributions from more than 46,000 developers across 130 countries [13].
- The framework drives innovation through features such as dynamic-graph compilation optimization, distributed intelligent tuning, and layer-wise precision alignment, improving training efficiency by 40% [14].
- MindSpore's community governance model includes a council and special interest groups (SIGs) that jointly define technical directions and share resources [15].
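The summaries repeatedly note that launching distributed training requires under 1% script changes. As a rough sketch of what a minimal MindSpore data-parallel setup looks like, based on MindSpore 2.x's public API; the `msrun` launcher flags shown in the trailing comment are assumptions for illustration:

```python
# train.py -- minimal sketch of MindSpore data-parallel initialization.
# Based on MindSpore 2.x's public API; treat the details as illustrative.
import numpy as np
import mindspore as ms
from mindspore import Tensor, nn
from mindspore.communication import init, get_rank

init()  # initialize the communication backend (HCCL on Ascend, NCCL on GPU)
ms.set_auto_parallel_context(parallel_mode="data_parallel",
                             gradients_mean=True)

net = nn.Dense(128, 10)
x = Tensor(np.random.randn(32, 128).astype(np.float32))
print(f"rank {get_rank()}: output shape {net(x).shape}")

# Launched with MindSpore's msrun tool, e.g. (flags assumed for illustration):
#   msrun --worker_num=8 --local_worker_num=8 train.py
```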
Exclusive Playbook: How 昇思MindSpore Makes SOTA Models Migrate Fast and Align Accurately
雷峰网· 2025-06-12 08:15
Core Viewpoint
- The article emphasizes the rapid evolution of large models and the need for efficient migration and deployment solutions in the AI development ecosystem, highlighting MindSpore's capabilities in facilitating these processes.

Group 1: Migration and Deployment Solutions
- MindSpore supports Day 0 migration for training, enabling seamless cross-framework model transfer with zero-code migration while preserving model accuracy, and achieving a 5% improvement in training performance under distributed parallel strategies [2][5].
- The deployment process is automated for quick model service start-up, with HuggingFace models deployable in under 30 minutes using the vLLM-MindSpore plugin [6][7].

Group 2: Ecosystem and Community Engagement
- Since going open source on March 28, 2020, MindSpore has fostered a vibrant developer community, with over 1.2 million downloads and contributions from more than 46,000 developers across 130 countries [8][9].
- The community-driven approach includes a governance model with a council and special interest groups (SIGs) that jointly define technical directions [9].

Group 3: Technical Innovations
- MindSpore employs advanced techniques such as multi-level pipelining and just-in-time (JIT) compilation, yielding a 40% increase in single-card training efficiency [10].
- The platform also provides automated load-balancing tools to counter the "bottleneck effect" in large-scale training, achieving over 96% scaling linearity [10].
Huawei's Ascend 10,000-Card Cluster Revealed: How to Tame the AI Compute "Behemoth"?
雷峰网· 2025-06-09 13:37
Core Viewpoint
- The article discusses advances in AI computing clusters, focusing on Huawei's innovations for high availability, linear scalability, rapid recovery, and fault tolerance in large-scale AI model training and inference systems [3][25].

Group 1: High Availability of Super Nodes
- AI training and inference must run continuously, much like an emergency room: each computer in the cluster has a backup ready to take over on failure, so tasks proceed uninterrupted [5][6].
- Huawei's CloudMatrix 384 super node employs a fault-tolerance strategy spanning system-level, business-level, and operational-level fault management to turn faults into manageable events [5][6].

Group 2: Linear Scalability
- Ideally, computing power scales linearly: 100 computers should deliver 100 times the power of one. Huawei's task-distribution algorithms keep computers cooperating efficiently, so performance keeps rising as machines are added [8].
- Key technologies such as TACO, NSF, NB, and AICT have been developed to improve the linearity of large-model training, achieving linearity rates of 96% and above across various configurations [8].

Group 3: Rapid Recovery of Training
- The system recovers quickly from failures during training by automatically saving progress, resuming from the last checkpoint rather than starting over [10][12] (see the checkpoint sketch after this summary).
- Innovations such as process-level rescheduling and online recovery have cut recovery times to under 3 minutes, and to as little as 30 seconds in some cases [12].

Group 4: Fault Tolerance in MoE Model Inference
- The article outlines a three-tier fault-tolerance strategy for large-scale MoE model inference that minimizes user impact during hardware failures [14][15].
- Techniques such as instance-level rapid restart and token-level retries have cut recovery times from 20 minutes to as little as 5 minutes [15].

Group 5: Fault Management and Diagnostic Capabilities
- A real-time monitoring system continuously checks the health of every computer in the cluster, enabling quick identification and resolution of issues [16].
- Huawei's comprehensive fault-management solution includes error detection, isolation, and recovery capabilities, improving cluster reliability [16].

Group 6: Simulation and Modeling
- Before training complex AI models, the cluster can simulate various scenarios in a virtual environment to surface potential bottlenecks and optimize performance [19][20].
- A Markov modeling simulation platform enables efficient resource allocation and performance tuning, improving throughput and reducing communication delays [20][21].

Group 7: Framework Migration
- Huawei's MindSpore framework has evolved rapidly since going open source, offering tools for seamless migration from other frameworks and improving execution efficiency [23].
- The framework supports one-click deployment of large models, significantly improving inference performance [23].

Group 8: Future Outlook
- The article concludes that computing infrastructure will evolve along a collaborative path between algorithms, computing power, and engineering capability, potentially forming a closed loop of innovation driven by application demand [25].
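To illustrate the checkpoint-based recovery idea in Group 3, here is a minimal sketch using MindSpore's standard checkpoint API (`save_checkpoint` / `load_checkpoint` / `load_param_into_net`). The file name and the loop are placeholders; the article does not describe Huawei's internal recovery implementation, which operates at the process-scheduling level rather than in a simple script like this.

```python
# Minimal sketch of checkpoint-based resume with MindSpore's public API.
# Huawei's cluster-level recovery (process rescheduling, online recovery) is
# far more elaborate; this only shows the basic save/restore mechanism.
import os
import mindspore as ms
from mindspore import nn

net = nn.Dense(128, 10)
optimizer = nn.SGD(net.trainable_params(), learning_rate=0.01)

CKPT = "train_state.ckpt"  # placeholder path

if os.path.exists(CKPT):
    # Resume: restore parameters from the last checkpoint instead of restarting.
    param_dict = ms.load_checkpoint(CKPT)
    ms.load_param_into_net(net, param_dict)

for step in range(1000):
    # ... one training step would go here ...
    if step % 100 == 0:
        ms.save_checkpoint(net, CKPT)  # periodically persist progress
```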
How Does Huawei Tame the AI Compute "Behemoth"?
虎嗅APP· 2025-06-09 12:54
HUAWEI X HUXIU

On the road to artificial general intelligence (AGI), how to overtake on the curve, as has happened in other fields, is a topic the industry cannot avoid.

Over the past decade or more, individual point technologies have evolved rapidly. But as the marginal returns of single-point advances diminish and system complexity grows, the ceiling on system performance has gradually shifted from the limits of single-point technology to the limits of system engineering: single-point advantages increasingly resemble exquisitely crafted components with limited room for improvement, while system-engineering innovation, in which every part fits together and cooperates efficiently to optimize the performance of the system as a whole, carries more practical significance.

How can the strengths of single-point technologies be preserved while the path is rebuilt from a holistic perspective, finding new possibilities for breakthrough through extreme control and reorganization of complex systems? Solving this seemingly impossible problem could create the conditions for independently leading the most cutting-edge technological development.

In the coming period, Huxiu will publish the "Huawei Technology Disclosure Series", a set of technical reports that for the first time comprehensively detail the relevant technical specifics, providing a reference for the industry.

We hope that through this series we can join hands with more partners to build an open, collaborative ecosystem and help the Ascend ecosystem flourish in China.

"Huawei Technology Disclosure Series" VOL.13: 10,000-Card Clusters

Have you noticed that today's AI is getting "smarter"? It can write novels, translate, and even help doctors read CT scans. Behind these capabilities is a quietly working "super brain factory": the AI computing cluster ...
Keeping the Compute "Aircraft Carrier" Sailing Steadily: Huawei Discloses for the First Time the Ballast of Its Ascend Computing Infrastructure
21世纪经济报道· 2025-06-09 12:08
Core Viewpoint
- The article discusses advances in AI computing clusters, emphasizing their critical role in expanding AI model capabilities through engineering innovation and fault-tolerance mechanisms [1].

Group 1: Supernode High Availability
- AI training and inference must run continuously; each computer in the cluster has a backup so that tasks continue seamlessly through failures [1].
- Huawei's fault-tolerance solutions span system-level, business-level, and operational-level strategies for handling faults gracefully [1].

Group 2: Cluster Linearity
- The ideal for computing clusters is linear scalability, with performance increasing in proportion to the number of computers [1].
- Huawei uses advanced task-allocation algorithms and related technologies to achieve high linearity in model training, reporting linearity of 96% across various configurations [1].

Group 3: Rapid Recovery in Large-Scale Training
- The system automatically saves training progress, enabling quick recovery from failures without restarting from scratch [1].
- Innovations include process-level rescheduling and online recovery techniques that cut recovery times to under 3 minutes [1].

Group 4: Large-Scale MoE Model Inference Recovery
- The article outlines a three-tier fault-tolerance strategy for large-scale MoE model inference that minimizes user impact during hardware failures [1].
- Techniques such as rapid instance restart and token-level retries have been validated to cut recovery times substantially [1].

Group 5: Fault Management and Diagnostic Awareness
- A real-time monitoring system continuously tracks the health of every computer in the cluster, enabling fast fault detection and diagnosis [1].
- Huawei's comprehensive fault-management solutions improve reliability through advanced diagnostics and proactive maintenance strategies [1].

Group 6: Simulation Modeling
- A Markov modeling simulation platform allows AI workloads to be pre-tested in a virtual environment, surfacing potential bottlenecks before real-world deployment [1].
- This approach optimizes resource allocation and improves the overall efficiency of the computing cluster [1].

Group 7: Framework Migration
- Huawei's MindSpore framework integrates with mainstream ecosystems, easing large-model deployment and improving inference performance [1].
- The framework includes tools for adapting third-party frameworks, ensuring compatibility and efficiency in AI model training and inference [1].