Ascend computing cluster - filings, earnings calls, financial reports, news

Huxiu APP · 2025-06-09 12:54

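HUAWEI X HUXIU

On the road to artificial general intelligence (AGI), how to overtake on the curve, as has happened in other fields, is a question the industry cannot avoid. Over the past decade and more, individual point technologies have advanced rapidly. But as the marginal returns on point-technology advances diminish and system complexity rises, the ceiling on system performance has gradually shifted from the limits of point technologies to the limits of systems engineering. Point advantages increasingly resemble exquisite parts with limited room for improvement, whereas systems-engineering innovation, in which every part fits together and cooperates efficiently so that the system as a whole reaches optimal effectiveness, has greater practical significance. How can point-technology advantages be exploited while the path is rebuilt from a holistic perspective, finding new room for breakthroughs through rigorous control and reorganization of complex systems? Solving this seemingly impossible problem could create the conditions for us to independently lead the development of the most advanced technologies.

Huxiu will soon launch the "Huawei Technology Disclosure Collection" series, which uses a set of technical reports to describe the relevant technical details comprehensively for the first time, as a reference for the industry. Through this series we hope to work with more partners to build an open, collaborative ecosystem and help the Ascend ecosystem flourish in China.

"Huawei Technology Disclosure Collection" series VOL.13: Ten-Thousand-Card Clusters

Have you noticed that AI is getting "smarter"? It can write novels, translate, and even help doctors read CT scans. Behind these capabilities is a quietly working "super brain factory": the AI computing ...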

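Keeping the Computing Aircraft Carrier on a Steady Course: Huawei Discloses the Ballast of Its Ascend Computing Infrastructure for the First Time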

21st Century Business Herald · 2025-06-09 12:08

Core Viewpoint
- The article discusses the advancements in AI computing clusters, emphasizing their critical role in enhancing the capabilities of AI models through innovative engineering solutions and fault tolerance mechanisms [1].

Group 1: Supernode High Availability
- AI training and inference require continuous operation, with each computer in the cluster having a backup to ensure seamless task execution during failures [1].
- Huawei's fault tolerance solutions include system-level, business-level, and operational-level strategies to manage faults gracefully [1].

Group 2: Cluster Linearity
- The ideal scenario for computing clusters is linear scalability, where the performance increases proportionally with the number of computers [1].
- Huawei employs advanced task allocation algorithms and technologies to achieve high linearity in model training, with results showing linearity rates of 96% for various configurations [1].

Group 3: Rapid Recovery in Large-Scale Training
- The system can automatically save training progress, allowing for quick recovery from failures without starting over [1].
- Innovations include process-level rescheduling and online recovery techniques that significantly reduce recovery times to under 3 minutes [1].

Group 4: Large-Scale MoE Model Inference Recovery
- The article outlines a three-tier fault tolerance strategy for large-scale MoE model inference, minimizing user impact during hardware failures [1].
- Techniques such as rapid instance restart and token-level retries have been validated to reduce recovery times significantly [1].

Group 5: Fault Management and Diagnostic Awareness
- A real-time monitoring system continuously tracks the health of each computer in the cluster, enabling quick fault detection and diagnosis [1].
- Huawei's comprehensive fault management solutions enhance reliability through advanced diagnostic capabilities and proactive maintenance strategies [1].
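
The "linearity" figure quoted under Group 2 is not defined in the summary. One common reading, assumed here, is scaling efficiency: measured throughput on the larger cluster divided by the throughput that perfect linear scaling from a smaller baseline would give. The sketch below uses that assumed definition; the function name and all numbers are hypothetical and are not taken from the article.

```python
def linearity(base_nodes, base_throughput, scaled_nodes, scaled_throughput):
    """Scaling efficiency under an assumed definition.

    Returns measured throughput divided by the ideal throughput that
    perfectly linear scaling from the baseline would deliver.
    """
    if base_nodes <= 0 or scaled_nodes <= 0 or base_throughput <= 0:
        raise ValueError("node counts and baseline throughput must be positive")
    # Ideal case: throughput grows in proportion to the node count.
    ideal_throughput = base_throughput * (scaled_nodes / base_nodes)
    return scaled_throughput / ideal_throughput


# Hypothetical measurements: 4x the nodes deliver 3.84x the throughput.
print(f"{linearity(1024, 1000.0, 4096, 3840.0):.2%}")  # prints 96.00%
```

Under this reading, a linearity of 96% means that quadrupling the cluster yields about 3.84 times the baseline throughput rather than the ideal 4 times.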
Group 6: Simulation Modeling
- The article introduces a Markov modeling simulation platform that allows for pre-testing of AI models in a virtual environment, identifying potential bottlenecks before real-world deployment [1].
- This approach optimizes resource allocation and enhances the overall efficiency of the computing cluster [1].

Group 7: Framework Migration
- Huawei's MindSpore framework supports seamless integration with mainstream ecosystems, facilitating the deployment of large models and improving inference performance [1].
- The framework includes tools for adapting third-party frameworks, ensuring compatibility and efficiency in AI model training and inference [1].
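
The summary does not describe how the Markov modeling platform in Group 6 works. As a rough illustration of the general idea only, the sketch below uses the simplest possible Markov model, a two-state chain (running, recovering), to show why the shorter recovery times mentioned in Group 3 matter. The function name, the 24-hour mean time between failures, and the 30-minute comparison point are assumptions for illustration and do not come from the article.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time spent in the 'running' state.

    Models the job as a two-state continuous-time Markov chain:
    running -> recovering at rate 1/MTBF, recovering -> running at rate 1/MTTR.
    """
    if mtbf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTBF and MTTR must be positive")
    failure_rate = 1.0 / mtbf_hours
    repair_rate = 1.0 / mttr_hours
    # Balance equation: pi_up * failure_rate == pi_down * repair_rate.
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical cluster that fails once a day on average.
slow = steady_state_availability(mtbf_hours=24.0, mttr_hours=0.5)   # 30-minute recovery
fast = steady_state_availability(mtbf_hours=24.0, mttr_hours=0.05)  # 3-minute recovery
print(f"30-minute recovery: {slow:.4%}, 3-minute recovery: {fast:.4%}")
```

A real simulation platform would model many more states (per-component failures, checkpoint intervals, network contention); this two-state chain only shows the basic relationship between recovery time and usable cluster time.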