Domestic GPU runs full-scale DeepSeek at 100 tokens/s!
量子位· 2025-07-26 09:01
Core Viewpoint
- The fastest chip currently running the full-scale DeepSeek model is a domestic GPU from Moore Threads, reaching 100 tokens/s, significantly faster than foreign GPUs at roughly 50 tokens/s and other domestic chips at 15 tokens/s [1][4].

Group 1: Moore Threads' Achievements
- Moore Threads has built an "AI super factory" that goes beyond making faster chips, aiming at a comprehensive transformation of the entire technology stack [6][10].
- The AI super factory is not a physical chip fab but a systemic overhaul spanning innovations in chip architecture, cluster design, and software algorithms [9][10].

Group 2: Key Components of the AI Super Factory
- The AI super factory's production efficiency is defined by five core elements: generality of accelerated computing, effective chip performance, node efficiency, cluster efficiency, and cluster stability [13].
- A full-function GPU serves as the factory's foundation, having evolved from basic graphics acceleration into a versatile computing platform capable of handling diverse AI tasks [14][16].

Group 3: MUSA Architecture
- The MUSA architecture acts as the super factory's "chief designer," enabling scalable, configurable chip designs that optimize resource allocation [25][26].
- MUSA's design supports global resource sharing, reducing bottlenecks and improving efficiency under multi-task workloads [27][29].

Group 4: Full-Stack Software System
- Moore Threads has created a full-stack software system that integrates deeply with the MUSA hardware architecture, improving developer experience and operational efficiency [35][36].
- The software stack includes optimized drivers, core operator libraries, and performance-analysis tools, significantly improving task handling and resource utilization [41][42].
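The five efficiency elements above compound: a loss at any layer (chip, node, cluster, stability) shrinks the cluster's usable output. A toy model can make this concrete, assuming the factors combine multiplicatively; this simplification and all names below are our illustration, not a formula from the article or from Moore Threads.

```python
# Toy model: usable throughput as a product of per-layer efficiency factors.
# Treating the factors as multiplicative is our simplifying assumption.

def effective_throughput(peak_tokens_per_s: float,
                         chip_efficiency: float,
                         node_efficiency: float,
                         cluster_efficiency: float,
                         stability: float) -> float:
    """Tokens/s left after each layer of the stack takes its cut."""
    for factor in (chip_efficiency, node_efficiency,
                   cluster_efficiency, stability):
        peak_tokens_per_s *= factor
    return peak_tokens_per_s

# Hypothetical numbers: a 200 tokens/s peak chip delivers ~100 tokens/s
# end to end once every layer's inefficiency is accounted for.
print(effective_throughput(200.0, 0.9, 0.85, 0.8, 0.82))
```

The point of the multiplicative form is that no single factor can compensate for the others, which is why the article frames the super factory as a whole-stack problem rather than a chip problem.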
Group 5: KUAE Computing Cluster
- The KUAE computing cluster is a hardware-software integrated system that extends the performance advantages of individual GPUs to large-scale deployments, enabling efficient training of massive AI models [43][44].
- The cluster supports multiple parallel training strategies and provides end-to-end training optimization, ensuring high performance and stability [45][46].

Group 6: Zero-Interrupt Fault Tolerance Technology
- Moore Threads has developed a zero-interrupt fault tolerance technology that keeps the AI super factory running continuously, minimizing downtime and recovery costs [47][49].
- This technology improves the overall stability and reliability of the system, sustaining a high effective training time and reducing the impact of failures [51][52].

Group 7: Future of AI and Computing Needs
- Demand for computing power is expected to grow exponentially, driven by advances in generative AI and the need to execute increasingly complex tasks [54][56].
- Moore Threads aims to provide a comprehensive solution to the challenges of AI model training, emphasizing stability, reliability, and efficiency in future computing [58][61].
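The fault-tolerance idea in Group 6 can be illustrated with a generic checkpoint-and-resume loop: after a fault, training restarts from the last saved step rather than from zero. This is only a sketch of the general technique; Moore Threads' actual zero-interrupt mechanism is not described in detail here, and every name below (`save_checkpoint`, `train`, `fail_at`) is hypothetical.

```python
# Generic checkpoint-based recovery for a training loop (illustrative only).
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Persist the current step and training state."""
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

def train(total_steps, ckpt_path, fail_at=None):
    step, state = load_checkpoint(ckpt_path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["loss"] = 1.0 / (step + 1)   # stand-in for real training work
        step += 1
        save_checkpoint(ckpt_path, step, state)
    return step, state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(10, path, fail_at=7)   # crashes partway through...
except RuntimeError:
    pass
step, state = train(10, path)    # ...then resumes from step 7, not step 0
print(step)
```

Real large-scale systems refine this pattern (asynchronous or in-memory checkpoints, spare-node failover) precisely to keep effective training time high, which is the metric Group 6 emphasizes.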