Core Viewpoint
- The MTT S5000, developed by Moore Threads, is positioned as a competitive GPU for large-model training and inference, with performance reported to rival international flagship products, marking a significant advance in domestic computing power [1][3].

Group 1: MTT S5000 Performance
- The MTT S5000 delivers 1000 TFLOPS of single-card AI compute with liquid cooling (920 TFLOPS with air cooling), along with 80 GB of memory and 1.6 TB/s of memory bandwidth [4] (a roofline sketch of what these numbers imply follows below).
- In certain multimodal large-model fine-tuning tasks, the S5000 has reportedly matched or even exceeded NVIDIA's H100 [4][6].
- The card uses the fourth-generation MUSA architecture, optimized for large-scale AI training, and supports the full range of precisions from FP8 to FP64 [6].

Group 2: Cluster Performance
- The KUAE 10,000-card ("Kua'e Wan-Ka") cluster built on the S5000 reaches 10 ExaFLOPS of aggregate floating-point throughput, with an MFU (model FLOPs utilization) of 60% in dense-model training and around 40% in MoE training, while maintaining over 90% effective training time [8] (see the composition sketch below).
- The S5000 employs Moore Threads' ACE technology for communication tasks, enabling zero-conflict parallel computing and significantly improving model compute utilization [10].

Group 3: Training and Inference Cases
- In January 2026, the Zhiyuan Research Institute (BAAI) completed end-to-end training and alignment verification of the RoboBrain 2.5 model on a thousand-card S5000 cluster, achieving a training-loss difference of only 0.62% relative to an NVIDIA H100 cluster [10].
- In December 2025, Moore Threads, in collaboration with SiliconFlow (硅基流动), benchmarked the DeepSeek-V3 671B model on the S5000, reporting record inference throughput of over 4000 tokens/s in the Prefill phase and over 1000 tokens/s in the Decode phase [12] (see the latency sketch below).
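To put the Group 1 specs in context, here is a minimal roofline sketch using only the three published numbers (1000 TFLOPS, 80 GB, 1.6 TB/s). One assumption is flagged loudly: the article does not state which precision the 1000 TFLOPS figure refers to, so this treats it simply as the card's peak tensor throughput.

```python
# Roofline arithmetic for the MTT S5000's published specs.
# ASSUMPTION: 1000 TFLOPS is peak tensor throughput; the article does not
# state the precision behind this figure, so this sketch is illustrative.

peak_flops = 1000e12   # 1000 TFLOPS (liquid-cooled single card)
mem_bw     = 1.6e12    # 1.6 TB/s memory bandwidth
mem_cap    = 80e9      # 80 GB on-card memory

# Roofline balance point: the arithmetic intensity (FLOPs per byte read
# from memory) above which a kernel becomes compute-bound rather than
# bandwidth-bound on this card.
balance = peak_flops / mem_bw
print(f"Balance point: {balance:.0f} FLOPs/byte")   # -> 625
print(f"Capacity: {mem_cap / 1e9:.0f} GB on card")

# Single-token decode GEMVs sit near 1-2 FLOPs/byte, far below 625, so
# decode-style workloads are paced by the 1.6 TB/s bandwidth rather than
# the 1000 TFLOPS peak; large prefill GEMMs can approach the compute roof.
```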
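The Group 2 cluster figures compose multiplicatively: delivered training compute is peak FLOPS times MFU times the effective-training-time fraction. Below is a minimal sketch with the quoted numbers; treating the three factors as independent multipliers is a simplifying assumption, not something the article states.

```python
# Composing the quoted KUAE cluster figures into delivered compute.
# ASSUMPTION: MFU and effective-training-time multiply independently.

peak_eflops = 10.0    # 10 ExaFLOPS aggregate peak (quoted)
goodput     = 0.90    # >90% effective training time (quoted lower bound)

for name, mfu in [("dense", 0.60), ("MoE", 0.40)]:
    delivered = peak_eflops * mfu * goodput
    print(f"{name}: ~{delivered:.1f} EFLOPS of useful training compute")
# dense: ~5.4 EFLOPS; MoE: ~3.6 EFLOPS
```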
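As a sanity check on the Group 3 inference numbers, the sketch below converts the quoted DeepSeek-V3 671B throughputs into per-token service times. The article does not say how many S5000 cards or what batch size produced the 4000/1000 tokens/s figures, so these should be read as aggregate rates for an unspecified test setup, not per-card numbers.

```python
# Per-token service times implied by the quoted DeepSeek-V3 671B results.
# ASSUMPTION: the figures are aggregate tokens/s for an unspecified
# multi-card test setup and batch size.

prefill_tps = 4000.0   # >4000 tokens/s, Prefill (prompt processing)
decode_tps  = 1000.0   # >1000 tokens/s, Decode (token generation)

print(f"Prefill: {1e3 / prefill_tps:.2f} ms per prompt token")     # 0.25 ms
print(f"Decode:  {1e3 / decode_tps:.2f} ms per generated token")   # 1.00 ms

# The 4:1 prefill/decode gap matches the roofline sketch above: prefill
# runs large compute-bound GEMMs, while decode re-reads weights for every
# generated token and is paced by memory bandwidth.
```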
Source: "Let's Talk About the Newly Revealed Specs of the Moore Threads S5000" (聊一聊刚刚曝光参数的摩尔线程S5000)
傅里叶的猫·2026-02-14 15:13