Large Model Training and Inference
A Look at the Moore Threads S5000, Whose Specs Were Just Revealed
傅里叶的猫 · 2026-02-14 15:13
Core Viewpoint
- The MTT S5000, developed by Moore Threads, is positioned as a competitive GPU for large model training and inference, with performance that rivals international flagship products, marking a significant advance in domestic computing power [1][3].

Group 1: MTT S5000 Performance
- The MTT S5000 delivers 1,000 TFLOPS of single-card AI compute with liquid cooling and 920 TFLOPS with air cooling, alongside 80 GB of memory and 1.6 TB/s of memory bandwidth [4].
- In certain multimodal large-model fine-tuning tasks, the S5000 has been reported to match or even exceed NVIDIA's H100 [4][6].
- The chip uses the fourth-generation MUSA architecture, optimized for large-scale AI training, and supports full-precision computation from FP8 to FP64 [6].

Group 2: Cluster Performance
- The 10,000-card Kua'e (KUAE) cluster built on the S5000 reaches 10 ExaFLOPS of floating-point compute, with an MFU of 60% in dense-model training and around 40% for MoE models, while maintaining over 90% effective training time [8]; a back-of-the-envelope sketch of this arithmetic follows this summary.
- The S5000 uses dedicated ACE technology for communication tasks, enabling zero-conflict parallel computing and significantly raising model compute utilization [10].

Group 3: Training and Inference Cases
- In January 2026, the Zhiyuan Research Institute completed end-to-end training and alignment verification of the RoboBrain 2.5 model on a thousand-card S5000 cluster, with a training-loss difference of only 0.62% relative to an NVIDIA H100 cluster [10].
- In December 2025, Moore Threads, in collaboration with SiliconFlow, ran performance tests of the DeepSeek-V3 671B model on the S5000, achieving record inference throughput of over 4,000 tokens/s in Prefill and over 1,000 tokens/s in Decode [12].
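The cluster figures above can be sanity-checked with simple arithmetic: 10,000 cards at roughly 1 PFLOPS each give the quoted 10 ExaFLOPS, and MFU is just achieved training FLOPs divided by that peak. The Python sketch below uses the common 6·N FLOPs-per-token estimate for dense Transformer training; only the per-card peak and cluster size come from the article, while the model size and per-GPU throughput are hypothetical values chosen to illustrate the calculation.

```python
# Back-of-the-envelope MFU (Model FLOPs Utilization) check.
# Only the 1000 TFLOPS per-card peak and the 10,000-card cluster size come from
# the article; the model size and throughput below are illustrative assumptions.

PEAK_PER_CARD = 1000e12          # 1000 TFLOPS peak AI compute (liquid-cooled S5000)
CARDS = 10_000                   # "10,000-card" (wan ka) cluster
print(f"cluster peak ≈ {PEAK_PER_CARD * CARDS / 1e18:.0f} EFLOPS")  # ≈ 10 EFLOPS

def mfu(tokens_per_sec_per_gpu: float, params: float, peak_flops: float) -> float:
    """Common estimate for dense Transformer training:
    roughly 6 FLOPs per parameter per token (forward + backward)."""
    achieved = 6 * params * tokens_per_sec_per_gpu
    return achieved / peak_flops

# Hypothetical 70B-parameter dense model at an assumed ~1,430 tokens/s per GPU:
print(f"MFU ≈ {mfu(1_430, 70e9, PEAK_PER_CARD):.0%}")  # ≈ 60%, the quoted dense-model MFU
```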
Unmasking the "Supernode": Without Unified Memory Addressing, It Is Still Just Server Stacking
36Kr · 2026-02-02 08:05
Core Insights
- The AI industry is shifting from merely stacking server hardware to system-level competition, with the focus moving to underlying computing architectures as trillion-parameter multimodal models become the norm [1].

Group 1: The Rise of Supernodes
- "Supernodes" have emerged as a new trend in the computing industry, with more than ten domestic companies launching their own versions, although many are merely repackaged traditional server stacks [2].
- The concept is often misrepresented: many products do not achieve the critical technical requirement of unified memory addressing, raising doubts about their authenticity [2].

Group 2: Communication Barriers
- The need for supernodes arises from the "communication wall," which limits computational efficiency in large-model training as communication frequency and latency grow [3].
- Three main barriers are identified: the communication wall, the power-and-cooling wall, and the complexity wall, all of which hold back traditional cluster architectures at large-model scale [3].

Group 3: Technical Challenges
- Traditional cluster architecture follows a "storage-compute separation" principle, so data moving between GPUs incurs significant delays, which is inefficient for large-model training [6][10].
- Each transfer involves multiple staged steps that add latency, making the approach ill-suited to the high-frequency synchronization that large-model training requires [10].

Group 4: Unified Memory Addressing
- Unified memory addressing is crucial for breaking the communication wall: it creates a global virtual address space in which all memory resources are accessible without the overhead of traditional staged transfers [12]; a rough latency model contrasting the two access patterns follows this summary.
- Achieving it requires advances in communication protocols and cache coherence, which many so-called "supernodes" currently lack [13][19].

Group 5: Value of Supernodes
- In model training, unified memory addressing has delivered tangible benefits, enabling better memory management and higher utilization of compute resources [20][23].
- In model inference, it allows global pooling of key-value caches, substantially improving throughput [26].
- For recommendation systems, it reduces communication delays and improves efficiency by allowing direct memory access across nodes [30].

Group 6: Conclusion
- Competition in AI infrastructure has evolved from simple hardware stacking to architectural design, with unified memory addressing a key capability for next-generation computing paradigms [31].
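To make the "communication wall" argument concrete, the sketch below models the per-step latency cost of the staged copy pipeline (GPU to host, over the network, then host to GPU) versus a single load/store into a globally addressed memory space. All latency values and the synchronization count are illustrative assumptions, not measurements of any particular supernode; the point is only that per-hop latency multiplied by synchronization frequency is what a unified address space removes.

```python
# Rough latency model for the communication wall described above.
# Every number here is an assumption for illustration, not a measured value.

HOPS_TRADITIONAL = [                        # "storage-compute separation" path
    ("remote GPU HBM -> remote host DRAM", 8e-6),
    ("host -> NIC -> network -> NIC -> host", 20e-6),
    ("local host DRAM -> local GPU HBM", 8e-6),
]
UNIFIED_LOAD_STORE = 2e-6                   # one direct access over a scale-up fabric

def per_step_latency(sync_events: int, hop_latencies) -> float:
    """Latency spent just initiating transfers per training step
    (bandwidth is ignored, since it affects both designs)."""
    return sync_events * sum(lat for _, lat in hop_latencies)

SYNCS_PER_STEP = 10_000  # assumed gradient/activation exchanges at trillion-parameter scale
staged = per_step_latency(SYNCS_PER_STEP, HOPS_TRADITIONAL)
pooled = SYNCS_PER_STEP * UNIFIED_LOAD_STORE
print(f"staged copy pipeline:  {staged * 1e3:.0f} ms of pure latency per step")
print(f"unified address space: {pooled * 1e3:.0f} ms per step")
```

Under these assumed numbers the staged path spends roughly 360 ms per step on transfer setup alone versus about 20 ms for direct load/store, which is why the article treats unified memory addressing, rather than raw card count, as the defining trait of a real supernode.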
Over 2,300x Oversubscribed! Biren Technology's IPO Ignites Hong Kong Stocks, Setting a One-Year Record for Retail Subscriptions
Ge Long Hui· 2026-01-01 04:08
Core Viewpoint
- Biren Technology (6082.HK) has priced its IPO at HKD 19.60 per share, raising approximately HKD 5.583 billion and making it the largest fundraising since the implementation of Chapter 18C of Hong Kong's listing rules [1].

Group 1: IPO Details
- The IPO drew an overwhelming response, with 471,000 retail investors participating, the most subscribed new listing in the Hong Kong market over the past year [1].
- Biren Technology is set to list on January 2, 2026, becoming the first new stock on the Hong Kong Stock Exchange in 2026 [1].

Group 2: Fund Utilization
- Approximately 85% of the net proceeds will go to research and development, focused on next-generation product iteration and technological innovation [1].
- About 5% will fund commercial expansion, and 10% will be reserved for working capital and general corporate purposes, indicating a clear strategic plan for future growth [1].

Group 3: Product Development
- The next flagship chip, the BR20X, is planned for commercialization in 2026, with significant upgrades in compute, memory capacity, and interconnect bandwidth, along with enhanced native support for broader data formats such as FP8 and FP4 [1].
- Early development of the BR30X chip for cloud training and inference and the BR31X chip for edge inference has begun, with launches expected in 2028, pointing to a continuous product pipeline and further growth opportunities [1].