Mixture of Experts (MoE)

Cursor builds an MXFP8 kernel from scratch for Blackwell: 3.5x faster MoE layers, 1.5x faster end-to-end training
机器之心· 2025-08-22 04:58
Core Insights
- The article discusses the challenges Cursor encountered when upgrading from NVIDIA's Hopper H100s to the new Blackwell B200s GPU architecture, highlighting inefficiencies in the MoE (Mixture of Experts) training layer that held back performance despite the hardware improvement [2][20].

Group 1: Performance Bottlenecks
- The upgrade to Blackwell B200s raised raw hardware performance, but actual training speed was dragged down by inefficiencies in the MoE layer, so the hardware gains were not realized in practice [2].
- Cursor's solution was to rewrite the MoE training layer from scratch at the GPU kernel level, which eliminated the bottlenecks and let the Blackwell architecture run at its full potential [2][21].

Group 2: Technical Innovations
- Cursor designed a data-flow pipeline around TMEM's new features to avoid unnecessary register-movement overhead, fusing quantization and dequantization logic into the kernel's computation to sharply reduce memory-bandwidth usage [3][9].
- The MXFP8 quantization scheme keeps precision intact while reaping the benefits of low-precision computation by scaling data block by block; a minimal sketch of this block-scaled quantization appears after this summary [11][24].

Group 3: Performance Metrics
- The MoE layer achieved a 3.5x speedup in both the forward and backward pass, and end-to-end training on Blackwell became 1.5x faster, for a total acceleration of 2x relative to the original Hopper GPU setup [2].
- FP8 Tensor Core throughput on Blackwell reaches 4,500 TFLOP/s while FP32 CUDA Core throughput is only 80 TFLOP/s, underscoring how costly any work left on the CUDA cores becomes [16].

Group 4: Optimization Strategies
- Cursor implemented a complex data pipeline using techniques such as warp specialization and the 2-CTA (Cooperative Thread Array) mode, enabling efficient parallel processing with reduced memory traffic and yielding a 15-20% performance improvement [22][23].
- The custom MXFP8 quantization kernel developed by Cursor sustains a memory bandwidth of over 6.2 TB/s, outperforming existing open-source tools [24][26].

Group 5: Training Efficiency
- The training-loss curves for MXFP8 and BF16 are nearly indistinguishable, indicating that the speedups did not come at the cost of accuracy [27][30].
- Quantization itself was identified as a major performance killer: the overhead of quantizing and dequantizing data consumed a large share of the computation time [17][18].
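As background for the MXFP8 points above, the sketch below illustrates block-scaled quantization in the spirit of the MX format: values are grouped into small blocks, each block shares one power-of-two scale, and individual values are rounded to an FP8-like grid. The block size of 32, the E4M3 value range (max ~448), and the rounding scheme are assumptions taken from the public OCP MX convention rather than details confirmed by the article, and the NumPy code is purely illustrative, not Cursor's CUDA kernel.

```python
import numpy as np

BLOCK = 32          # assumed MX block size
E4M3_MAX = 448.0    # largest magnitude representable in FP8 E4M3

def round_to_e4m3(v):
    # Rough emulation of FP8 E4M3 rounding: keep 3 mantissa bits,
    # ignore subnormals and NaN encodings.
    m, e = np.frexp(v)                 # v = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16          # 1 implicit + 3 explicit mantissa bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def quantize_mxfp8(x):
    """Split x into blocks of BLOCK values; each block stores FP8-like values
    plus one shared power-of-two scale (the E8M0 exponent in the MX format)."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)
    # One common scale choice: smallest power of two mapping amax into E4M3 range.
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    return round_to_e4m3(blocks / scale), scale

def dequantize_mxfp8(q, scale):
    return (q * scale).reshape(-1)

x = np.random.default_rng(0).normal(size=4096)
q, s = quantize_mxfp8(x)
x_hat = dequantize_mxfp8(q, s)
print("max relative error:", np.max(np.abs(x - x_hat) / np.maximum(np.abs(x), 1e-8)))
```

Per the article, Cursor fuses this kind of scale-and-round logic directly into the kernels' data pipeline instead of running it as separate passes, which is why quantization stops costing extra trips through global memory.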
Li Auto's VLA can be seen as analogous to DeepSeek's MoE
理想TOP2· 2025-06-08 04:24
Core Viewpoint
- The article discusses the advances in Li Auto's VLA (Vision-Language-Action) architecture and compares it with DeepSeek's MoE (Mixture of Experts), highlighting how both take an existing concept into a new domain through distinctive architectural and training innovations.

Group 1: VLA and MoE Comparison
- Both VLA and MoE are previously proposed concepts that are only now being fully realized in new domains, with significant innovation and positive outcomes [2]
- DeepSeek's MoE improves on traditional designs by increasing the number of specialized experts and raising parameter utilization through Fine-Grained Expert Segmentation and Shared Expert Isolation; a minimal sketch of this routing pattern follows at the end of this summary [2]

Group 2: Key Technical Challenges for VLA
- The VLA must address six critical technical points, including base-model design and training, 3D spatial understanding, and real-time inference capability [4]
- The VLA base model's design emphasizes sparsity so that parameter capacity can grow without significantly increasing inference load [6]

Group 3: Model Training and Efficiency
- The training process incorporates a large amount of 3D data and driving-related information while reducing the proportion of historical data [7]
- The model is designed to learn human thought processes, using both fast and slow reasoning to balance parameter scale against real-time performance [8]

Group 4: Diffusion and Trajectory Generation
- Diffusion is used to decode action tokens into driving trajectories, improving the model's ability to predict complex traffic scenarios [9]
- An ODE sampler accelerates the diffusion generation process, producing stable trajectories in just 2-3 steps [11]

Group 5: Reinforcement Learning and Model Training
- The system aims to surpass human driving capability through reinforcement learning, addressing earlier limitations in training environments and information transfer [12]
- The model is now end-to-end trainable, which strengthens its ability to generate realistic 3D environments for training [12]

Group 6: Positioning Against Competitors
- The company is no longer merely following Tesla in autonomous driving, particularly since the introduction of V12 marked a shift in its approach [13]
- The VLM (Vision Language Model) consists of a fast system and a slow system: the fast system is comparable to Tesla's, while the slow system is a distinctive approach shaped by resource constraints [14]

Group 7: Evolution of VLM to VLA
- The development from VLM to VLA is viewed as a natural evolution, indicating the company is innovating from its own insights rather than imitating competitors [15]
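The fine-grained routing and shared-expert isolation mentioned above can be made concrete with a small sketch: many narrow routed experts are selected per token by a top-k gate, while a few shared experts always run. All sizes, the top-k value, and the gating weights below are illustrative assumptions in plain NumPy, not DeepSeek's or Li Auto's actual configuration, and details such as gate-weight renormalization and load balancing are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 128                          # hidden size, per-expert FFN width (toy values)
N_ROUTED, N_SHARED, TOP_K = 16, 2, 4    # many small routed experts + isolated shared experts

def make_expert():
    return rng.normal(0, 0.02, (D, H)), rng.normal(0, 0.02, (H, D))

routed = [make_expert() for _ in range(N_ROUTED)]
shared = [make_expert() for _ in range(N_SHARED)]
W_gate = rng.normal(0, 0.02, (D, N_ROUTED))

def expert_ffn(x, weights):
    w1, w2 = weights
    return np.maximum(x @ w1, 0.0) @ w2             # ReLU FFN

def moe_layer(x):
    """x: (tokens, D). Shared experts always run; routed experts are top-k gated."""
    logits = x @ W_gate
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :TOP_K]   # chosen routed experts per token
    out = sum(expert_ffn(x, w) for w in shared)     # shared experts: no gating
    for t in range(x.shape[0]):                     # routed experts: sparse, per token
        for e in topk[t]:
            out[t] += probs[t, e] * expert_ffn(x[t:t+1], routed[e])[0]
    return out

tokens = rng.normal(size=(8, D))
print(moe_layer(tokens).shape)   # (8, 64): only TOP_K of N_ROUTED experts run per token
```

The point of the split is the one the article attributes to DeepSeek: slicing experts finely increases the number of expert combinations the router can express, while the always-on shared experts capture common knowledge so the routed experts can specialize.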
Linear-MoE: Linear Attention Meets Mixture of Experts in an Open-Source Implementation
机器之心· 2025-05-29 11:38
Core Insights
- The article highlights the rise of the Linear-MoE architecture, which combines linear sequence modeling with Mixture-of-Experts (MoE) for enhanced performance in large language models [1][10].

Group 1: Linear Sequence Modeling
- Linear sequence modeling has advanced significantly over the past two years and is characterized by linear time complexity in training and constant memory usage during inference [5].
- Its main families are Linear Attention, State Space Models (SSM), and Linear RNN, with notable works including Lightning Attention, GLA, Mamba2, and RWKV [5].

Group 2: Mixture-of-Experts (MoE)
- MoE has become an industry standard, adopted by models such as GPT-4 and Gemini as well as Chinese models like DeepSeek and Qwen [8].
- The article stresses MoE's importance for scaling model capability, though it does not explore this aspect in depth [8].

Group 3: Linear-MoE Architecture
- Linear-MoE provides a complete system from modeling to training, allowing flexible combinations of linear sequence modeling layers and MoE layers while remaining compatible with conventional Softmax Attention Transformer layers; a hybrid block of this kind is sketched below [10].
- Key features include a modular architecture supporting multiple linear modeling methods and MoE implementations, with stability and scalability ensured by the Megatron-Core framework [10].

Group 4: Performance and Future Prospects
- Large-scale experiments validate Linear-MoE's advantages, demonstrating inference 2-5x faster than traditional architectures and more than a 50% reduction in memory usage [12][13].
- Open-sourcing Linear-MoE fills a technical gap and provides reproducible training recipes, with future work planned on long-context understanding and Vision-Language model architectures [13].
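To make the layer composition concrete, here is a toy hybrid block in the spirit of Linear-MoE: a linear-attention token mixer in its recurrent form (linear time in sequence length, fixed-size state) followed by a sparse top-k MoE FFN. Dimensions, the feature map, and the gating scheme are illustrative assumptions in plain NumPy; the real project builds such blocks on Megatron-Core with the linear-modeling methods listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 32, 4, 2
Wq, Wk, Wv = (rng.normal(0, 0.1, (D, D)) for _ in range(3))
W_gate = rng.normal(0, 0.1, (D, N_EXPERTS))
experts = [(rng.normal(0, 0.1, (D, 4 * D)), rng.normal(0, 0.1, (4 * D, D)))
           for _ in range(N_EXPERTS)]

def phi(x):
    return np.maximum(x, 0.0) + 1e-6               # simple positive feature map

def linear_attention(x):
    """Causal linear attention in recurrent form: O(T) time, fixed-size state."""
    S = np.zeros((D, D))                            # running sum of outer(phi(k), v)
    z = np.zeros(D)                                 # running sum of phi(k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        q, k, v = phi(x[t] @ Wq), phi(x[t] @ Wk), x[t] @ Wv
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

def moe_ffn(x):
    """Top-k gated expert FFNs applied token by token."""
    probs = np.exp(x @ W_gate)
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in np.argsort(-probs[t])[:TOP_K]:
            w1, w2 = experts[e]
            out[t] += probs[t, e] * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out

def linear_moe_block(x):
    x = x + linear_attention(x)      # token mixing, linear in sequence length
    return x + moe_ffn(x)            # channel mixing via sparse experts

tokens = rng.normal(size=(16, D))
print(linear_moe_block(tokens).shape)   # (16, 32)
```

The recurrent state S stays the same size regardless of sequence length, which is where the constant-memory inference and the 2-5x speedup claims for linear architectures come from; swapping the token mixer for a Softmax Attention layer in some blocks yields the hybrid configurations the framework also supports.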