Core Viewpoint
- Nvidia is pushing aggressively into open-source models with Nemotron 3, billed as the "most efficient open model family," built on a hybrid Mamba-Transformer MoE architecture and trained in the NVFP4 low-precision format [1][22].

Group 1: Model Architecture and Efficiency
- Nemotron 3 combines Mamba and Transformer components to maximize inference efficiency (see the hybrid-stack sketch after these notes) [7].
- The architecture interleaves Mamba-2 layers with MoE layers in a distinctive arrangement and sharply reduces the number of self-attention layers [10].
- In a typical inference scenario with 8k input tokens and 16k output tokens, Nemotron 3 Nano 30B-A3B delivers 3.3 times the throughput of Qwen3-30B-A3B, and the advantage grows as the sequence length increases [12].
- The model remains strong on long-context tasks, scoring 68.2 on the RULER benchmark at a 1-million-token input length, versus 23.43 for Nemotron 2 Nano 12B [14].

Group 2: LatentMoE Architecture
- For the larger models, Nvidia introduces the LatentMoE architecture, which performs expert routing in a compressed latent space (see the LatentMoE sketch below) [15].
- LatentMoE targets the two main bottlenecks of deploying MoE layers, the expert weight loading that dominates low-latency scenarios and the inter-GPU communication that dominates high-throughput scenarios, and significantly reduces both costs [16][18].
- LatentMoE uses 512 experts with 22 activated per token, versus the standard MoE baseline's 128 experts with 6 activated, and achieves better results across a range of tasks [20].

Group 3: Training Innovations
- Nvidia trains in the NVFP4 format, which achieves a peak throughput three times that of FP8, and has successfully trained models on up to 250 trillion tokens with it [22].
- To keep training stable, a small set of layers is retained in high precision while most layers are quantized to NVFP4 (see the selective-quantization sketch below) [23].
- Post-training uses multi-environment reinforcement learning that covers a wide range of tasks simultaneously, which improves stability and avoids the problems commonly associated with phased training (see the environment-mixing sketch below) [24][26].

Group 4: Performance Metrics and Open Source
- Accuracy is consistent across downstream tasks, with NVFP4-trained models closely matching their BF16 counterparts [28].
- The entire post-training software stack is open-sourced under the Apache 2.0 license, including the NeMo-RL and NeMo-Gym repositories [32].
- Nemotron 3 supports cognitive budget control at inference time: users can cap the number of tokens spent on the thinking chain, balancing efficiency against accuracy (see the thinking-budget sketch below) [34].
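The hybrid design is easiest to picture as a stack in which most blocks use a Mamba-2-style sequence mixer, self-attention appears only occasionally, and each block ends in a feed-forward layer. The PyTorch sketch below illustrates that layer pattern under assumed proportions: `SSMMixer` is a toy gated recurrence standing in for Mamba-2, the `attn_every` ratio is illustrative, and the feed-forward here is dense (the routed-expert variant is sketched separately below); none of these choices reproduce Nemotron 3's actual layer counts or ordering.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSMMixer(nn.Module):
    """Toy stand-in for a Mamba-2 style mixer: a gated linear recurrence with O(1) state,
    which is what gives the hybrid stack its long-context throughput advantage."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)           # per-channel decay in (0, 1)
        state, outs = torch.zeros_like(u[:, 0]), []
        for t in range(u.size(1)):              # recurrent scan: constant memory per step
            state = a * state + (1 - a) * u[:, t]
            outs.append(state)
        return self.out_proj(torch.stack(outs, dim=1) * F.silu(gate))

class CausalAttention(nn.Module):
    """Standard causal self-attention, used in only a few positions of the stack."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        return self.attn(x, x, x, attn_mask=mask)[0]

class Block(nn.Module):
    """Pre-norm residual block: sequence mixer followed by a feed-forward layer
    (a plain FFN here; the routed-expert version is sketched separately below)."""
    def __init__(self, mixer: nn.Module, d_model: int):
        super().__init__()
        self.norm1, self.mixer = nn.LayerNorm(d_model), mixer
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.ffn(self.norm2(x))

def build_hybrid_stack(d_model=256, n_layers=12, attn_every=6):
    """Assumed pattern: attention only every `attn_every`-th layer, Mamba-style mixers
    elsewhere. The real Nemotron 3 layer counts and ordering are not reproduced here."""
    return nn.Sequential(*[
        Block(CausalAttention(d_model) if (i + 1) % attn_every == 0 else SSMMixer(d_model), d_model)
        for i in range(n_layers)])

model = build_hybrid_stack()
print(model(torch.randn(2, 64, 256)).shape)    # torch.Size([2, 64, 256])
```

Because the recurrent mixer carries a fixed-size state instead of a growing KV cache, throughput at long sequence lengths degrades far more slowly than in an attention-only stack, which is the mechanism behind the reported long-context advantage.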
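Routing in a latent space, as the summary describes it, amounts to compressing the hidden state into a narrower latent dimension, running the router and many small experts there, and expanding back afterwards; because each expert's weights and activations live in the latent width, per-token weight loading and expert-parallel communication shrink even as the expert count grows. The following is a minimal sketch of that idea; the dimensions, expert count, and top-k are placeholder values (the summary's 512-expert / 22-active configuration would slot into `n_experts` and `top_k`), and the projection layout is an assumption rather than the published LatentMoE design.

```python
import torch
import torch.nn as nn

class LatentMoE(nn.Module):
    """Sketch of MoE routing in a compressed latent space (illustrative, not the real layer)."""
    def __init__(self, d_model=1024, d_latent=256, n_experts=32, top_k=4, d_expert_hidden=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)      # compress before routing
        self.router = nn.Linear(d_latent, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_latent, d_expert_hidden), nn.GELU(),
                          nn.Linear(d_expert_hidden, d_latent))
            for _ in range(n_experts)])
        self.up = nn.Linear(d_latent, d_model)         # expand after expert mixing
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        z = self.down(x)                               # latent tokens: (batch, seq, d_latent)
        probs = self.router(z).softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)  # each token picks its top-k experts
        mixed = torch.zeros_like(z)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e                 # tokens whose k-th choice is expert e
                if sel.any():
                    mixed[sel] += weights[..., k][sel].unsqueeze(-1) * expert(z[sel])
        return self.up(mixed)

layer = LatentMoE()
x = torch.randn(2, 16, 1024)
print(layer(x).shape)   # torch.Size([2, 16, 1024])
```

The design point is that expert size scales with `d_latent` rather than `d_model`, so a much larger expert pool with more activated experts per token can still cost less in weight traffic and expert-parallel exchange than a conventional MoE of the same quality.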
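The selective-precision recipe, most layers quantized to NVFP4 with a few sensitive layers kept in higher precision for stability, can be illustrated with fake quantization. The sketch below assumes NVFP4's general shape (4-bit E2M1 values with per-block scaling); the block size, the simulated rounding, and the name-based list of layers kept in high precision are assumptions for illustration, not Nvidia's actual training recipe or kernels.

```python
import torch
import torch.nn as nn

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 (E2M1) magnitudes

def fake_quant_nvfp4(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Round a weight tensor to the nearest FP4 (E2M1) value with per-block scaling.
    Simulation only: real NVFP4 training relies on hardware kernels and block scale factors."""
    grid = E2M1_GRID.to(w.device, w.dtype)
    flat = w.reshape(-1)
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / grid[-1]
    scaled = blocks / scale
    # snap each scaled magnitude to the nearest entry of the E2M1 grid, keep the sign
    dist = (scaled.abs().unsqueeze(-1) - grid).abs()
    q = grid[dist.argmin(dim=-1)] * scaled.sign()
    return (q * scale).reshape(-1)[: w.numel()].reshape(w.shape)

def quantize_most_layers(model: nn.Module, keep_high_precision=("embed", "lm_head")):
    """Fake-quantize every Linear layer except those whose name marks it as
    sensitivity-critical; those stay in high precision for training stability."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and not any(k in name for k in keep_high_precision):
            with torch.no_grad():
                module.weight.copy_(fake_quant_nvfp4(module.weight))

# toy usage: "embed" and "lm_head" are left untouched, the middle layer is quantized
model = nn.ModuleDict({
    "embed":   nn.Linear(64, 64),
    "mlp":     nn.Linear(64, 64),
    "lm_head": nn.Linear(64, 64),
})
quantize_most_layers(model)
```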
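The contrast between multi-environment RL and phased training comes down to how each batch is sampled: every batch draws from all task environments at once instead of exhausting one environment before moving to the next. The snippet below is a schematic of that sampling policy with made-up environment names and prompts; it does not use the NeMo-RL or NeMo-Gym APIs.

```python
import random

# hypothetical task environments; the real post-training draws on NeMo-Gym environments
ENVIRONMENTS = {
    "math":     ["Solve: 12 * 7 - 5", "Integrate x^2 over [0, 1]"],
    "coding":   ["Write a function that reverses a string", "Fix the off-by-one bug"],
    "tool_use": ["Look up the weather, then summarize it", "Call the calculator tool"],
    "chat":     ["Explain MoE routing to a beginner", "Draft a polite follow-up email"],
}

def sample_mixed_batch(batch_size: int, seed: int = 0):
    """Draw every RL batch from *all* environments at once (multi-environment RL),
    rather than training on one environment at a time (phased training)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        env = rng.choice(list(ENVIRONMENTS))        # uniform here; real mixing weights
        prompt = rng.choice(ENVIRONMENTS[env])      # would likely be tuned per task
        batch.append({"env": env, "prompt": prompt})
    return batch

for item in sample_mixed_batch(4):
    print(item["env"], "->", item["prompt"])
```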
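Cognitive budget control amounts to capping how many tokens the model may spend inside its reasoning span before it is forced to close the span and answer. The sketch below shows that control loop over a generic next-token callback; the `<think>` / `</think>` markers and the stub model are assumptions for illustration, not Nemotron 3's actual chat template or inference API.

```python
from typing import Callable, List

def generate_with_thinking_budget(
    next_token: Callable[[List[str]], str],   # returns the next token given the sequence so far
    prompt_tokens: List[str],
    thinking_budget: int,                     # max tokens allowed inside <think>...</think>
    max_new_tokens: int = 64,
) -> List[str]:
    """Cap the reasoning span: once `thinking_budget` tokens have been spent after
    `<think>`, force-append `</think>` so the model must produce the final answer."""
    seq = list(prompt_tokens)
    thinking, spent = False, 0
    for _ in range(max_new_tokens):
        tok = next_token(seq)
        if tok == "<think>":
            thinking, spent = True, 0
        elif tok == "</think>":
            thinking = False
        elif thinking:
            spent += 1
            if spent >= thinking_budget:      # budget exhausted: close the reasoning span
                seq += [tok, "</think>"]
                thinking = False
                continue
        seq.append(tok)
        if tok == "<eos>":
            break
    return seq

# stub "model" that would think indefinitely unless the budget cuts it off
def stub_next_token(seq: List[str]) -> str:
    if "<think>" not in seq:
        return "<think>"
    if "</think>" in seq:
        return "answer" if seq[-1] == "</think>" else "<eos>"
    return "step"

print(generate_with_thinking_budget(stub_next_token, ["question"], thinking_budget=3))
# ['question', '<think>', 'step', 'step', 'step', '</think>', 'answer', '<eos>']
```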
Nvidia becomes the benchmark for open-source large models in the US: Nemotron 3 even makes its training recipe public, with the full 10-trillion-token dataset released