NeurIPS 2025 | NVIDIA Releases Nemotron-Flash: Reshaping Small-Model Architecture Around GPU Latency

Core Insights
- The article examines the speed and performance limitations of small language models (SLMs), showing that a smaller model does not necessarily deliver lower latency or higher throughput when deployed on GPUs [2][9][10]
- NVIDIA's Nemotron-Flash addresses these issues by treating real GPU latency as a first-class design objective, achieving state-of-the-art accuracy while maintaining low latency and high throughput [2][21]

Group 1: Why Small Models Run Slowly
- Small models are often deep and narrow, and the frequent kernel launches this forces on GPUs increase latency, contradicting the expectation that smaller models are faster [9] (an illustrative latency sketch follows this summary)
- The attention mechanism remains a major bottleneck for throughput, and there has been no systematic method for deciding where full attention versus linear attention should be used across a model's layers [10]
- Training of small models often stagnates prematurely: weight magnitudes grow while effective gradient updates shrink, limiting the model's capacity to keep improving [10][11]

Group 2: Core Methodology of Nemotron-Flash
- The model optimizes the depth-width ratio, balancing the depth needed for expressiveness against the width that reduces latency, and identifies a "sweet spot" for the structure [14]
- It employs a hybrid operator structure that assigns clear roles to different operators so they complement one another, rather than simply substituting one for another [16] (see the hybrid-stack sketch after this summary)
- Weight normalization is applied during training to prevent structured outliers from forming in the weight matrices, sustaining learning and improving convergence quality [20] (see the normalization sketch after this summary)

Group 3: Performance of Nemotron-Flash
- Nemotron-Flash-1B improves accuracy over Qwen3-0.6B by 5.5%, with 1.9× lower inference latency and up to 45.6× higher maximum throughput [24]
- Nemotron-Flash-3B improves accuracy over Qwen2.5-3B and Qwen3-1.7B by 2% to 5.5%, with 1.3× to 1.7× lower latency and 6.4× to 18.7× higher throughput [24]
- The design enables scalable deployment across a range of applications, delivering reliable, low-latency experiences in high-demand scenarios such as online services and edge devices [25]

Conclusion
- The future of small models lies not in being smaller but in being faster, more stable, and stronger; Nemotron-Flash offers a new foundational logic for small-model design [27]
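The kernel-launch cost behind Group 1's first point is easy to observe in isolation. Below is a minimal, illustrative benchmark rather than anything from the paper: two MLP stacks with identical parameter counts and FLOPs, one deep and narrow, one shallow and wide, timed at batch size 1 where GPU inference tends to be launch-bound. All layer counts and widths are hypothetical.

```python
import time
import torch
import torch.nn as nn

def mlp_stack(depth: int, width: int) -> nn.Sequential:
    # Stack of `depth` square linear layers; parameter count is ~depth * width^2.
    return nn.Sequential(*[nn.Linear(width, width) for _ in range(depth)])

def time_forward(model: nn.Module, x: torch.Tensor, iters: int = 50) -> float:
    with torch.no_grad():
        for _ in range(5):           # warm-up: first calls pay one-time costs
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize() # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"

# Matched parameters and FLOPs: 64 * 512^2 == 16 * 1024^2; only the layer count differs.
deep_narrow = mlp_stack(depth=64, width=512).to(device)
shallow_wide = mlp_stack(depth=16, width=1024).to(device)

x_narrow = torch.randn(1, 512, device=device)   # batch size 1: latency-bound regime
x_wide = torch.randn(1, 1024, device=device)

print(f"deep-narrow  (64 x 512):  {time_forward(deep_narrow, x_narrow) * 1e3:.3f} ms/step")
print(f"shallow-wide (16 x 1024): {time_forward(shallow_wide, x_wide) * 1e3:.3f} ms/step")
```

On a GPU the deep-narrow stack typically shows noticeably higher per-step latency despite the matched FLOPs, which is the effect the article attributes to frequent kernel scheduling.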
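Group 2's hybrid operator structure can be pictured as a layer stack in which some positions use full softmax attention and others use a cheaper linear operator. The sketch below is a minimal illustration under assumptions: the `LinearAttention` module uses the standard (elu + 1) feature map, and the `"LLFLLFLL"` placement pattern is invented for the example; per the article, Nemotron-Flash determines the actual operator placement systematically rather than by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """O(n) attention via the (elu + 1) feature map; a stand-in for the linear
    operators the article alludes to (the exact operator is an assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)     # sum_n phi(k_n) v_n^T
        z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))

class FullAttention(nn.Module):
    """Standard softmax self-attention (quadratic in sequence length)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

def hybrid_stack(dim: int, pattern: str) -> nn.ModuleList:
    # 'F' = full attention, 'L' = linear attention; the pattern is hypothetical.
    table = {"F": FullAttention, "L": LinearAttention}
    return nn.ModuleList(table[c](dim) for c in pattern)

layers = hybrid_stack(dim=256, pattern="LLFLLFLL")  # mostly linear, a few full layers
x = torch.randn(2, 128, 256)
for layer in layers:
    x = x + layer(x)    # residual connection around each mixer
print(x.shape)          # torch.Size([2, 128, 256])
```

The design point the article makes is that the two operator types play complementary roles, so the question is where each belongs in the stack, not which one wins outright.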
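The weight-normalization point in Group 2 can be sketched as a post-step hook that keeps individual weight rows from drifting far above the rest of their matrix, one crude way to suppress structured outliers. Both the rule and the `max_ratio` threshold below are assumptions for illustration; the summary does not specify the exact normalization scheme Nemotron-Flash uses.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def clip_row_outliers(model: nn.Module, max_ratio: float = 3.0) -> None:
    """Rescale any weight-matrix row whose L2 norm exceeds `max_ratio` times
    its matrix's mean row norm. A crude proxy for suppressing structured
    outliers; the rule and threshold are assumptions, not the paper's scheme."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            row_norms = w.norm(dim=1, keepdim=True)
            cap = max_ratio * row_norms.mean()
            scale = (cap / row_norms).clamp(max=1.0)  # shrink only oversized rows
            w.mul_(scale)

# Usage inside a hypothetical training loop:
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 128), torch.randn(32, 128)
for _ in range(10):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    clip_row_outliers(model)   # keep weight rows from drifting into outliers
print(f"final loss: {loss.item():.4f}")
```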