NVIDIA's New Models Go Live: 4B Inference Accelerates 53x, New Attention Architecture Surpasses Mamba 2
36Ke · 2025-08-27 02:03
Core Insights
- NVIDIA has launched a new series of small models called Jet-Nemotron, developed by an all-Chinese team and featuring two innovations: Post Neural Architecture Search (PostNAS) and a new linear attention module called JetBlock [1][2][8]
- Jet-Nemotron models (2B and 4B) outperform leading open-source models such as Qwen3, Gemma3, and Llama3.2 across dimensions including math, code, commonsense, retrieval, and long-context accuracy [2][20]
- Inference throughput on H100 GPUs is significantly enhanced, with increases of up to 53.6 times [4][20]

Model Performance
- Jet-Nemotron-2B and Jet-Nemotron-4B demonstrate superior performance in benchmark tests, with Jet-Nemotron-4B achieving 65.2% accuracy on MMLU, compared to Qwen3's 60.3% [21]
- In long-context scenarios, Jet-Nemotron shows a dramatic throughput increase, reaching up to a 50-fold improvement over Qwen3-1.7B [5][20]
- The models are also markedly faster than Qwen3-1.7B-Base, with Jet-Nemotron-2B 47 times faster and Jet-Nemotron-4B 21 times faster [20]

Innovations
- PostNAS enables efficient architecture exploration and adaptation on top of pre-trained Transformer models, significantly reducing the cost and risk of developing new language model architectures [9][10][14]
- JetBlock, a new linear attention module, combines dynamic convolution with hardware-aware architecture search, delivering substantial accuracy improvements while maintaining training and inference throughput similar to previous designs [18][20]

Technical Specifications
- Jet-Nemotron models have been optimized across parameters including cache size and throughput, with configurations reaching a maximum throughput of 2,885 tokens per second [21]
- The models use a flexible attention-block design that improves performance on long-context and complex reasoning tasks [16][18]
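The JetBlock idea described above, pairing linear attention with dynamic convolution, can be illustrated with a minimal NumPy sketch. The shapes, the per-token kernel generator, and the decision to apply the dynamic convolution to the values are all assumptions made for illustration; this is a conceptual toy, not NVIDIA's implementation.

```python
import numpy as np

def dynamic_conv(x, kernel_gen_w, kernel_size=4):
    """Causal depthwise convolution whose kernel is generated per token from
    the input itself (the 'dynamic' part, vs. a fixed convolution kernel).
    x: (seq, dim); kernel_gen_w: (dim, kernel_size) hypothetical generator weights."""
    t, d = x.shape
    logits = x @ kernel_gen_w                                    # (t, K): one kernel per token
    k = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # softmax over taps
    x_pad = np.concatenate([np.zeros((kernel_size - 1, d)), x])  # causal left-padding
    out = np.zeros_like(x)
    for i in range(t):
        window = x_pad[i:i + kernel_size]    # (K, d): current token and its K-1 predecessors
        out[i] = k[i] @ window               # mix the window with this token's kernel
    return out

def linear_attention(q, k, v):
    """O(seq) recurrent form of (unnormalized) linear attention: a running
    d x d state accumulates outer(k_i, v_i); each output is q_i @ state,
    so the per-step cost is constant regardless of context length."""
    t, d = q.shape
    state = np.zeros((d, d))
    out = np.zeros_like(q)
    for i in range(t):
        state += np.outer(k[i], v[i])
        out[i] = q[i] @ state
    return out

# A JetBlock-style step (sketch): dynamic convolution on values, then linear attention.
rng = np.random.default_rng(0)
t, d = 8, 16
q, kk, v = rng.standard_normal((3, t, d))
w = rng.standard_normal((d, 4))
y = linear_attention(q, kk, dynamic_conv(v, w))
print(y.shape)  # (8, 16)
```

The constant-size state in `linear_attention` is what drives the throughput gains: unlike full attention, there is no KV cache growing with the 256K context.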
NVIDIA Strikes Again! A New Hybrid-Architecture Model Debuts, with Two Major Innovations Delivering a 53.6x Throughput Speedup
机器之心 · 2025-08-26 09:38
Core Insights
- The article introduces Jet-Nemotron, a new hybrid-architecture language model developed by researchers from NVIDIA, which achieves state-of-the-art (SOTA) accuracy while significantly improving efficiency compared to existing full-attention models [2][8][9].

Model Performance
- Jet-Nemotron-2B outperforms several leading open-source full-attention models, including Qwen3, Qwen2.5, Gemma3, and Llama3.2, while achieving a throughput acceleration of up to 53.6 times on H100 GPUs at a 256K context length and maximum batch size [2][9].
- In benchmark tests such as MMLU and MMLU-Pro, Jet-Nemotron's accuracy surpasses that of advanced MoE full-attention models, despite those models having larger parameter counts [2][5].

Innovations and Techniques
- Jet-Nemotron is built on two core innovations: Post Neural Architecture Search (PostNAS) and JetBlock, a new linear attention module that significantly outperforms previous designs such as Mamba2 [6][21].
- PostNAS enables efficient architecture exploration and adaptation on pre-trained Transformer models, reducing the cost and risk associated with developing new language model architectures [12][16].

Efficiency and Accuracy
- The Jet-Nemotron architecture yields immediate improvements in efficiency and accuracy, translating into better service quality and reduced operational costs [17].
- The hardware-aware search conducted by PostNAS identifies architectures that maintain similar throughput while achieving higher accuracy with more parameters [18].

Comparative Results
- Jet-Nemotron-2B and Jet-Nemotron-4B demonstrate competitive accuracy against leading efficient language models, with Jet-Nemotron-2B 47 times faster and Jet-Nemotron-4B 21 times faster than Qwen3-1.7B-Base [23][24].
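The hardware-aware search that PostNAS performs over a pre-trained backbone can be illustrated with a toy loop: enumerate which layers keep full attention, discard placements that exceed a hardware budget, and keep the candidate with the best accuracy proxy. The layer count, the cost model, and the accuracy proxy below are invented for illustration; a real PostNAS run scores trained candidates on benchmarks rather than with a closed-form formula.

```python
import itertools

NUM_LAYERS = 12  # layers in the hypothetical pre-trained backbone

def proxy_accuracy(full_attn_layers):
    """Hypothetical proxy: assume layers near the middle of the network
    benefit most from retaining full attention."""
    mid = (NUM_LAYERS - 1) / 2
    return sum(1.0 - abs(i - mid) / NUM_LAYERS for i in full_attn_layers)

def decode_cost(full_attn_layers):
    """Hypothetical hardware cost model: full-attention layers dominate
    KV-cache size and decoding latency; linear-attention layers are cheap."""
    n_full = len(full_attn_layers)
    return 10.0 * n_full + 1.0 * (NUM_LAYERS - n_full)

def postnas_placement_search(max_cost=32.0):
    """Enumerate full-attention placements, drop those over the hardware
    budget, and return the highest-proxy-accuracy survivor: the
    'same throughput, better accuracy' selection the article describes."""
    candidates = []
    for n_full in range(NUM_LAYERS + 1):
        for keep in itertools.combinations(range(NUM_LAYERS), n_full):
            if decode_cost(keep) <= max_cost:
                candidates.append(keep)
    return max(candidates, key=proxy_accuracy)

print(postnas_placement_search())  # the two layers closest to the middle survive
```

Because every surviving candidate satisfies the same cost ceiling, the search trades nothing on throughput and spends its entire budget on accuracy, which is the core economic argument the article makes for retrofitting pre-trained models instead of training new architectures from scratch.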