Post Neural Architecture Search
NVIDIA strikes again! New hybrid-architecture model debuts, with two innovations delivering a 53.6x throughput speedup
机器之心· 2025-08-26 09:38
Core Insights
- The article introduces Jet-Nemotron, a new hybrid architecture language model developed by researchers from NVIDIA, which achieves state-of-the-art (SOTA) accuracy while significantly improving efficiency compared to existing full-attention models [2][8][9].

Model Performance
- Jet-Nemotron-2B outperforms several leading open-source full-attention models, including Qwen3, Qwen2.5, Gemma3, and Llama3.2, while achieving a throughput acceleration of up to 53.6 times on H100 GPUs with a context length of 256K and maximum batch size [2][9].
- In benchmark tests such as MMLU and MMLU-Pro, Jet-Nemotron's accuracy surpasses that of advanced MoE full-attention models, despite those models having larger parameter sizes [2][5].

Innovations and Techniques
- Jet-Nemotron is built on two core innovations: Post Neural Architecture Search (PostNAS) and JetBlock, a new linear attention module that significantly enhances performance compared to previous designs like Mamba2 [6][21].
- PostNAS allows for efficient architecture exploration and adaptation on pre-trained Transformer models, reducing the cost and risk associated with developing new language model architectures [12][16].

Efficiency and Accuracy
- The architecture of Jet-Nemotron enables immediate improvements in efficiency and accuracy, leading to better service quality and reduced operational costs [17].
- The hardware-aware search conducted by PostNAS identifies architectures that maintain similar throughput while achieving higher accuracy with more parameters [18].

Comparative Results
- Jet-Nemotron-2B and Jet-Nemotron-4B demonstrate competitive accuracy against leading efficient language models, with Jet-Nemotron-4B being 21 times faster and Jet-Nemotron-2B being 47 times faster than Qwen3-1.7B-Base [23][24].
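The summary attributes much of the long-context speedup to replacing full attention with linear attention. As background, the sketch below shows generic causal linear attention in plain Python. It is not JetBlock itself, and it omits the feature map and normalization that real designs add. The function names are illustrative assumptions. The point is that the recurrent form carries a fixed-size state instead of a cache that grows with context length, yet produces the same outputs as the token-by-token quadratic form.

```python
# Illustrative only: generic causal linear attention, NOT the JetBlock module.
# No feature map or normalization is applied; both are simplifying assumptions.

def linear_attention_recurrent(queries, keys, values):
    """Process tokens one at a time, carrying a fixed-size d_k x d_v state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # S = sum over past tokens of k v^T
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Rank-1 update S += k v^T; the state size does not depend on sequence length.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read-out o = q^T S
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


def linear_attention_quadratic(queries, keys, values):
    """Same result the pairwise way: o_t = sum over s <= t of (q_t . k_s) v_s."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * d_v
        for s in range(t + 1):  # causal: only the current and earlier tokens
            score = sum(a * b for a, b in zip(q, keys[s]))
            for j in range(d_v):
                out[j] += score * values[s][j]
        outputs.append(out)
    return outputs
```

The quadratic form must keep every past key and value, which is what a full-attention KV cache does. The recurrent form keeps only the `d_k x d_v` state, which is why memory traffic stays flat as the context grows.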
New work from Song Han's team at NVIDIA: an efficient language model with Post Neural Architecture Search
量子位· 2025-08-26 08:11
Core Insights
- The article discusses the launch of Jet-Nemotron, a new efficient language model based on Post Neural Architecture Search, which outperforms existing models in various benchmarks and achieves significant speed improvements in throughput [1][6][24].

Performance Metrics
- Jet-Nemotron-2B shows a throughput increase of 47 times compared to Qwen3-1.7B-Base, with a cache size reduced to 1/47 [3].
- In mathematical tasks, Jet-Nemotron-2B achieves an average accuracy of 49.6, surpassing Qwen3-1.7B-Base by 6.3 points while being 47 times faster [26].
- For common sense reasoning tasks, Jet-Nemotron-2B reaches an average accuracy of 62.0, outperforming all baseline models [30].
- In retrieval tasks, Jet-Nemotron-2B performs better than all baseline models except Qwen3-1.7B-Base [33].
- Jet-Nemotron-4B achieves a peak average accuracy of 76.2 while maintaining a 21 times speed advantage over Qwen3 [34].

Model Architecture
- Jet-Nemotron is built on Post Neural Architecture Search, which optimizes the placement of attention layers and selects the best linear attention modules [6][10].
- The model incorporates a new linear attention module called JetBlock, which uses a kernel generator for dynamic convolution kernel generation [17][18].
- Hardware-aware architecture search is employed to optimize model parameters for better accuracy without compromising throughput [19][22].

Future Developments
- The research team plans to release the code and model on GitHub, pending legal compliance review [23].
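The summary says only that JetBlock "uses a kernel generator for dynamic convolution kernel generation". To make the idea of a dynamic (input-dependent) convolution concrete, here is a minimal plain-Python sketch. Everything specific in it is an assumption for illustration: the generator is a single linear map, the kernel is shared across channels, and the names `generate_kernels`, `dynamic_causal_conv`, and `weight` are hypothetical. The actual JetBlock design may differ in all of these respects.

```python
# Hypothetical sketch of a dynamic causal convolution. NOT the JetBlock code.
# Assumption: each token produces its own K-tap kernel through one linear map.

def generate_kernels(tokens, weight):
    """Map each token (size d) to a K-tap kernel using a K x d weight matrix."""
    return [
        [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]
        for x in tokens
    ]


def dynamic_causal_conv(values, kernels):
    """Mix values[t], values[t-1], ... using the kernel generated at position t."""
    num_tokens, dim = len(values), len(values[0])
    num_taps = len(kernels[0])
    out = []
    for t in range(num_tokens):
        mixed = [0.0] * dim
        for tap in range(num_taps):
            src = t - tap  # tap 0 is the current token, tap 1 the previous one
            if src < 0:
                continue  # causal: nothing exists before the sequence start
            for j in range(dim):
                mixed[j] += kernels[t][tap] * values[src][j]
        out.append(mixed)
    return out
```

A static convolution would reuse one fixed kernel at every position. Here the kernel at each position is computed from the input, so the mixing weights change with the content of the sequence.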