NeurIPS 2025 Spotlight | Selective Knowledge Distillation with Precise Filtering: AdaSPEC, a Speculative Decoding Accelerator, Is Here
机器之心· 2025-11-06 03:28
Core Insights
- The article introduces AdaSPEC, an innovative selective knowledge distillation method aimed at enhancing speculative decoding in large language models (LLMs) [3][9][16]
- AdaSPEC focuses on improving the alignment between draft models and target models by filtering out difficult-to-learn tokens, thereby increasing the overall token acceptance rate without compromising generation quality [3][11][16]

Research Background
- LLMs excel in reasoning and generation tasks but face high inference latency and computational costs due to their autoregressive decoding mechanism [6]
- Traditional acceleration methods like model compression and knowledge distillation often sacrifice generation quality for speed [6]

Method Overview
- AdaSPEC employs a selective token filtering mechanism that allows draft models to concentrate on "easy-to-learn" tokens, enhancing their alignment with target models [3][9]
- The method uses a two-stage training framework: first, it identifies difficult tokens using a reference model; then it filters the training dataset so the draft model is optimized only on the remaining tokens (see the sketch after this entry) [11][12]

Experimental Evaluation
- The research team conducted systematic evaluations across model families (Pythia, CodeGen, Phi-2) and tasks (GSM8K, Alpaca, MBPP, CNN/DailyMail, XSUM), demonstrating consistent and robust improvements in token acceptance rates [14]
- Key experimental results indicate that AdaSPEC outperforms the current state-of-the-art DistillSpec method, with token acceptance rates increasing by up to 15% across tasks [15]

Summary and Outlook
- AdaSPEC represents a precise, efficient, and broadly applicable paradigm for accelerating speculative decoding, paving the way for future research and industrial deployment of efficient LLM inference [16]
- The article suggests two avenues for further exploration: dynamic estimation mechanisms for token difficulty, and applying AdaSPEC to multimodal and reasoning-focused large models [17]
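The two-stage filter described above lends itself to a compact illustration. Below is a minimal PyTorch sketch of selective knowledge distillation in the spirit of AdaSPEC: per-token difficulty is scored as the KL divergence between the target's and the draft's predictive distributions, and the distillation loss is computed only on the easiest tokens. The scoring rule, the `keep_ratio` knob, and the function names are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch of selective KD in the spirit of AdaSPEC, assuming
# PyTorch. The KL-based difficulty score and keep_ratio are hypothetical.
import torch
import torch.nn.functional as F

def selective_kd_loss(draft_logits: torch.Tensor,
                      target_logits: torch.Tensor,
                      keep_ratio: float = 0.8) -> torch.Tensor:
    """Distill the draft model only on 'easy-to-learn' tokens.

    draft_logits, target_logits: (batch, seq_len, vocab)
    keep_ratio: hypothetical fraction of tokens kept after filtering.
    """
    # Stage 1: score per-token difficulty as KL(target || draft).
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits, dim=-1)
    per_token_kl = (p_target * (p_target.clamp_min(1e-9).log()
                                - log_p_draft)).sum(-1)   # (batch, seq_len)

    # Stage 2: keep only the easiest tokens and distill on those.
    flat = per_token_kl.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = flat.kthvalue(k).values         # difficulty cutoff
    mask = (per_token_kl <= threshold).float()  # 1 = easy token, kept

    return (per_token_kl * mask).sum() / mask.sum().clamp_min(1.0)

# Toy usage with random logits standing in for real model outputs.
loss = selective_kd_loss(torch.randn(2, 16, 100), torch.randn(2, 16, 100))
```

The filter here is static per batch; the outlook bullet on dynamic difficulty estimation suggests the cutoff could instead adapt over the course of training.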
NVIDIA's New Open-Source Model: Triple the Throughput, Runs on a Single GPU, and Claims Reasoning SOTA
量子位· 2025-07-29 05:05
Core Viewpoint
- NVIDIA has launched Llama Nemotron Super v1.5, an open-source model designed for complex reasoning and agent tasks, achieving state-of-the-art performance while tripling throughput compared to its predecessor and running efficiently on a single GPU [2][11]

Model Introduction
- Llama Nemotron Super v1.5 is an upgraded version of Llama-3.3-Nemotron-Super-49B-V1, tailored specifically for complex reasoning and intelligent agent tasks [3]

Model Architecture
- The model employs Neural Architecture Search (NAS) to balance accuracy and efficiency, effectively converting throughput improvements into lower operational costs [4]
- NAS generates non-standard, non-repetitive network modules, introducing two key changes compared to traditional Transformers (see the block sketch after this entry):
  - A skip-attention mechanism, which bypasses the attention layer in certain modules [6]
  - A variable feed-forward network (FFN), where different modules use varying expansion/compression ratios [7]

Efficiency Improvements
- The model reduces FLOPs by skipping attention or altering FFN widths, allowing for more efficient operation under resource constraints [8]
- A block-wise distillation process was applied to the original Llama model, constructing multiple variants of each module and searching for the optimal combination [9]

Training and Dataset
- The model was trained on 40 billion tokens from three datasets (FineWeb, Buzz-V1.2, and Dolma), focusing on English single-turn and multi-turn conversations [10]
- Post-training combined supervised fine-tuning with reinforcement learning to enhance performance on key tasks such as coding, mathematics, reasoning, and instruction following [10]

Deployment and Ecosystem
- NVIDIA's AI models are optimized to run on NVIDIA GPU-accelerated systems, achieving significant speed improvements over CPU-only solutions [12]
- Llama Nemotron Super v1.5 is now open-source and available to developers on build.nvidia.com or via Hugging Face (a hedged loading example follows this entry) [13]

Ecosystem and Model Series
- The Llama Nemotron ecosystem integrates large language models, training and inference frameworks, optimization tools, and enterprise deployment solutions for high-performance AI application development [14]
- NVIDIA has introduced three series of large language models, Nano, Super, and Ultra, catering to different deployment needs and user profiles [16]
- The Super series, including Llama Nemotron Super v1.5, balances accuracy and computational efficiency for single-GPU use [17]

Enterprise Support
- The Nemotron models have gained support from major enterprises such as SAP, Microsoft, and Deloitte for building AI agent platforms aimed at enterprise-level process automation and complex problem-solving [17]
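To make the two NAS-derived changes concrete, here is a toy PyTorch block in which the attention sublayer can be skipped outright and the FFN expansion ratio is set per module. The class name, dimensions, and ratio values are illustrative assumptions; the actual Nemotron modules are produced by the automated search, not hand-picked settings.

```python
# Illustrative sketch (not the real Nemotron code): a Transformer-style
# block supporting NAS-style skip-attention and per-block FFN ratios.
import torch
import torch.nn as nn

class SearchedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 ffn_ratio: float, skip_attention: bool = False):
        super().__init__()
        self.skip_attention = skip_attention
        if not skip_attention:
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        hidden = int(d_model * ffn_ratio)   # expansion (>1) or compression (<1)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.skip_attention:         # skip-attention variant omits this path
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))  # FFN width varies per block

# A searched stack mixes variants: a full block, then a cheaper skip/narrow one.
stack = nn.Sequential(SearchedBlock(1024, 16, ffn_ratio=4.0),
                      SearchedBlock(1024, 16, ffn_ratio=1.5, skip_attention=True))
out = stack(torch.randn(1, 8, 1024))        # (batch, seq, d_model)
```

Skipping attention removes that block's quadratic-in-sequence-length term entirely, which is where most of the FLOP savings described above would come from.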
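For readers who want to try the open-sourced checkpoint, a hedged loading example with the Hugging Face transformers library is shown below. The repo id is inferred from the model name in the article, and the trust_remote_code flag is a guess for a custom architecture; verify both on the Hugging Face model page before use.

```python
# Hedged usage example; the repo id is assumed from the model name in the
# article and should be verified on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",       # the article says one large GPU suffices
    trust_remote_code=True,  # assumption: custom Nemotron block definitions
)

messages = [{"role": "user",
             "content": "Summarize speculative decoding in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0],
                       skip_special_tokens=True))
```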