Speculative Sampling
Tencent upgrades AngelSlim: the first speculative sampling training framework to unify LLM, VLM, and speech modalities, boosting inference speed 1.8x
机器之心· 2026-01-16 01:55
Core Insights
- The article discusses the challenges of high inference cost and latency in large-model applications, highlighting the industry's need for cost reduction and efficiency gains [2]
- Speculative sampling is introduced as an inference-acceleration paradigm that offers near-lossless speedup and is gaining popularity in the industry [2]
- Tencent's upgraded AngelSlim training framework applies speculative sampling across multiple modalities, achieving significant inference speed improvements [2]

Group 1: AngelSlim and Speculative Sampling
- Speculative sampling uses a lightweight draft model to generate multiple candidate tokens, which are then verified by the larger target model, effectively parallelizing decoding and reducing latency (a minimal decoding sketch follows this summary) [4]
- AngelSlim integrates multiple compression algorithms, including quantization and speculative sampling, to support multi-modal model training, achieving acceleration of 1.4x to 1.9x [4][6]
- The framework emphasizes deployment readiness: models trained with AngelSlim can be integrated directly into existing inference frameworks such as vLLM and SGLang [7]

Group 2: Key Features of AngelSlim
- AngelSlim supports full-modality speculative sampling training, sharing core algorithms and engineering capabilities across modalities [6]
- The data processing module provides a stable, reusable data foundation for multi-modal training, including data resampling and preprocessing [12][13]
- The model module exposes a unified TargetModel interface, so new model architectures can be integrated without modifying the core algorithms [18]

Group 3: Training Components and Performance
- The training module supports both online and offline training modes, catering to different model sizes and memory constraints [20]
- Training includes training-time testing, which lets the draft model learn from its own predictions during training [21]
- Models trained with AngelSlim have demonstrated speedups of 1.4x to 1.9x across a range of tasks under the reported conditions [25]

Group 4: Future Plans
- Future work will strengthen speculative sampling through tooling and algorithm advances, including offline hidden-state generation and deeper integration of multi-modal features [30]
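The draft-and-verify loop referenced above can be sketched minimally as follows, in the greedy case. `draft_model`, `target_model`, and `k` are hypothetical stand-ins rather than AngelSlim's actual API, and verification is shown position by position for clarity; a real implementation scores all k draft positions in a single target-model forward pass.

```python
from typing import Callable, List

def speculative_decode_step(
    tokens: List[int],
    draft_model: Callable[[List[int]], List[float]],
    target_model: Callable[[List[int]], List[float]],
    k: int = 4,
) -> List[int]:
    """One draft-and-verify step: propose k tokens cheaply, verify with the target."""
    # Draft phase: the small model proposes k tokens autoregressively.
    ctx = list(tokens)
    proposals: List[int] = []
    for _ in range(k):
        logits = draft_model(ctx)
        tok = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        proposals.append(tok)
        ctx.append(tok)

    # Verify phase: the target model checks each proposal; shown sequentially
    # here, but in practice all k positions are scored in one forward pass.
    accepted: List[int] = []
    ctx = list(tokens)
    for tok in proposals:
        logits = target_model(ctx)
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != tok:
            accepted.append(best)  # take the target's token at the first mismatch
            break
        accepted.append(tok)       # proposal confirmed, keep going
        ctx.append(tok)
    return tokens + accepted
```

Because the target model confirms every emitted token, the output matches what it would have produced on its own, which is why the speedup is near-lossless: the gain comes from accepting several tokens per target-model pass.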
Tencent releases the SpecExit algorithm: lossless compression with 2.5x end-to-end acceleration, tackling the long-reasoning efficiency problem of large models
机器之心· 2025-10-24 03:40
Core Insights
- The article introduces the SpecExit method, which integrates early stopping with speculative sampling to improve the efficiency of Large Reasoning Models (LRMs), reducing reasoning-chain length by 66% and achieving a 2.5x end-to-end acceleration on vLLM [2][9][28]

Group 1: Challenges and Innovations
- Early stopping in reasoning models is difficult: training-based methods carry high training costs and potential reliability issues, while training-free methods often incur additional computational overhead [5][10]
- SpecExit leverages the natural advantages of speculative sampling to keep model outputs consistent while extracting reasoning-progress signals from the draft model's hidden states [9][10]
- The SpecExit framework enables dynamic, reliable early stopping without introducing extra detection cost, achieving significant acceleration over baseline methods [9][22]

Group 2: SpecExit Methodology
- Training constructs data from the model's complete outputs, labels signals such as confidence, remaining reasoning length, and reasoning progress, and uses multi-task learning to optimize these signals alongside token classification [13][14][15]
- At decoding time, an exponentially weighted moving average smooths the signals so that early-stopping decisions remain robust (a minimal sketch follows this summary) [19][21]

Group 3: Experimental Results
- Across benchmarks, SpecExit significantly shortens reasoning, with reductions of 54% and 53% on the GSM8K and ARC-Challenge datasets, respectively, while maintaining accuracy [23][24]
- Compared with other early-stopping methods, SpecExit both shortens reasoning and delivers substantial inference-speed gains, making it more practical for real-world applications [25][28]

Group 4: Conclusion
- SpecExit generalizes well across diverse tasks and models, showing that hidden states can serve as efficient carriers of reasoning-progress information, which may guide future research in this area [28]
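The smoothing step described in Group 2 can be sketched as below. `progress_scores`, `alpha`, and `threshold` are hypothetical names; extracting the raw progress signal from the draft model's hidden states is out of scope here, so this shows only the EWMA-gated stop decision.

```python
from typing import Iterable, Optional

def should_stop(progress_scores: Iterable[float],
                alpha: float = 0.9,
                threshold: float = 0.95) -> Optional[int]:
    """Return the first step at which the smoothed progress signal crosses
    `threshold`, or None if it never does."""
    ewma = None
    for step, score in enumerate(progress_scores):
        # Exponentially weighted moving average damps single-step spikes.
        ewma = score if ewma is None else alpha * ewma + (1 - alpha) * score
        if ewma >= threshold:
            return step  # stop reasoning and emit the answer from here
    return None
```

With heavy smoothing (alpha close to 1), a single optimistic spike in the raw signal is not enough to trigger the stop, which is the point of smoothing: the exit fires only once the progress estimate is persistently high.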
2.18x inference acceleration for ultra-large models: SGLang and Meituan's tech team open-source a speculative sampling training framework
量子位· 2025-07-26 09:01
Core Viewpoint
- SpecForge is an open-source training framework for speculative sampling, tailored for very large models, achieving a 2.18x inference acceleration [1][15]

Group 1: SpecForge Overview
- SpecForge is developed by the SGLang team in collaboration with Meituan's search and recommendation platform and Cloudsway.AI [1]
- The framework addresses the drop in inference efficiency that comes with ever-larger models [4][6]
- SpecForge integrates deeply with the SGLang inference engine, providing a seamless path from speculative sampling training to deployment [5][7]

Group 2: Technical Features
- The framework builds on Eagle3, an advanced speculative sampling method that boosts inference speed by training a lightweight draft model to closely match the target model's token distribution [7]
- SpecForge supports mainstream model families, including complex MoE layers and Transformer variants, ensuring broad applicability [7]
- It offers scalable distributed training via Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP), optimizing resource utilization on GPU clusters [7][14]

Group 3: Training Modes and Efficiency
- SpecForge offers two training modes, Online and Offline, so users can choose based on their needs and available resources [10][17]
- The Training-Time Test (TTT) architecture improves the draft model's robustness and encapsulates the complex unrolling logic to simplify adoption (a minimal sketch follows this summary) [9]
- The framework is designed for memory-efficient training, significantly reducing memory overhead even for trillion-parameter models [7]

Group 4: Experimental Validation
- SpecForge was validated on datasets such as ShareGPT and UltraChat, demonstrating compatibility with the Eagle3 architecture [15]
- Draft models trained with SpecForge achieved a 2.18x inference acceleration on the MT-Bench benchmark [15]

Group 5: Future Developments
- The roadmap includes support for additional model architectures and integration of vision-language models (VLM) into the framework [22]
- The team aims to further improve training efficiency through better parallelization strategies and kernel optimizations [22]
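One reading of the training-time-test idea is sketched below: during training, the draft model is unrolled on its own hidden-state outputs for several steps, so it stays accurate deeper into a speculation chain, mirroring how it is chained at inference time. The `draft_model` interface (returning logits plus the next hidden state) and all names here are assumptions, not SpecForge's actual API.

```python
import torch
import torch.nn.functional as F

def ttt_loss(draft_model, hidden: torch.Tensor, target_tokens: torch.Tensor,
             steps: int = 3) -> torch.Tensor:
    """Cross-entropy accumulated over a multi-step self-unroll of the draft model.

    hidden:        [batch, dim] target-model hidden states for the current prefix
    target_tokens: [batch, steps] ground-truth next tokens for each unroll step
    """
    loss = torch.zeros((), device=hidden.device)
    state = hidden
    for s in range(steps):
        # The draft model consumes its own previous state, so training matches
        # the compounding-error regime it will face during speculation.
        logits, state = draft_model(state)
        loss = loss + F.cross_entropy(logits, target_tokens[:, s])
    return loss / steps
```

Training on the multi-step unroll rather than single-step prediction is what makes the draft model robust when its own (imperfect) outputs are fed back in, which is the failure mode that otherwise caps speculation depth.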
A 0.5B model punches above its weight to set a new on-device SOTA: runs on a 4090, 5x typical speedup on long-text processing | open-sourced by Tsinghua & ModelBest
量子位· 2025-06-10 07:35
Contributed by Tsinghua University & ModelBest (面壁智能) to 量子位 | WeChat account QbitAI

A king of on-device cost-effectiveness: the Tsinghua University and ModelBest team have open-sourced a new model, MiniCPM 4, available in 8B and 0.5B parameter sizes. Using only 22% of the training cost of open-source models in the same class, it achieves the best performance in its class.

MiniCPM4-8B is the first open-source natively sparse model; backed by an extreme 5% sparsity, it makes long-text processing and deep reasoning genuinely run on-device. On benchmarks such as MMLU, CEval, MATH500, and HumanEval, at only 22% of the training cost, its performance matches Qwen-3-8B and surpasses Gemma-3-12B.

MiniCPM4-0.5B likewise punches above its weight: on MMLU, CEval, BBH, HumanEval, and other benchmarks it outperforms the same-class Qwen-3-0.6B, Llama 3.2, and Gemma 3, and through native QAT it achieves nearly lossless int4 quantization with inference speeds of 600 tokens/s (a generic QAT sketch follows below). On common edge chips such as the Jetson AGX Orin and RTX 4090, MiniCPM 4 delivers a 5x speedup on long-text processing in typical settings and up to 100x in extreme scenarios.

The team has publicly released a technical report; the model ...
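For context on why QAT can yield nearly lossless int4 models, here is a generic fake-quantization sketch using a straight-through estimator. It illustrates the standard technique only and is not MiniCPM4's native QAT recipe, which is not detailed in this excerpt.

```python
import torch

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    """Round weights to a signed int4 grid in the forward pass while letting
    gradients flow through unchanged (straight-through estimator)."""
    qmin, qmax = -8, 7                                # signed int4 range
    scale = w.abs().max().clamp(min=1e-8) / qmax      # per-tensor scale (simplified)
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    # Forward value is the dequantized weight; backward gradient is identity.
    return w + (q * scale - w).detach()
```

Applying this inside the training forward pass lets the network adapt its weights to the int4 grid during training, rather than being quantized after the fact, which is why post-training accuracy loss can approach zero.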