Large Model Inference Acceleration
Tencent Releases the SpecExit Algorithm: Lossless Compression with 2.5x End-to-End Speedup, Tackling the Long-Reasoning Efficiency Problem of Large Models
机器之心· 2025-10-24 03:40
Core Insights
- The article introduces the SpecExit method, which integrates early stopping with speculative sampling to improve the efficiency of Large Reasoning Models (LRMs), reducing reasoning chain length by 66% and achieving a 2.5x end-to-end speedup on vLLM [2][9][28].

Group 1: Challenges and Innovations
- Existing early-stopping approaches for reasoning models face a trade-off: training-based methods carry high training costs and potential reliability issues, while training-free methods often incur additional computational overhead [5][10].
- SpecExit leverages the natural advantages of speculative sampling to keep model outputs consistent while extracting reasoning-progress signals from the draft model's hidden states [9][10].
- The SpecExit framework enables dynamic and reliable early stopping without introducing extra detection cost, achieving significant acceleration over baseline methods [9][22].

Group 2: SpecExit Methodology
- Training data is constructed from the model's complete outputs, labeled with signals such as confidence, remaining reasoning length, and reasoning progress; multi-task learning optimizes these signals alongside token classification [13][14][15].
- During decoding, an exponential weighted moving average smooths the signals so that early-stopping decisions remain robust (see the sketch after this summary) [19][21].

Group 3: Experimental Results
- Evaluations on multiple benchmarks show that SpecExit substantially shortens reasoning, with length reductions of 54% and 53% on GSM8K and ARC-Challenge, respectively, while maintaining accuracy [23][24].
- Compared with other early-stopping methods, SpecExit not only shortens reasoning chains but also delivers substantial inference-speed gains, making it more practical for real-world deployment [25][28].

Group 4: Conclusion
- SpecExit generalizes well across diverse tasks and models, and its results suggest that hidden states can serve as efficient signals of reasoning progress, pointing to directions for future research [28].
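To make the early-stopping mechanism concrete, here is a minimal Python sketch of EWMA-smoothed exit signals. The smoothing factor, thresholds, and two-signal layout (confidence, reasoning progress) are illustrative assumptions rather than SpecExit's published values; in the actual system the raw signals would come from probe heads over the draft model's hidden states during speculative decoding, at no extra forward cost.

```python
# Minimal sketch (not the actual SpecExit code) of EWMA-smoothed early-exit signals.
# Smoothing factor, thresholds, and dummy signal values are assumed for illustration.

class EarlyExitMonitor:
    def __init__(self, alpha=0.5, conf_thresh=0.8, progress_thresh=0.8):
        self.alpha = alpha                    # EWMA smoothing factor (assumed value)
        self.conf_thresh = conf_thresh        # smoothed confidence needed to stop (assumed)
        self.progress_thresh = progress_thresh
        self.smoothed = None                  # running (confidence, progress)

    def update(self, confidence, progress):
        """Blend new raw signals into the EWMA and decide whether the
        reasoning chain can be terminated at this decoding step."""
        if self.smoothed is None:
            self.smoothed = (confidence, progress)
        else:
            c, p = self.smoothed
            self.smoothed = (self.alpha * confidence + (1 - self.alpha) * c,
                             self.alpha * progress + (1 - self.alpha) * p)
        c, p = self.smoothed
        return c >= self.conf_thresh and p >= self.progress_thresh


# Usage with dummy per-step signals; in SpecExit these would be predicted from
# the draft model's hidden states at each speculative-decoding step.
monitor = EarlyExitMonitor()
dummy_steps = [(0.55, 0.40), (0.80, 0.70), (0.93, 0.96), (0.95, 0.98)]
for step, (conf, prog) in enumerate(dummy_steps):
    if monitor.update(conf, prog):
        # Triggers once both smoothed signals exceed their thresholds (step 3 here).
        print(f"early exit triggered at step {step}")
        break
```

The smoothing is what makes the decision robust: a single over-confident step does not end the reasoning chain; the exit fires only after the signals stay high across consecutive steps.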
Huawei Makes a Big AI Move!
证券时报· 2025-08-10 07:00
Core Viewpoint
- Huawei is set to release breakthrough AI inference technology that may reduce China's reliance on HBM (High Bandwidth Memory), boosting the performance of domestic large-model inference and strengthening China's AI inference ecosystem [1].

Group 1: AI Inference Technology
- Huawei will jointly release its latest AI inference application results with China UnionPay on August 12, introducing a major inference acceleration technology [1].
- HBM is crucial for the "data transportation" problem in inference; insufficient HBM degrades the user experience, causing task delays and slow responses (see the back-of-the-envelope sketch after this summary) [2].

Group 2: Forum and Expert Contributions
- Experts from the China Academy of Information and Communications Technology, Tsinghua University, and iFlytek will share practices on large-model inference acceleration and experience optimization at the "2025 Financial AI Inference Application Implementation and Development Forum" on August 12 [3].

Group 3: Event Schedule
- Opening remarks [5]
- Introduction and release ceremony of UnionPay's inference application results [5]
- Presentation of Huawei's AI storage inference acceleration solution [5]
- Discussion on large-model inference optimization and new paradigms for industrial implementation [5]
- Presentation on a KV-Cache-storage-centered large-model inference architecture by Tsinghua University [5]
- iFlytek's high-performance inference practices on the MaaS platform [5]
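To illustrate why HBM capacity becomes the bottleneck described above, here is a back-of-the-envelope sketch of KV cache size per request. The model shape (80 layers, 8 KV heads, head dimension 128) and the 128K-token context are assumed values for illustration, not figures from the article; the formula itself is the standard KV cache accounting (keys plus values, across layers, heads, and tokens).

```python
# Back-of-the-envelope sketch of why KV cache growth puts pressure on HBM capacity.
# Model shape and context length below are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size for one request: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Example: a 70B-class model with grouped-query attention (assumed shape), fp16 cache.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB per 128K-token request")  # ~39.1 GiB with these values
```

At tens of GiB per long-context request, a handful of concurrent users can exhaust an accelerator's HBM, which illustrates the "data transportation" pressure the article attributes to insufficient HBM.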
ICML 2025 | How Can "Autocomplete" Deliver a 3× Speedup for 100K-Token Generation?
机器之心· 2025-05-18 04:25
Core Viewpoint
- The article examines the difficulty of generating ultra-long texts with today's large models and introduces TokenSwift, a new inference acceleration framework that markedly improves efficiency while preserving output quality [1][27][29].

Group 1: Challenges in Long Text Generation
- Traditional autoregressive decoding generates one token at a time, so performance degrades sharply as sequences grow to 100,000 tokens or more [4][5].
- The main bottlenecks are model reload redundancy, KV cache inflation, and semantic repetition, which limit both the efficiency and the diversity of generated outputs [9][19].

Group 2: TokenSwift Framework
- TokenSwift is a lightweight, efficient framework that restructures traditional autoregressive inference around multi-token drafting, parallel validation, and dynamic cache updates [7][11].
- It generates multiple candidate tokens in parallel, substantially reducing model reload frequency and I/O time while preserving semantic relevance [12][17].

Group 3: Key Technical Innovations
- An n-gram heuristic completion mechanism reuses historical fragments to improve the accuracy of token drafting, keeping drafts semantically relevant (see the sketch after this summary) [14].
- A tree-structured parallel validation module checks drafted tokens against the standard autoregressive path, guaranteeing lossless output quality [15][17].
- Dynamic KV management mitigates cache inflation, and repetition penalties improve output diversity [19][26].

Group 4: Performance Evaluation
- Extensive experiments on mainstream models show that TokenSwift achieves speedups above 3x while keeping outputs consistent with the original models [21][22].
- The speedup grows with sequence length, cutting generation time for 100K-token tasks from nearly 5 hours to about 1.5 hours [22].

Group 5: Conclusion and Future Implications
- TokenSwift is not a new model but a general acceleration strategy that can be plugged into existing models such as LLaMA and Qwen, offering strong compatibility and easy deployment [28].
- Its lossless guarantee on inference quality makes it a solid technical foundation for future applications in multi-turn reasoning, code generation, and agent planning [29].
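The n-gram drafting and verify-then-accept idea can be illustrated with a short, self-contained Python sketch. The retrieval policy, draft length, and the stub verifier below are assumptions for illustration only; in TokenSwift the verification is a single batched forward pass of the target model over a tree of drafted candidates, which this toy example does not reproduce.

```python
# Illustrative sketch of n-gram heuristic drafting with verify-and-accept, loosely
# following TokenSwift's draft / parallel-verify idea. The verifier is a stub standing
# in for the target model; names and policies are assumptions, not the real implementation.

from collections import defaultdict

def build_ngram_table(tokens, n=3):
    """Map each (n-1)-token prefix seen in history to the tokens that followed it."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix].append(tokens[i + n - 1])
    return table

def draft_from_history(tokens, table, n=3, max_draft=4):
    """Greedily extend the sequence with tokens retrieved from the n-gram table."""
    draft, context = [], list(tokens)
    for _ in range(max_draft):
        prefix = tuple(context[-(n - 1):])
        if prefix not in table:
            break
        nxt = table[prefix][-1]          # most recent continuation (assumed policy)
        draft.append(nxt)
        context.append(nxt)
    return draft

def verify(context, draft, target_next_fn):
    """Accept the longest draft prefix matching what the target model would emit."""
    accepted = []
    for tok in draft:
        if target_next_fn(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Toy usage: the "target model" just follows a fixed pattern, so drafted tokens
# retrieved from history match it and several tokens are accepted in one step.
history = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "sat"]
table = build_ngram_table(history)
draft = draft_from_history(history, table)       # ['on', 'the', 'mat', 'and']
pattern = history + ["on", "the", "mat", "and", "the", "cat"]
accepted = verify(history, draft, lambda ctx: pattern[len(ctx)])
print(draft, accepted)
```

Because the verifier only accepts tokens the target model would have produced anyway, the output is identical to plain autoregressive decoding; the speedup comes from accepting several tokens per model call whenever the drafts are right.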