Large Model Inference Acceleration
Tencent Releases the SpecExit Algorithm: Lossless Compression with a 2.5× End-to-End Speedup, Tackling the Efficiency Problem of Long Thinking in Large Models
机器之心· 2025-10-24 03:40
Reasoning models (Large Reasoning Models, LRMs) such as DeepSeek-R1 achieve stronger results on a wide range of complex tasks by generating longer chains of thought. But the long chain of thought is a double-edged sword: it improves performance, yet the semantic redundancy introduced by "overthinking" substantially inflates inference cost. To crack the efficiency problem of long chains of thought, and to make end-to-end acceleration practical to deploy, we seamlessly fuse early exit from thinking with speculative sampling and propose SpecExit, which uses a lightweight draft model to predict an "exit signal". This avoids extra probing overhead while shortening the chain of thought by 66% and delivering a 2.5× end-to-end inference speedup on vLLM.

Paper: https://arxiv.org/abs/2509.24248
Open-source code: https://github.com/Tencent/AngelSlim

1. The Challenge of "Early Exit from Thinking"

Existing research on compressing LRM chains of thought falls roughly into two categories, training-based methods and training-free methods, each with its own limitations:

(1) Training-based methods shorten the chain of thought through supervised fine-tuning on labeled data or through reinforcement learning. Although the compression is substantial, these methods usually come with high training costs and alter the model's output distribution, raising concerns about model reliability ...
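The core idea described above is to piggyback the early-exit decision on the draft model that speculative decoding already runs, so no separate probing pass is needed. Below is a minimal, hypothetical sketch of that control flow; the function names, the exit threshold, and the stubbed model calls are assumptions for illustration, not the released AngelSlim implementation.

```python
# Minimal sketch of speculative decoding with an "exit signal" from the draft
# model, in the spirit of SpecExit. All model calls are stubbed; names and the
# exit threshold are illustrative assumptions, not the AngelSlim implementation.
import random

END_THINK = "</think>"      # token that closes the thinking segment
EXIT_THRESHOLD = 0.9        # assumed confidence needed to stop thinking early
DRAFT_LEN = 4               # tokens proposed per speculative step

def draft_step(context):
    """Stub draft model: propose DRAFT_LEN tokens plus an exit probability."""
    proposed = [f"tok{len(context) + i}" for i in range(DRAFT_LEN)]
    exit_prob = min(1.0, len(context) / 50)   # toy signal: grows with length
    return proposed, exit_prob

def verify_step(context, proposed):
    """Stub target model: accept a prefix of the proposed tokens (speculative verify)."""
    n_accept = random.randint(1, len(proposed))
    return proposed[:n_accept]

def generate_thinking(prompt_tokens, max_len=64):
    context = list(prompt_tokens)
    while len(context) < max_len:
        proposed, exit_prob = draft_step(context)
        accepted = verify_step(context, proposed)
        context.extend(accepted)
        # Early exit: if the draft model is confident the reasoning is complete,
        # close the thinking segment instead of continuing to draft.
        if exit_prob >= EXIT_THRESHOLD:
            context.append(END_THINK)
            break
    return context

print(generate_thinking(["<think>", "Q:"]))
```

Because the exit probability comes out of the same forward pass that produces the draft tokens, the early-stop decision adds essentially no latency on top of ordinary speculative decoding.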
Huawei's Big AI Move!
证券时报· 2025-08-10 07:00
Core Viewpoint
- Huawei is set to release groundbreaking technology in AI inference that may reduce China's reliance on HBM (High Bandwidth Memory) technology, enhancing the performance of domestic AI large model inference and improving the AI inference ecosystem in China [1].

Group 1: AI Inference Technology
- Huawei will jointly launch the latest AI inference application results with China UnionPay on August 12, introducing a significant inference acceleration technology [1].
- HBM is crucial for addressing "data transportation" issues; insufficient HBM can lead to poor user experience in AI inference, resulting in task delays and slow responses [2].

Group 2: Forum and Expert Contributions
- Experts from the China Academy of Information and Communications Technology, Tsinghua University, and iFlytek will share practices on large model inference acceleration and experience optimization at the "2025 Financial AI Inference Application Implementation and Development Forum" on August 12 [3].

Group 3: Event Schedule
- The event schedule includes:
  - Opening remarks [5]
  - Introduction and release ceremony of UnionPay's inference application results [5]
  - Presentation of Huawei's AI storage inference acceleration solution [5]
  - Discussion on large model inference optimization and new paradigms for industrial implementation [5]
  - Presentation on KV Cache storage-centered large model inference architecture by Tsinghua University [5]
  - iFlytek's high-performance inference practices on the MaaS platform [5]
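To give a rough sense of why the forum centers on KV cache storage and HBM capacity: for long-context serving, the KV cache alone can exceed the HBM of a single accelerator, which is what pushes architectures toward storage-centered cache management. The back-of-the-envelope calculation below uses generic, assumed model parameters (not figures from Huawei, UnionPay, or the Tsinghua work) purely to show the order of magnitude.

```python
# Back-of-the-envelope KV cache sizing, illustrating why long-context inference
# strains HBM capacity. All parameters are generic assumptions for illustration,
# not figures from any of the systems mentioned above.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Two tensors (K and V) per layer, each of shape [batch, seq_len, num_kv_heads, head_dim]
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Example: a 70B-class model with grouped-query attention (assumed shape),
# serving 8 concurrent requests at a 128K-token context in FP16.
gb = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                    seq_len=128_000, batch=8, dtype_bytes=2) / 1e9
print(f"KV cache ≈ {gb:.0f} GB")   # ≈ 336 GB, well beyond a single accelerator's HBM
```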
ICML 2025 | How Can "Autocomplete" Deliver a 3× Speedup for 100K-Token Generation?
机器之心· 2025-05-18 04:25
Core Viewpoint
- The article discusses the challenges of generating ultra-long texts in the era of complex large models and introduces TokenSwift, a new inference acceleration framework that significantly improves efficiency while maintaining output quality [1][27][29].

Group 1: Challenges in Long Text Generation
- Traditional autoregressive methods generate one token at a time, leading to performance degradation as sequence lengths increase to 100,000 tokens or more [4][5].
- The main bottlenecks include model redundancy, KV cache inflation, and semantic repetition, which hinder the efficiency and diversity of generated outputs [9][19].

Group 2: TokenSwift Framework
- TokenSwift proposes a lightweight and efficient framework that restructures traditional autoregressive inference by introducing a mechanism based on multi-token drafting, parallel validation, and dynamic cache updates [7][11].
- The framework allows for the parallel generation of multiple candidate tokens, significantly reducing model reload frequency and I/O time while ensuring semantic relevance [12][17].

Group 3: Key Technical Innovations
- The n-gram heuristic completion mechanism utilizes historical fragments to enhance the accuracy of token drafting, ensuring high semantic relevance [14].
- A tree-structured parallel validation module assesses the drafted tokens against standard autoregressive paths, ensuring lossless output quality [15][17].
- Dynamic KV management and repetition penalties are implemented to mitigate cache inflation and enhance output diversity, respectively [19][26].

Group 4: Performance Evaluation
- Extensive experiments on various mainstream models demonstrate that TokenSwift achieves acceleration ratios exceeding 3× while maintaining output quality consistent with the original models [21][22].
- The acceleration effect becomes more pronounced with longer sequences, reducing generation time for 100K-token tasks from nearly 5 hours to 1.5 hours [22].

Group 5: Conclusion and Future Implications
- TokenSwift is not a new model but a universal acceleration strategy that can be integrated into existing models like LLaMA and Qwen, offering strong compatibility and deployment convenience [28].
- The framework's lossless guarantee for inference quality positions it as a robust technical support for future applications in multi-turn reasoning, code generation, and agent planning [29].
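The drafting-and-verification mechanism summarized above can be pictured as a small loop: reuse n-grams already present in the generated text to propose several continuation tokens at once, then keep only the prefix the target model itself would have produced, so output quality stays unchanged. The sketch below is an illustrative toy version under stated assumptions; the helper names, the "most recent continuation" heuristic, and the greedy acceptance rule are my own simplifications, not the actual TokenSwift code.

```python
# Toy illustration of n-gram heuristic drafting with verification, in the spirit
# of TokenSwift's multi-token drafting + parallel validation. Helper names and
# the greedy acceptance rule are assumptions, not the actual TokenSwift code.
from collections import defaultdict

def build_ngram_table(tokens, n=3):
    """Map each (n-1)-token prefix to the tokens that followed it historically."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
    return table

def draft_from_ngrams(context, table, n=3, draft_len=4):
    """Cheaply extend the context with continuations seen earlier in the text."""
    draft, ctx = [], list(context)
    for _ in range(draft_len):
        key = tuple(ctx[-(n - 1):])
        if key not in table:
            break
        nxt = table[key][-1]          # most recent continuation as a cheap heuristic
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def verify(context, draft, target_next_token):
    """Accept the longest draft prefix the target model would also have produced."""
    accepted, ctx = [], list(context)
    for tok in draft:
        if target_next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Tiny usage example with a stub "target model" that follows a fixed pattern.
history = "the cat sat on the mat and the cat sat".split()
table = build_ngram_table(history)
stub_target = lambda ctx: {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}.get(ctx[-1])
draft = draft_from_ngrams(history, table)
print("draft:", draft, "accepted:", verify(history, draft, stub_target))
```

In a real system the verification of all drafted positions happens in one batched forward pass of the target model, which is where the speedup comes from; the loop above only conveys the acceptance logic.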