Large Language Model Inference Acceleration
EMNLP 2025 | LightThinker: A New Method for Dynamically Compressing CoT Reasoning
机器之心· 2025-08-28 04:33
Core Viewpoint
- The article discusses LightThinker, a method that improves the efficiency of large language models (LLMs) by compressing intermediate reasoning steps, reducing memory usage and computational cost while maintaining accuracy [6][27].

Group 1: LightThinker Overview
- LightThinker mimics human cognitive habits by dynamically compressing lengthy reasoning steps into concise representations, significantly reducing the number of tokens kept in the context window [6][27].
- The approach follows a cycle of generating, compressing, and discarding information, which keeps the context small and addresses memory overload and slow computation [14][27].

Group 2: Methodology
- The first step is data reconstruction: training data is rewritten to include "compression instructions" that tell the model when to compress (a toy sketch of this step follows this summary) [10].
- The second step is attention modification: a "Thought-based Attention Mask" controls what the model can attend to during reasoning, so it focuses only on essential information (also sketched below) [12].
- The third step is dynamic reasoning: the model learns to continue coherent reasoning from the compact summaries rather than from the lengthy original thoughts [14][17].

Group 3: Experimental Results
- LightThinker was evaluated on four datasets and two different base models, cutting peak memory usage by 70% and reasoning time by 26% while maintaining accuracy [21][27].
- The results indicate that LightThinker strikes a balance between accuracy and efficiency compared with traditional models [24][27].

Group 4: Limitations
- The current method struggles on mathematical tasks because its data reconstruction relies on rules rather than semantic understanding, which can cause information loss during compression [33].
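To make the data-reconstruction step more concrete, here is a minimal Python sketch of the general idea: split an existing chain-of-thought trace into steps and insert placeholder compression markers after each one. The marker strings and the function name are illustrative assumptions, not LightThinker's actual special tokens or training pipeline.

```python
from typing import List

COMPRESS_TOKEN = "<compress>"   # placeholder instruction: "summarize what you just wrote"
GIST_TOKENS = "[GIST] [GIST]"   # placeholder slots that will hold the compact summary

def reconstruct_example(question: str, steps: List[str], answer: str) -> str:
    """Rebuild one chain-of-thought training example so that every reasoning step
    is followed by a compression instruction and a small slot for its summary."""
    parts = [question]
    for step in steps:
        parts.extend([step, COMPRESS_TOKEN, GIST_TOKENS])
    parts.append(answer)
    return "\n".join(parts)

if __name__ == "__main__":
    print(reconstruct_example(
        "Q: A train covers 60 km in 40 minutes. What is its speed in km/h?",
        ["Step 1: 40 minutes is 2/3 of an hour.",
         "Step 2: speed = 60 / (2/3) = 90 km/h."],
        "A: 90 km/h",
    ))
```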
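The thought-based attention mask can likewise be pictured as a modified causal mask: tokens produced after compression may attend to the prompt and the compact gist tokens, but not to the discarded thought tokens. The PyTorch sketch below builds such a mask; the four-segment layout and the function name are assumptions for illustration, not the paper's implementation.

```python
import torch

def build_thought_mask(prompt_len: int, thought_len: int, gist_len: int, tail_len: int) -> torch.Tensor:
    """Illustrative causal mask over the layout [prompt | thought | gist | tail]:
    tokens in `tail` (everything generated after compression) may attend to the
    prompt and the gist tokens, but not to the thought tokens that were compressed
    away. True means "may attend"."""
    total = prompt_len + thought_len + gist_len + tail_len
    mask = torch.tril(torch.ones(total, total)).bool()   # standard causal mask

    thought_start, thought_end = prompt_len, prompt_len + thought_len
    tail_start = prompt_len + thought_len + gist_len

    # Block the discarded thought span for all post-compression query positions.
    mask[tail_start:, thought_start:thought_end] = False
    return mask

if __name__ == "__main__":
    m = build_thought_mask(prompt_len=4, thought_len=6, gist_len=2, tail_len=3)
    print(m.int())   # rows = query positions, columns = key positions
```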
2.18x Inference Speedup for Very Large Models! SGLang and Meituan's Technical Team Open-Source a Speculative Sampling Training Framework
量子位· 2025-07-26 09:01
Core Viewpoint
- SpecForge is an open-source training framework for speculative sampling, tailored to very large models, and achieves a 2.18x inference acceleration [1][15].

Group 1: SpecForge Overview
- SpecForge is developed by the SGLang team in collaboration with Meituan's search recommendation platform and Cloudsway.AI [1].
- The framework addresses the drop in inference efficiency that comes with ever-larger models [4][6].
- SpecForge integrates deeply with the SGLang inference engine, providing a seamless path from speculative-sampling training to inference [5][7].

Group 2: Technical Features
- The framework incorporates Eagle3, an advanced speculative sampling method that accelerates inference by training a lightweight draft model to predict the target model's token distribution accurately (a toy draft-and-verify loop is sketched after this summary) [7].
- SpecForge supports various mainstream models, including complex MoE layers and Transformer variants, ensuring broad applicability [7].
- It offers scalable distributed training via Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP), optimizing resource utilization on GPU clusters [7][14].

Group 3: Training Modes and Efficiency
- SpecForge offers two training modes, Online and Offline, so users can choose based on their needs and available resources (the contrast is sketched below) [10][17].
- The Training-Time Test (TTT) architecture improves the robustness of the draft model and encapsulates the complex parts of the process, simplifying implementation for users [9].
- The framework is designed for memory-efficient training, significantly reducing memory overhead even for trillion-parameter models [7].

Group 4: Experimental Validation
- SpecForge's effectiveness was validated on datasets such as ShareGPT and UltraChat, demonstrating compatibility with the Eagle3 architecture [15].
- Draft models trained with SpecForge achieved a 2.18x inference acceleration on the MT-Bench benchmark [15].

Group 5: Future Developments
- SpecForge's roadmap includes support for additional model architectures and integration of vision-language models (VLM) into the framework [22].
- The team aims to improve training efficiency through better parallelism strategies and kernel optimizations [22].
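For readers unfamiliar with speculative sampling, the toy loop below shows the draft-propose/target-verify pattern that a trained draft model enables. The toy "models" are deliberately trivial stand-ins, not SGLang, Eagle3, or SpecForge code; a real system verifies the whole drafted block in a single target forward pass, which is where the speedup comes from.

```python
from typing import List

def draft_model(prefix: List[int]) -> int:
    # Toy draft model: guesses the sequence keeps counting upward (mod 10).
    return (prefix[-1] + 1) % 10

def target_model(prefix: List[int]) -> int:
    # Toy target model: also counts upward but wraps at 7, so drafts sometimes miss.
    return (prefix[-1] + 1) % 7

def speculative_decode(prefix: List[int], steps: int, k: int = 4) -> List[int]:
    """Generate `steps` tokens. Each round the draft proposes k tokens; the target
    verifies them left to right, keeps the longest agreeing prefix, and emits one
    corrected token at the first disagreement. Greedy matching keeps this simple."""
    out = list(prefix)
    produced = 0
    while produced < steps:
        # 1) The cheap draft model proposes a block of k tokens.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft_model(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target verifies the proposal (batched in real implementations).
        for tok in proposal:
            if produced >= steps:
                break
            if target_model(out) == tok:
                out.append(tok)                 # accepted draft token
                produced += 1
            else:
                out.append(target_model(out))   # rejected: take the target's own token
                produced += 1
                break
    return out

if __name__ == "__main__":
    print(speculative_decode([0], steps=12))
```

In practice the acceptance rate of the drafted tokens determines the realized speedup, which is why a framework like SpecForge focuses on training draft models whose token distributions closely track the target model's.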
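The Online/Offline distinction can be pictured with toy PyTorch modules: offline mode runs the expensive target model once to cache its activations and then trains the draft against the cache, while online mode recomputes them every step. All class and function names below are hypothetical placeholders, not SpecForge APIs, and the loss is a simplification of what Eagle3-style training actually optimizes.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 100, 32

class ToyTarget(nn.Module):
    """Stand-in for the large, frozen target model whose activations the draft imitates."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.body = nn.Linear(HIDDEN, HIDDEN)
    def forward(self, ids):
        return self.body(self.emb(ids))

class ToyDraft(nn.Module):
    """Stand-in for the lightweight draft model being trained."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, HIDDEN)
    def forward(self, ids):
        return self.head(self.emb(ids))

@torch.no_grad()
def offline_prepare(target, batches):
    """Offline mode: pay the target-model forward cost once and cache the activations."""
    return [target(b).cpu() for b in batches]

def train_step(draft, ids, target_hidden, optimizer, loss_fn=nn.MSELoss()):
    """One draft-model update against (cached or freshly computed) target activations."""
    loss = loss_fn(draft(ids), target_hidden.to(ids.device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    target, draft = ToyTarget(), ToyDraft()
    opt = torch.optim.AdamW(draft.parameters(), lr=1e-3)
    batches = [torch.randint(0, VOCAB, (2, 16)) for _ in range(3)]

    # Offline: cache first, then train without touching the target again.
    cache = offline_prepare(target, batches)
    for ids, h in zip(batches, cache):
        train_step(draft, ids, h, opt)

    # Online: recompute target activations every step (no cache, more compute per step).
    for ids in batches:
        with torch.no_grad():
            h = target(ids)
        train_step(draft, ids, h, opt)
```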