5x Inference Speedup, Unlocking Autoregressive Potential: Apple's New Work Lets LLMs Predict the Future
机器之心 · 2025-07-24 04:08
Core Viewpoint
- The article covers a new framework from Apple researchers that lets pre-trained autoregressive language models perform multi-token prediction, significantly improving inference speed while maintaining generation quality [7][8][9].

Group 1: Advances in Language Models
- Recent progress in language models is attributed to the availability of large-scale text data and the effectiveness of autoregressive training [2].
- Autoregressive models predict each token from the preceding context, which parallelizes well during training but makes inference expensive because tokens must be generated one at a time (see the decoding sketch after this summary) [5][6].

Group 2: New Framework Development
- Apple researchers developed a framework that enables pre-trained autoregressive language models to perform multi-token prediction, achieving up to 5.35x speedup on code and math tasks and roughly 2.5x on general tasks [7].
- The resulting cost reduction could make it practical to run capable real-time assistants smoothly on lightweight devices [9].

Group 3: Research Findings
- The researchers confirmed that language models can produce multiple tokens in a single inference step, a promising basis for faster generation [11].
- Starting from the question of whether a truly non-autoregressive language model can be trained, they designed a training algorithm that minimally alters existing autoregressive frameworks while enabling efficient multi-token generation [13][14].

Group 4: Experimental Results
- On the Tulu3-8B model, the proposed multi-token generation algorithm achieved speedups of roughly 1.5x to 5.2x across tasks, with the largest gains on programming and math [46].
- Introducing mask tokens together with a lightweight sampling module lets the model use its full depth and representational capacity, outperforming existing multi-token prediction methods (an illustrative sketch of this mechanism follows the summary) [23][24].

Group 5: Future Directions
- Future work could test whether the method also helps during pre-training or downstream task adaptation [53].
- Another promising direction is applying diffusion-based generation to multi-token prediction, aiming to balance efficiency and quality [53].
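To make the sequential-inference cost in Group 1 concrete, the minimal sketch below runs standard autoregressive greedy decoding with Hugging Face Transformers. The model name and prompt are illustrative placeholders (the article's experiments use Tulu3-8B); the point is that every new token requires its own forward pass, which is why inference cost grows with output length even though training is fully parallel.

```python
# Minimal sketch of standard autoregressive (one-token-per-step) decoding.
# Model name and prompt are illustrative; any causal LM would behave the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in; the article evaluates Tulu3-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(32):                       # one forward pass per new token
        logits = model(input_ids).logits      # [batch, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

A real decoding loop would reuse the KV cache instead of re-encoding the whole prefix, but the structural bottleneck is the same: 32 output tokens cost 32 sequential model calls.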
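The multi-token idea in Groups 2 and 4 can be pictured as appending k placeholder mask tokens to the prompt, letting a single forward pass produce hidden states at all k slots, and having a lightweight sampling head turn those states into k proposed tokens. The article does not publish implementation details, so the sketch below is only an illustrative reconstruction under that assumption; `mask_token_id`, `LightweightSampler`, and `propose_k_tokens` are hypothetical names, not Apple's API.

```python
# Illustrative sketch of multi-token prediction via appended mask tokens.
# All names here (mask_token_id, LightweightSampler, propose_k_tokens) are
# assumptions for illustration only; the article does not specify interfaces.
import torch
import torch.nn as nn


class LightweightSampler(nn.Module):
    """Small head mapping hidden states at mask positions to token logits."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)


def propose_k_tokens(model, sampler, input_ids, mask_token_id, k=4):
    """One forward pass proposes k future tokens instead of one.

    Producing k tokens per model call is what yields the ~1.5x-5.2x
    speedups the article reports, provided most proposals are kept.
    """
    batch = input_ids.size(0)
    masks = torch.full((batch, k), mask_token_id, dtype=torch.long)
    extended = torch.cat([input_ids, masks], dim=-1)

    with torch.no_grad():
        hidden = model(extended, output_hidden_states=True).hidden_states[-1]
        logits = sampler(hidden[:, -k:, :])    # logits at the k mask slots
        proposed = logits.argmax(dim=-1)       # greedy choice per slot

    return proposed                            # shape [batch, k]
```

Because the mask slots pass through every transformer layer, the proposals draw on the model's full depth rather than a shallow extra head, which matches the Group 4 claim about leveraging the model's full representational capacity.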
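Group 2 states that quality is preserved alongside the speedup. A common way to obtain such a guarantee with multi-token proposals is a speculative-decoding-style verification step; the article does not describe Apple's exact mechanism, so the loop below is a generic sketch of that standard technique, reusing the hypothetical `propose_k_tokens` from the previous snippet.

```python
import torch  # plus propose_k_tokens and the model/sampler from the sketch above


def generate_with_verification(model, sampler, input_ids, mask_token_id,
                               eos_token_id, max_new_tokens=128, k=4):
    """Generic verify-then-accept loop (speculative-decoding style).

    Proposals matching the base model's own greedy choices are kept; the
    first mismatch is replaced by the base model's token and the rest are
    dropped, so the output matches plain one-token-at-a-time greedy decoding.
    """
    generated = 0
    while generated < max_new_tokens:
        proposed = propose_k_tokens(model, sampler, input_ids, mask_token_id, k)

        # One forward pass over prompt + proposals scores every proposed slot.
        candidate = torch.cat([input_ids, proposed], dim=-1)
        with torch.no_grad():
            logits = model(candidate).logits
        # Base-model greedy predictions for each of the k proposed positions.
        checks = logits[:, -k - 1:-1, :].argmax(dim=-1)

        accepted = 0
        for i in range(k):
            if checks[0, i].item() == proposed[0, i].item():
                accepted += 1
            else:
                break

        # Keep the verified prefix plus one corrected token from the base model.
        new_tokens = torch.cat(
            [proposed[:, :accepted], checks[:, accepted:accepted + 1]], dim=-1
        )
        input_ids = torch.cat([input_ids, new_tokens], dim=-1)
        generated += new_tokens.size(1)
        if eos_token_id in new_tokens[0].tolist():
            break
    return input_ids
```

Under this kind of scheme the wall-clock gain tracks how many of the k proposals survive verification, which is consistent with the article's observation that the largest speedups appear on predictable domains such as code and math.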