9x Inference Speedup for Diffusion Language Models! Shanghai Jiao Tong University: KV Cache Is Not an Exclusive Trick of Autoregressive Models
量子位·2025-05-27 03:53

Core Viewpoint
- The article introduces dLLM-Cache, a novel inference caching mechanism for diffusion-based large language models (dLLMs) that significantly accelerates inference without compromising output quality [2][3][21].

Research Motivation
- Diffusion-based language models are emerging as a significant paradigm in language generation, showing advantages over autoregressive models (ARMs) on tasks such as the "reversal curse" and mathematical reasoning [8][10].
- However, dLLM inference typically requires hundreds of denoising steps, incurring substantial computational cost, and existing acceleration techniques such as KV Cache are incompatible with the bidirectional attention architecture of dLLMs [10][11].

Method Overview
- The authors observed that prompt-token features remain stable throughout the denoising process and can therefore be reused across steps, while only a small fraction of response tokens change significantly at each step [13][14].
- A V-verify mechanism efficiently identifies which response tokens need updating by measuring changes in their underlying value vectors, eliminating up to 75% of redundant computation (see the sketches after this summary) [16][17][20].

Experimental Results
- dLLM-Cache was rigorously evaluated on the LLaDA 8B and Dream 7B models across a range of benchmark tasks, delivering more than 5x faster inference while maintaining or slightly improving model performance [21][25].
- On specific tasks such as HotpotQA, dLLM-Cache achieved a 9.1x speedup with no loss of quality, demonstrating robust performance across different dLLM architectures [21][28].

General Applicability
- The method was successfully applied to different dLLM architectures, confirming its versatility and effectiveness in improving inference efficiency across models [25][28].
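To make the V-verify idea concrete, below is a minimal sketch of a value-based token selector, assuming per-token value projections `v_new` (freshly computed) and `v_old` (cached) of shape [num_response_tokens, d]. The function name, the cosine-similarity criterion as written, and the `update_ratio` default are illustrative assumptions for this sketch, not the authors' released code.

```python
import torch

def select_tokens_to_update(v_new: torch.Tensor,
                            v_old: torch.Tensor,
                            update_ratio: float = 0.25) -> torch.Tensor:
    """Return indices of response tokens whose value vectors drifted most.

    Tokens with the lowest similarity between the freshly projected value
    vectors and the cached ones are treated as having changed and are
    recomputed; the remaining tokens reuse their cached features.
    """
    # Per-token cosine similarity between new and cached value vectors.
    sim = torch.nn.functional.cosine_similarity(v_new, v_old, dim=-1)
    # Recompute only the fraction of tokens that changed the most
    # (lowest similarity); e.g. update_ratio=0.25 recomputes 25% of tokens,
    # in line with the "up to 75% redundant computation removed" figure.
    k = max(1, int(update_ratio * v_new.shape[0]))
    return torch.topk(-sim, k).indices
```

A usage note: the value projection is cheap relative to a full transformer forward pass, which is why drift in the value vectors can serve as an inexpensive proxy for deciding which tokens to recompute.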
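The second sketch shows how such a selector could sit inside a dLLM-Cache-style denoising loop, under the assumption that prompt features are refreshed on a long interval and response features on a shorter one, with adaptive partial updates in between. `encode_fn`, `value_fn`, `denoise_fn`, the `only=` argument, and the interval and ratio defaults are hypothetical placeholders standing in for the model's internals, not the paper's API or settings; it reuses `select_tokens_to_update` from the sketch above.

```python
import torch

@torch.no_grad()
def cached_denoise(encode_fn, value_fn, denoise_fn,
                   prompt_ids, response_ids, num_steps=256,
                   prompt_refresh_interval=100, response_refresh_interval=8,
                   update_ratio=0.25):
    # Prompt features are nearly static across denoising steps, so they are
    # computed once and only occasionally refreshed.
    prompt_feats = encode_fn(prompt_ids)
    response_feats = encode_fn(response_ids)
    cached_values = value_fn(response_ids)

    for step in range(1, num_steps + 1):
        if step % prompt_refresh_interval == 0:
            # Long-interval full refresh of prompt-token features.
            prompt_feats = encode_fn(prompt_ids)

        if step % response_refresh_interval == 0:
            # Short-interval full refresh of response-token features.
            response_feats = encode_fn(response_ids)
            cached_values = value_fn(response_ids)
        else:
            # Adaptive partial update: a cheap value projection plus a
            # V-verify-style check picks the drifted tokens; only those are
            # re-encoded (encode_fn is assumed to accept an index subset and
            # reuse cached context for the rest).
            v_new = value_fn(response_ids)
            idx = select_tokens_to_update(v_new, cached_values, update_ratio)
            response_feats[idx] = encode_fn(response_ids, only=idx)
            cached_values[idx] = v_new[idx]

        # One denoising step consumes the cached features and updates the
        # (partially masked) response tokens.
        response_ids = denoise_fn(prompt_feats, response_feats, step)
    return response_ids
```

The design intent captured here is that most per-step compute is skipped: full recomputation happens only at the refresh intervals, and on the remaining steps only the small drifted subset of response tokens is re-encoded.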