Core Insights

- The article discusses the rise of Diffusion Language Models (dLLMs) as a promising alternative to autoregressive (AR) models, highlighting their potential for parallel generation and improved controllability [2][3].

Training Efficiency: Leveraging Existing Models

- dLLMs require significant data and compute to train from scratch, making it crucial to leverage existing pre-trained AR models [7].
- The paper identifies two main strategies for improving training efficiency: "migration from AR to dLLM" and "architecture optimization" [8].
- Techniques such as Block Diffusion enable a hybrid approach that retains some AR structure while still allowing parallel processing within blocks, reducing adaptation cost [9].
- Architectural innovations such as encoder-decoder structures and Mixture of Experts (MoE) lower training cost by reusing features and reducing the parameters activated during inference [9].

Inference Acceleration: Parallel Decoding and Compression

- Inference speed is a critical challenge for dLLMs, since the diffusion process requires many iterative denoising steps [11].
- The paper categorizes inference acceleration strategies into "parallel decoding" and "compression techniques" [11].
- dLLMs can update multiple tokens simultaneously, but deciding which tokens to commit at each step is essential for efficiency [14].
- Compression methods, including fine-grained quantization, are tailored to the characteristics of the diffusion process and achieve extremely low-bit quantization [14].

KV Cache Management: Addressing Dynamic Challenges

- KV cache management is a key point of difference between dLLMs and AR models, since the sequence changes at every denoising step [15].
- The paper outlines three strategies for managing the KV cache: architectural adjustments, adaptive refresh techniques, and heuristic methods [18][19].
- Architectural adjustments such as Block Diffusion allow a fixed prefix with a dynamic suffix, while adaptive refresh techniques use token stability to minimize cache updates [18].
- Heuristic methods use model uncertainty to decide which cache entries to retain, improving efficiency without retraining [19].

Speculative Decoding: Self-Speculation and Collaborative Strategies

- Speculative decoding in dLLMs takes two distinctive forms: self-speculation and synergy with AR models [21][26].
- Self-speculation has the model predict its own intermediate states as drafts, while the collaborative approach combines the strengths of dLLMs and AR models for higher throughput [26].

Summary and Future Directions

- The paper calls for a unified evaluation standard for comparing efficiency across models, accounting for training cost and memory usage [24].
- Hardware-aware kernel optimizations are needed to translate theoretical acceleration into practical speedups [24].
- Multimodal integration in dLLMs is a promising avenue for future research and application [25].
- The article serves as a roadmap for the transition of dLLMs from academic exploration to industrial application, underscoring their growing relevance in high-quality, controllable generation scenarios [25].
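The Block Diffusion hybrid described under training efficiency keeps AR structure across blocks while denoising bidirectionally within each block. A minimal sketch of the resulting block-causal attention pattern (the function name and block size are illustrative, not from the paper):

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Build a block-causal attention mask: token i may attend to token j
    iff j's block is at or before i's block. Attention is bidirectional
    within a block (diffusion-style) and causal across blocks (AR-style)."""
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]

m = block_causal_mask(seq_len=4, block_size=2)
# Tokens 0 and 1 share a block: they see each other but not tokens 2-3.
assert m[0] == [True, True, False, False]
# Token 3 sees its own block and every earlier block.
assert m[3] == [True, True, True, True]
```

Because each block only attends to itself and earlier blocks, a completed block's keys and values never change again, which is what makes the fixed-prefix caching mentioned above possible.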
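The parallel-decoding idea, deciding which of the simultaneously predicted tokens to commit, is often implemented with a confidence threshold. A hedged sketch of one such step, where `predict_fn`, the `MASK` sentinel, and the threshold value are placeholders rather than the paper's API:

```python
MASK = "<mask>"

def parallel_decode_step(tokens, predict_fn, threshold=0.9):
    """One denoising step: commit all masked positions whose candidate
    token clears the confidence threshold; leave the rest masked."""
    # predict_fn returns {position: (candidate_token, confidence)}
    # for every still-masked position.
    candidates = predict_fn(tokens)
    committed = list(tokens)
    for pos, (tok, conf) in candidates.items():
        if conf >= threshold:
            committed[pos] = tok
    # Guarantee progress: if nothing cleared the threshold, commit the
    # single most confident candidate (a common fallback rule).
    if committed == list(tokens) and candidates:
        best = max(candidates, key=lambda p: candidates[p][1])
        committed[best] = candidates[best][0]
    return committed

def fake_predict(tokens):
    # Toy stand-in for a model pass: even positions are confident.
    return {i: (f"t{i}", 0.95 if i % 2 == 0 else 0.5)
            for i, t in enumerate(tokens) if t == MASK}

out = parallel_decode_step([MASK] * 4, fake_predict)
assert out == ["t0", MASK, "t2", MASK]
```

Raising the threshold trades fewer committed tokens per step (more steps) for fewer low-confidence mistakes, which is exactly the speed/quality dial the survey's parallel-decoding section describes.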
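The adaptive-refresh idea for the KV cache, recomputing entries only where the sequence actually changed between denoising steps, can be sketched as follows; the function name and the change-based stability rule are illustrative assumptions:

```python
def positions_to_refresh(prev_tokens, cur_tokens):
    """Return the indices whose cached K/V entries must be recomputed
    this step: exactly the positions whose token changed. Stable
    positions keep reusing their cached keys and values."""
    return [i for i, (p, c) in enumerate(zip(prev_tokens, cur_tokens))
            if p != c]

prev = ["The", "cat", "<mask>", "<mask>"]
cur  = ["The", "cat", "sat",    "<mask>"]
# Only position 2 changed, so only its K/V entry is refreshed.
assert positions_to_refresh(prev, cur) == [2]
```

A real implementation would also account for attention: a changed token can invalidate the cached values of positions that attend to it, which is why the survey discusses architectural fixes (block-wise causality) alongside these refresh heuristics.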
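Both speculative-decoding variants above share one skeleton: a cheap draft proposes several tokens at once, and a verifier accepts the longest prefix it agrees with. In self-speculation the same dLLM plays both roles; in the collaborative variant the draft and verifier are different models. A hedged sketch, with `verify_fn` and the toy verifier as illustrative placeholders:

```python
def accept_prefix(draft_tokens, verify_fn):
    """Keep the longest prefix of the draft that the verifier confirms;
    generation resumes from the first rejected position."""
    accepted = []
    for tok in draft_tokens:
        if verify_fn(accepted, tok):
            accepted.append(tok)
        else:
            break
    return accepted

# Toy verifier: accepts a token iff it matches a reference continuation.
reference = ["a", "b", "c", "d"]
verify = (lambda prefix, tok:
          len(prefix) < len(reference) and reference[len(prefix)] == tok)

# Draft diverges at the third token, so only the first two are kept.
assert accept_prefix(["a", "b", "x", "d"], verify) == ["a", "b"]
```

The throughput gain comes from verifying the whole draft in one parallel pass instead of generating token by token; the longer the accepted prefix, the fewer full passes are needed.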
The "Slimming" Evolution from Training to Inference: The First In-Depth Survey of Efficient Diffusion Language Models (dLLMs)
机器之心 · 2026-03-09 09:48