Topic: Post-training of discrete diffusion large language models
Inference efficiency surges 60x: DiDi-Instruct lets a diffusion LLM beat thousand-step GPT in 16 steps
机器之心· 2025-10-27 05:23
Core Insights
- The article introduces DiDi-Instruct, a post-training method for discrete diffusion large language models (dLLMs) that accelerates text generation by up to 60x relative to traditional GPT-style models and prior dLLMs [2][3].

Group 1: Research Background
- The inherent bottleneck of autoregressive models when generating long text imposes a latency ceiling, which motivated diffusion language models (dLLMs) that generate text in parallel [6].
- Existing dLLMs still need hundreds of iterations to match models like GPT-2, raising the question of whether a model can significantly outperform GPT with far fewer iterations [6][7].

Group 2: DiDi-Instruct Overview
- DiDi-Instruct is a post-training algorithm that distills a dLLM, cutting the number of inference steps from 1024 to just 8-16 while improving modeling performance (a few-step sampling sketch is given after this summary) [7].
- The core idea of DiDi-Instruct is to minimize the integral Kullback-Leibler (KL) divergence between a few-step "student" model and a many-step "teacher" dLLM (a schematic form of this objective appears after this summary) [7][10].

Group 3: Methodology Innovations
- DiDi-Instruct reformulates the distillation objective as a policy-gradient update, introducing a reward function to guide the student model (see the training-step sketch after this summary) [10].
- An auxiliary discriminator network distinguishes student outputs from teacher outputs, providing a precise reward signal for optimization [10].
- Key techniques for stable training and high-quality inference include grouped reward normalization and intermediate-state matching, which improve training stability and sample diversity [10].

Group 4: Experimental Results
- On the OpenWebText dataset, DiDi-Instruct achieved state-of-the-art (SOTA) results, with perplexity consistently better than the baseline models [14].
- The model improved perplexity by more than 30% over the best baseline while losing almost no entropy (about 1%) [14][16].
- Training DiDi-Instruct is highly efficient, taking only about 1 hour on a single NVIDIA H100 GPU, far less training time than competing methods [16].

Group 5: Cross-Domain Applicability
- The DiDi-Instruct framework is not limited to language models; it has also been applied to unconditional protein sequence generation, demonstrating its versatility [17].
- The distilled student model retains the ability to generate variable-length sequences while substantially lowering inference cost [17].

Group 6: Component Contributions
- Ablation studies show that intermediate-state matching is crucial for stability; removing it causes catastrophic performance degradation [19].
- The effect of regularization depends on the number of sampling steps: it stabilizes training at low step counts but can hinder performance at higher step counts [25].
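For readers who want the distillation objective in symbols, the integral KL bullet in Group 2 corresponds roughly to the schematic form below. The notation (student marginal p_{θ,t}, teacher marginal q_t, time weighting w(t)) is our own shorthand for illustration; the exact weighting and parameterization are defined in the DiDi-Instruct paper.

```latex
% Schematic integral KL objective (notation assumed for illustration):
% p_{\theta,t}: marginal of the few-step student at noise level t
% q_t:          marginal of the many-step teacher dLLM
% w(t):         a time-dependent weighting
\[
  \mathcal{D}_{\mathrm{IKL}}\!\left(p_\theta \,\Vert\, q\right)
  \;=\; \int_0^1 w(t)\,
  \mathbb{E}_{x_t \sim p_{\theta,t}}\!\left[
     \log \frac{p_{\theta,t}(x_t)}{q_t(x_t)}
  \right] \mathrm{d}t,
  \qquad
  \theta^\star \;=\; \arg\min_\theta \,
  \mathcal{D}_{\mathrm{IKL}}\!\left(p_\theta \,\Vert\, q\right).
\]
```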
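The policy-gradient reformulation, discriminator reward, and grouped reward normalization described in Group 3 can be pictured with the minimal PyTorch-style sketch below. Everything here is an assumption for illustration: the `student.sample`, `teacher.sample`, and `discriminator` interfaces are hypothetical, and the real DiDi-Instruct update differs in detail (for instance, it also matches intermediate noisy states rather than only final samples).

```python
# Hypothetical sketch of one DiDi-Instruct-style post-training step.
# Module names, interfaces, and shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def didi_instruct_step(student, teacher, discriminator, opt_student, opt_disc,
                       batch_size=32, group_size=8, seq_len=256, num_steps=16):
    # 1) Sample sequences from the few-step student (assumed API returns token
    #    ids plus per-token log-probabilities that carry gradients).
    x_student, logprobs = student.sample(batch_size, seq_len, num_steps,
                                         return_logprobs=True)
    # 2) Sample reference sequences from the frozen many-step teacher.
    with torch.no_grad():
        x_teacher = teacher.sample(batch_size, seq_len, num_steps=1024)

    # 3) Train the discriminator to tell student samples from teacher samples.
    logits_fake = discriminator(x_student.detach())
    logits_real = discriminator(x_teacher)
    d_loss = (F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
              + F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real)))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 4) Use the discriminator's logit as a density-ratio reward for the student.
    with torch.no_grad():
        reward = discriminator(x_student).squeeze(-1)

    # 5) Grouped reward normalization: center and scale rewards within small
    #    groups of samples to reduce gradient variance (our reading of the technique).
    reward = reward.view(-1, group_size)
    reward = (reward - reward.mean(dim=1, keepdim=True)) / (reward.std(dim=1, keepdim=True) + 1e-6)
    reward = reward.view(-1)

    # 6) Policy-gradient update: raise the log-probability of high-reward samples.
    pg_loss = -(reward * logprobs.sum(dim=-1)).mean()
    opt_student.zero_grad(); pg_loss.backward(); opt_student.step()
    return d_loss.item(), pg_loss.item()
```

The normalization in step 5 mirrors common variance-reduction practice in policy-gradient methods: rewards are standardized within small groups so that the raw scale of the discriminator's output does not destabilize the student update.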
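Finally, the 8-16 step inference regime mentioned in Group 2 can be illustrated with a generic masked-diffusion decoding loop. This is a sketch of a common dLLM sampling heuristic (iteratively unmasking the most confident tokens), not DiDi-Instruct's exact sampler; the model interface and `mask_id` are assumptions.

```python
# Minimal sketch of few-step parallel decoding for a masked diffusion LM.
# The model is assumed to return logits over the vocabulary for every position
# of a partially masked sequence; a distilled student would be plugged in here.
import torch

@torch.no_grad()
def few_step_sample(model, seq_len=256, num_steps=16, mask_id=0, device="cpu"):
    x = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)  # start fully masked
    for step in range(num_steps):
        still_masked = (x == mask_id)
        if still_masked.sum() == 0:
            break
        logits = model(x)                      # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)         # per-token confidence and argmax token
        # Unmask a fraction of the remaining tokens each step,
        # keeping the most confident predictions.
        n_unmask = max(1, int(still_masked.sum().item() / (num_steps - step)))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(n_unmask, dim=-1).indices
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```

With `num_steps=16`, the loop runs 16 parallel decoding passes instead of the roughly 1024 a non-distilled dLLM would need, consistent with the step reduction reported for DiDi-Instruct.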