Are Diffusion Language Models Really Better Than Autoregressive Models? Theoretical Analysis Suggests the Opposite May Be True
机器之心·2025-06-10 08:41

Core Insights
- The article examines the potential of diffusion language models (DLMs) for text generation, comparing them with autoregressive models (ARMs) and highlighting the efficiency paradox observed in practice [1][3][7].

Group 1: Model Comparison
- Autoregressive models generate text token by token, achieving high quality but limited by serial decoding speed, especially for long sequences [3].
- Diffusion language models, particularly Masked Diffusion Models (MDMs), can in principle sample multiple tokens in parallel, suggesting a potential efficiency gain [3][4].
- In practice, however, MDMs often need more sampling steps to match the accuracy of ARMs, which drives up inference cost [3][4][12] (a toy sketch of the decoding-cost contrast appears at the end of this summary).

Group 2: Evaluation Metrics
- The research emphasizes that any comparison between DLMs and ARMs depends heavily on the chosen evaluation metric [10][15].
- Two key metrics are introduced: Token Error Rate (TER) for token-level accuracy and Sequence Error Rate (SER) for whole-sequence correctness [10][11] (an illustrative computation of both appears below).
- MDMs can show efficiency advantages when evaluated on TER, but they struggle on SER, particularly in tasks that require logical consistency, such as mathematical reasoning [11][12][15].

Group 3: Practical Implications
- The findings suggest that DLMs may be better suited to tasks that prioritize fluency and throughput and can tolerate some sequence-level imperfection, such as creative writing [15].
- Conversely, for tasks demanding sequence-level accuracy and logical correctness, ARMs remain the better choice, because the number of sampling steps MDMs need grows linearly with sequence length [15][16].
- The research lays a theoretical foundation for understanding the comparative advantages and limitations of MDMs, indicating that the success of diffusion techniques in image generation does not directly carry over to language generation [16].
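To make the serial-versus-parallel contrast in Group 1 concrete, here is a toy sketch. It is not from the article: the `toy_predict` stand-in, the unmask-K-per-step schedule, and all constants are illustrative assumptions, meant only to show why an MDM can finish in fewer model calls than an ARM over the same length.

```python
import random

L = 16       # toy sequence length
MASK = -1    # placeholder for a masked position
K = 4        # tokens the masked-diffusion decoder unmasks per step (assumed)

def toy_predict(positions):
    """Stand-in for a model forward pass: returns a dummy token for each queried position."""
    return {i: random.randint(0, 9) for i in positions}

# Autoregressive decoding: L sequential model calls, one new token per call.
ar_seq, ar_calls = [], 0
for i in range(L):
    ar_seq.append(toy_predict([i])[i])
    ar_calls += 1

# Masked-diffusion decoding: start fully masked, unmask K positions per call.
mdm_seq, mdm_calls = [MASK] * L, 0
while MASK in mdm_seq:
    masked = [i for i, t in enumerate(mdm_seq) if t == MASK]
    for pos, tok in toy_predict(masked[:K]).items():
        mdm_seq[pos] = tok
    mdm_calls += 1

print(ar_calls, mdm_calls)  # 16 vs 4: fewer calls in the parallel schedule
```

The article's point is that this step count is not a free parameter: to match ARM-level accuracy, MDMs may have to reduce per-step parallelism until the number of steps approaches the sequence length itself.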
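The TER and SER of Group 2 can be read informally as a per-token mistake rate versus an all-or-nothing sequence check. The sketch below uses those plain-English readings; the paper's formal definitions may differ, and the function names are ours.

```python
from typing import List, Sequence

def token_error_rate(preds: List[Sequence[int]], refs: List[Sequence[int]]) -> float:
    """Fraction of token positions where the prediction disagrees with the reference."""
    errors = total = 0
    for p, r in zip(preds, refs):
        assert len(p) == len(r), "sketch assumes equal-length sequences"
        errors += sum(pt != rt for pt, rt in zip(p, r))
        total += len(r)
    return errors / total

def sequence_error_rate(preds: List[Sequence[int]], refs: List[Sequence[int]]) -> float:
    """Fraction of sequences that contain at least one wrong token."""
    wrong = sum(tuple(p) != tuple(r) for p, r in zip(preds, refs))
    return wrong / len(refs)

refs  = [[1, 2, 3, 4], [5, 6, 7, 8]]
preds = [[1, 2, 3, 4], [5, 6, 0, 8]]   # a single wrong token in the second sequence
print(token_error_rate(preds, refs))      # 0.125 -> TER looks good
print(sequence_error_rate(preds, refs))   # 0.5   -> SER is already high
```

One wrong token is enough to fail an entire sequence, which is why tasks such as multi-step mathematical reasoning stress SER far more than TER.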
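A rough way to see why low TER does not buy low SER, and why the sampling-step requirement in Group 3 scales with length: if token errors were independent (an assumption made here for illustration only; the article does not claim it), a per-token error rate $\varepsilon$ compounds over a length-$L$ sequence as shown below.

```latex
% Back-of-envelope compounding of token errors into a sequence error,
% assuming (for illustration only) independent per-token errors.
\mathrm{SER} \;=\; 1 - (1 - \varepsilon)^{L} \;\approx\; 1 - e^{-\varepsilon L}
% Example: \varepsilon = 0.01,\ L = 500 \;\Rightarrow\; \mathrm{SER} \approx 1 - 0.99^{500} \approx 0.993
```

Under this toy model, keeping SER bounded forces the per-token error rate to shrink roughly like $1/L$, which is loosely consistent with the article's claim that MDMs need a number of sampling steps growing linearly with sequence length to remain sequence-accurate.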