WeChat Trains a Diffusion Language Model: 3× Faster Than vLLM-Deployed AR Models, Over 10× in Low-Entropy Scenarios

Core Viewpoint
Tencent's WeChat AI team has introduced WeDLM (WeChat Diffusion Language Model), which achieves more than 3× acceleration over vLLM-deployed AR models on mathematical reasoning tasks, and more than 10× in low-entropy scenarios, while maintaining or even improving generation quality [2][4][13].

Group 1: Introduction and Background
- The dominant decoding paradigm for large language models is autoregressive (AR) generation, but its token-by-token nature limits inference efficiency. Diffusion language models (Diffusion LLMs) offer an alternative by recovering multiple masked tokens in parallel, yet existing models struggle to outpace optimized AR inference engines such as vLLM [3].
- The key obstacle is that most diffusion language models rely on bidirectional attention, which is incompatible with standard KV caching, so the advantage of parallel prediction never translates into actual speedups [4].

Group 2: WeDLM Model Insights
- WeDLM is the first diffusion language model to surpass equivalent AR models in inference speed when both run under an industrial-grade inference engine (vLLM) [4].
- Its core insight is that mask recovery does not require bidirectional attention: as long as each masked position can attend to all observed tokens, recovery works under standard causal attention [11].
- The work introduces Prefix Cacheability as a key metric: under KV-cached decoding, only tokens that form a contiguous left-to-right prefix can be cached and reused. Inference efficiency therefore depends less on how many tokens are predicted per step than on how many of those predictions can be converted into cacheable prefixes [11].

Group 3: Technical Solutions
- WeDLM uses Topological Reordering to preserve causal attention while letting masked positions see the complete observed context: all observed tokens are moved to the front of the physical sequence, while RoPE positional encoding preserves their logical positions [16].
- Dual-Stream Masking narrows the distribution gap between training and inference by pairing a clean "memory stream" with a masked "prediction stream" that share positional encodings [18].
- At inference time, Streaming Parallel Decoding commits each resolved prefix immediately instead of waiting for an entire block to complete [21].

Group 4: Performance Metrics
- On mathematical reasoning tasks, WeDLM achieves roughly 3× acceleration and clearly outperforms other diffusion models such as LLaDA and Dream in both accuracy and inference speed [13].
- In benchmark evaluations, WeDLM-8B scores an average of 74.72, surpassing Qwen3-8B by 2.1 points, with notable gains on mathematical reasoning benchmarks such as GSM8K and MATH [24].
- Across task scenarios, the model shows consistent speed advantages: 3-6× on structured mathematical-reasoning outputs, 2-3× on code generation, and over 10× on low-entropy tasks such as sequence counting [27].

Group 5: Conclusion
- WeDLM's contributions suggest that Prefix Cacheability should be a first-class design goal for parallel text generation. Future diffusion language models are best viewed as efficient multi-token prediction mechanisms, where the value of generating tokens in parallel depends on how quickly those tokens can be converted into cacheable prefixes [31].
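The Topological Reordering described in Group 3 can be sketched as follows. This is a minimal toy illustration, not WeDLM's actual implementation; the function name, the string token representation, and the `MASK` sentinel are all assumptions:

```python
# Sketch of topological reordering: observed tokens are moved to the
# front of the physical sequence so causal attention lets every masked
# position attend to all of them, while RoPE-style position ids keep
# each token's original (logical) position. Illustrative only.
MASK = "<mask>"

def topological_reorder(tokens):
    """Return (physical_tokens, position_ids) with observed tokens first."""
    observed = [(i, t) for i, t in enumerate(tokens) if t != MASK]
    masked = [(i, t) for i, t in enumerate(tokens) if t == MASK]
    ordered = observed + masked
    physical_tokens = [t for _, t in ordered]
    position_ids = [i for i, _ in ordered]  # logical positions fed to RoPE
    return physical_tokens, position_ids

seq = ["The", MASK, "sat", MASK, "the", "mat"]
toks, pos = topological_reorder(seq)
print(toks)  # ['The', 'sat', 'the', 'mat', '<mask>', '<mask>']
print(pos)   # [0, 2, 4, 5, 1, 3]
```

Because every observed token now physically precedes every masked one, a standard causal mask already grants each masked position the full observed context, and the observed prefix can live in the KV cache.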
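The Dual-Stream Masking idea from Group 3 can be pictured as a training-time sequence layout. The sketch below is an assumption built only from the summary's description (a clean memory stream plus a masked prediction stream sharing positional encodings); the real architecture may differ:

```python
# Sketch of dual-stream masking: the training input concatenates a
# clean "memory stream" with a masked "prediction stream". Both
# streams reuse the same logical position ids, so RoPE treats
# corresponding tokens as occupying the same positions. Hypothetical
# layout inferred from the article's description.
MASK = "<mask>"

def build_dual_stream(tokens, mask_positions):
    memory = list(tokens)  # clean context the model can attend to
    prediction = [MASK if i in mask_positions else t
                  for i, t in enumerate(tokens)]
    stream = memory + prediction
    position_ids = list(range(len(tokens))) * 2  # shared across streams
    return stream, position_ids

stream, pos = build_dual_stream(["a", "b", "c"], {1})
print(stream)  # ['a', 'b', 'c', 'a', '<mask>', 'c']
print(pos)     # [0, 1, 2, 0, 1, 2]
```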
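Streaming Parallel Decoding (Group 3) and Prefix Cacheability (Group 2) connect directly: only the contiguous left-to-right run of resolved tokens can be committed to the KV cache each step. A minimal sketch of that accept rule, with the confidence threshold and stopping criterion as assumptions:

```python
# Sketch of the prefix-commit step in streaming parallel decoding:
# after a parallel prediction pass, commit only the consecutive
# high-confidence tokens from the left; the first low-confidence
# position breaks the cacheable prefix and stays masked for the
# next step. Threshold value and rule are illustrative assumptions.
def commit_prefix(predictions, threshold=0.9):
    """predictions: list of (token, confidence) in logical order,
    starting at the first still-masked position. Returns the tokens
    that extend the committed (KV-cacheable) prefix this step."""
    committed = []
    for token, conf in predictions:
        if conf < threshold:
            break  # a low-confidence gap ends the contiguous prefix
        committed.append(token)
    return committed

preds = [("4", 0.99), ("2", 0.97), ("is", 0.55), ("the", 0.98)]
print(commit_prefix(preds))  # ['4', '2'] — stops at the 0.55 gap
```

This is why the summary argues that raw tokens-per-step matters less than how quickly parallel predictions convert into cacheable prefixes: the confident `"the"` above is predicted but cannot be cached until the gap before it resolves.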
