Dream
New findings on diffusion language models: is their computational potential being wasted?
机器之心· 2025-10-30 08:52
Core Insights
- The article discusses the limitations of the traditional left-to-right sampling method in large language models and introduces the masked diffusion language model (MDLM) as a potential alternative, highlighting its advantages across a range of tasks [1][5][15].

Group 1: MDLM Advantages
- MDLM supports arbitrary-order decoding and multi-token parallel decoding, which can improve performance on certain tasks such as Sudoku [1][4].
- However, recent findings indicate that on mathematical and coding tasks, arbitrary-order decoding often performs worse than left-to-right sampling, and multi-token parallel decoding significantly degrades performance [1][4][5].
- The study suggests that using an MDLM for left-to-right sampling is an efficient approach for reasoning and coding tasks, especially when block sizes are enforced to maintain a semi-autoregressive structure (a block-wise sampling sketch appears after this summary) [3][5].

Group 2: Prompting and In-filling Techniques
- The researchers propose a new prompting method, "prompting-as-in-filling," which lets users add context at multiple positions in the sequence rather than only at the beginning [6][18].
- They also introduce a "reasoning-as-in-filling" framework that uses a reasoning template to guide the model in generating reasoning trajectories under a given budget and format (a template sketch follows below) [6][18][19].

Group 3: Early Exit Mechanism
- The "reasoning-as-in-filling" method enables the model to quantify answer uncertainty during the reasoning process, allowing it to exit early and reduce computational cost (see the early-exit sketch below) [8][19].
- On the GSM8k dataset, for instance, this approach reduced the number of function calls by 24% without sacrificing accuracy [8].

Group 4: Multi-Token Entropy Decoding (MED)
- The researchers developed an adaptive multi-token decoder, MED, which performs parallel decoding only when the conditional entropy of the additional positions is below a set threshold, thereby bounding the deviation from single-token decoding (see the MED sketch below) [10][24].
- Experimental results show that MED reduces the number of function calls by a factor of 2-3 while maintaining performance [11][26].

Group 5: Post-Training Capabilities
- The study highlights that MDLM's in-filling capability unlocks new sampling and post-training mechanisms, enabling effective post-training without complex prompt designs or additional models [22][23].
- By sampling reasoning trajectories from the posterior distribution, researchers can improve the model's performance on reasoning tasks [22][23][33].

Group 6: Performance Metrics
- The article presents performance metrics showing that MED delivers significant speedups while maintaining accuracy across different datasets [26][30].
- The results indicate that early-exit mechanisms combined with MED further improve computational efficiency, particularly for the LLaDA model [31][32].
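As a rough illustration of the semi-autoregressive, block-wise left-to-right sampling described under Group 1, here is a minimal Python sketch. The model interface, `MASK_ID`, and `denoise_step` are hypothetical stand-ins, not the API of Dream or LLaDA: at each step the model scores all still-masked positions in the current block and commits the single most confident one.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id; real models define their own

def denoise_step(model, tokens, positions):
    """Score all still-masked `positions` with the (hypothetical) MDLM and
    commit the single most confident prediction."""
    logits = model(tokens)                        # [seq_len, vocab_size]
    probs = logits[positions].softmax(-1)         # predictions at masked slots
    conf, pred = probs.max(-1)                    # per-slot confidence / token
    best = conf.argmax()                          # most confident masked slot
    tokens[positions[best]] = pred[best]
    return tokens

def block_left_to_right_sample(model, prompt, gen_len=64, block=8):
    """Semi-autoregressive sampling: blocks are decoded strictly left to right,
    while positions inside a block are filled in confidence order."""
    tokens = torch.cat([prompt, torch.full((gen_len,), MASK_ID)])
    for start in range(len(prompt), len(tokens), block):
        block_pos = torch.arange(start, min(start + block, len(tokens)))
        for _ in range(len(block_pos)):           # one committed token per step
            masked = block_pos[tokens[block_pos] == MASK_ID]
            tokens = denoise_step(model, tokens, masked)
    return tokens

# Any callable mapping tokens to per-position logits exercises the control flow:
# model = lambda toks: torch.randn(len(toks), 32000)
# out = block_left_to_right_sample(model, torch.tensor([101, 2023, 102]))
```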
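The "reasoning-as-in-filling" template from Group 2 can be pictured as an input in which the reasoning trace and the final answer are reserved as masked spans for the MDLM to fill. A minimal sketch, assuming a Hugging-Face-style tokenizer with a `mask_token_id`; the template wording and the budget sizes are illustrative, not the paper's exact format:

```python
def build_infilling_template(tokenizer, question, reasoning_budget=256, answer_budget=16):
    """Lay out a reasoning-as-in-filling input: prompt text sits at several
    positions, and masked spans reserve room for the reasoning trace and the
    final answer (wording and budgets are illustrative only)."""
    mask = tokenizer.mask_token_id
    ids = tokenizer.encode(f"Question: {question}\nReasoning: ")
    ids += [mask] * reasoning_budget                  # trace to be in-filled
    ids += tokenizer.encode("\nThe answer is ")       # prompt text mid-sequence
    answer_slots = list(range(len(ids), len(ids) + answer_budget))
    ids += [mask] * answer_budget                     # answer to be in-filled
    return ids, answer_slots
```

Returning the answer positions explicitly is what makes the early-exit check below possible.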
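Because the answer slots sit at known positions in the template, the model's uncertainty about the answer can be measured after every denoising step, which is one way to realize the early-exit behavior described under Group 3. A sketch building on the helpers from the previous snippets (`denoise_step`, `MASK_ID`, the `answer_slots` list); the entropy threshold is an arbitrary placeholder, not a value reported in the article:

```python
import torch

def answer_entropy(model, tokens, answer_slots):
    """Mean per-position entropy over the reserved answer slots; a low value
    means the model is already confident about the final answer."""
    logits = model(tokens)                            # [seq_len, vocab_size]
    logp = logits[answer_slots].log_softmax(-1)
    return -(logp.exp() * logp).sum(-1).mean()

def reason_with_early_exit(model, tokens, reasoning_slots, answer_slots, threshold=0.5):
    """Fill the reasoning span one token at a time, but stop as soon as the
    answer uncertainty drops below `threshold` (an arbitrary placeholder)."""
    for pos in reasoning_slots:
        tokens = denoise_step(model, tokens, torch.tensor([pos]))
        if answer_entropy(model, tokens, answer_slots) < threshold:
            break                                     # skip remaining reasoning steps
    for _ in answer_slots:                            # then decode the answer itself
        masked = torch.tensor([p for p in answer_slots if tokens[p] == MASK_ID])
        tokens = denoise_step(model, tokens, masked)
    return tokens
```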
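Finally, the MED rule from Group 4 can be summarized as: always commit the most confident masked position, and additionally commit any other masked position whose predictive entropy falls below a threshold, so that the deviation from single-token decoding stays bounded. A minimal sketch under the same hypothetical model interface; the entropies here come from a single forward pass over the current partially decoded sequence, which is a simplification, and the threshold value is illustrative:

```python
import torch

def med_step(model, tokens, masked_positions, entropy_threshold=0.1):
    """One multi-token entropy decoding (MED) step: commit the single most
    confident masked position, plus any other masked position whose predictive
    entropy is below `entropy_threshold`. `masked_positions` is a 1-D index tensor."""
    logits = model(tokens)                                    # [seq_len, vocab_size]
    probs = logits[masked_positions].softmax(-1)
    conf, pred = probs.max(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # per-position entropy

    commit = entropy < entropy_threshold                      # low-uncertainty extras
    commit[conf.argmax()] = True                              # at least one token per step
    for i in torch.nonzero(commit).flatten():
        tokens[masked_positions[i]] = pred[i]
    return tokens, int(commit.sum())                          # tokens + #positions decoded
```

When the model is confident, several positions are committed per forward pass, which is where a 2-3x reduction in function calls can come from; when it is uncertain, the step reduces to ordinary single-token decoding.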