Chinese Team Ends the Token Crisis: Diffusion Models Offer Over Three Times the Data Potential of Autoregressive Models
量子位 · 2025-08-13 09:13
Core Viewpoint
- The article discusses the data-learning potential of diffusion language models (DLMs), highlighting their ability to outperform autoregressive models in data utilization and learning efficiency [1][4].

Group 1: Diffusion Language Models
- When the number of available tokens is limited, diffusion language models can achieve more than three times the data potential of autoregressive models [1].
- A diffusion model with 1 billion parameters, trained on 1 billion tokens for 480 epochs, reached 56% accuracy on HellaSwag and 33% on MMLU, without any data filtering or training tricks [5].
- The model's performance showed no sign of saturation even under this extreme repetition, indicating that it keeps extracting useful information from the same data [4].

Group 2: Learning Mechanisms
- The strong data-learning capability of diffusion language models is attributed to two main factors: the diffusion objective and bidirectional attention, which together exploit the data beyond a single left-to-right causal factorization [8][9] (minimal sketches of both follow at the end of this summary).
- Diffusion models invest more compute (FLOPs) during both training and inference, improving outputs through iterative refinement [11] (see the decoding sketch below).
- Whereas autoregressive models prioritize computational efficiency, diffusion models trade compute for data potential, which leads to better learning outcomes per token of data [14].

Group 3: Overfitting and Data Utilization
- The research team observed that the number of training epochs before overfitting sets in is positively correlated with the amount of unique data and negatively correlated with model size [18].
- Even after overfitting begins, performance on downstream tasks may keep improving: such benchmarks score the relative ranking of candidate answers, so a rising absolute validation loss does not necessarily imply worse relative performance [19][21] (see the numeric illustration below).
- Overconfidence on certain text segments after repeated exposure to limited training data may explain this pattern, inflating the absolute loss without degrading answer ranking [26][27].

Group 4: Future Research Directions
- The research team plans to use larger models and more unique data in future studies to further validate their findings and hypotheses regarding diffusion language models [28].
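
The sketches below are illustrative, not code from the paper. First, the diffusion objective from Group 2: one common formulation (masked discrete diffusion, as used by models such as LLaDA) samples a masking ratio per sequence, corrupts the tokens, and trains a bidirectional Transformer to recover them. The `TinyDenoiser` module, the toy sizes, and the 1/t loss weighting are simplifying assumptions for the sketch.

```python
# Minimal sketch of one masked-diffusion training step (illustrative only;
# the model, sizes, and weighting are simplifications, not the paper's setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, D_MODEL, SEQ_LEN = 1000, 999, 64, 32  # toy sizes

class TinyDenoiser(nn.Module):
    """Bidirectional Transformer: no causal mask, every position sees all."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, x):
        return self.head(self.encoder(self.emb(x)))

def diffusion_loss(model, tokens):
    """Sample masking ratio t per sequence, mask, predict the masked tokens."""
    t = torch.rand(tokens.size(0), 1)                  # masking ratio in (0, 1)
    masked = torch.rand(tokens.shape) < t              # which positions to hide
    corrupted = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    ce = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1),
                         reduction="none").view_as(tokens)
    # Loss only on masked positions; the 1/t weighting matches the usual
    # masked-diffusion variational objective.
    ce = ce * masked / t.clamp_min(1e-3)
    return ce.sum() / masked.sum().clamp_min(1)

model = TinyDenoiser()
batch = torch.randint(0, MASK_ID, (8, SEQ_LEN))        # random stand-in data
loss = diffusion_loss(model, batch)
loss.backward()
print(f"toy diffusion loss: {loss.item():.3f}")
```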
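
Second, the bidirectional-attention point: structurally, the difference from an autoregressive model is just the attention mask. With a causal mask, position i sees only positions j ≤ i; without one, each masked position is predicted from both left and right context, and every random re-masking of the same sequence poses a different prediction task, one intuition for the higher data potential.

```python
# Sketch: causal vs. bidirectional attention differ only in the mask.
import torch
import torch.nn.functional as F

L, D = 5, 16
q = k = v = torch.randn(1, 1, L, D)     # (batch, heads, seq, head_dim)

# Autoregressive: position i attends only to positions j <= i.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Diffusion denoiser: no mask, so a masked position gathers evidence
# from both its left and right context.
bidir_out = F.scaled_dot_product_attention(q, k, v)

print(causal_out.shape, bidir_out.shape)   # same shapes, different info flow
```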
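
Third, the extra inference FLOPs from Group 2: a masked-diffusion sampler repeatedly runs the full bidirectional model and commits its most confident predictions, so generating a sequence costs several full-sequence passes rather than one causal pass per token. The toy sampler below reuses `TinyDenoiser`, `MASK_ID`, and `SEQ_LEN` from the first sketch; the confidence-based unmasking schedule is an assumption, one of several common choices, and the untrained model's output is meaningless. The point is the compute pattern.

```python
# Toy iterative decoder: start fully masked, unmask the most confident
# positions each step. Reuses TinyDenoiser / MASK_ID / SEQ_LEN from above.
import torch

@torch.no_grad()
def iterative_decode(model, seq_len, steps=8):
    tokens = torch.full((1, seq_len), MASK_ID)         # all positions masked
    for step in range(steps):
        still_masked = tokens == MASK_ID
        if not still_masked.any():
            break
        logits = model(tokens)                         # full bidirectional pass
        logits[..., MASK_ID] = float("-inf")           # never emit the mask token
        conf, preds = logits.softmax(-1).max(-1)       # per-position confidence
        # Unmask a share of the remaining positions, most confident first.
        k = max(1, int(still_masked.sum()) // (steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices[0]
        tokens[0, idx] = preds[0, idx]
    return tokens

sample = iterative_decode(TinyDenoiser(), SEQ_LEN)
print(sample)
```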
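
Finally, Group 3's decoupling of loss and accuracy in concrete terms: multiple-choice benchmarks such as HellaSwag and MMLU score whether the correct option out-ranks the alternatives, so absolute likelihoods can deteriorate while the ranking gap widens. The numbers below are invented purely to illustrate that arithmetic.

```python
# Hypothetical numbers (not from the paper): absolute NLL of the correct
# option worsens across checkpoints, yet accuracy improves because the
# correct option starts out-ranking the best wrong option.
checkpoints = [
    # (label, log-prob of correct option, log-prob of best wrong option)
    ("epoch 100", -2.0, -1.8),   # wrong option ranked higher -> miss
    ("epoch 300", -2.6, -2.9),   # both less likely, correct now ranks first
    ("epoch 480", -3.1, -3.9),   # NLL still rising, ranking gap wider
]
for label, lp_correct, lp_wrong in checkpoints:
    pick = "correct" if lp_correct > lp_wrong else "wrong"
    print(f"{label}: NLL(correct)={-lp_correct:.1f}, model picks the {pick} option")
```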