Overfitting

Chinese Team Ends the Token Crisis: Diffusion Models' Data Potential Is Over Three Times That of Autoregressive Models
量子位· 2025-08-13 09:13
Core Viewpoint
- The article discusses the potential of diffusion language models (DLMs) as data learners, highlighting their ability to outperform autoregressive models in data utilization and learning efficiency [1][4]

Group 1: Diffusion Language Models
- Diffusion language models can achieve over three times the data potential of autoregressive models when the token supply is limited [1]
- A 1-billion-parameter diffusion model trained on 1 billion tokens for 480 epochs reached 56% accuracy on HellaSwag and 33% on MMLU, without any data filtering or tricks [5]
- The model's performance showed no saturation even under extreme repetition, indicating that it can keep extracting useful information from the same data [4]

Group 2: Learning Mechanisms
- The strong data-learning capability of diffusion language models is attributed to two main factors: the diffusion objective and bidirectional attention, which let the model exploit the data beyond a fixed left-to-right causal factorization (a minimal sketch of this objective follows below) [8][9]
- Diffusion models invest more compute (FLOPs) during training and inference, improving performance through iterative refinement [11]
- Unlike autoregressive models, which prioritize computational efficiency, diffusion models trade compute for data potential, leading to better learning outcomes [14]

Group 3: Overfitting and Data Utilization
- The research team observed that the number of training epochs before overfitting sets in is positively correlated with the amount of unique data and negatively correlated with model size [18]
- Even after overfitting occurs, performance on downstream tasks may continue to improve, suggesting that absolute loss values do not directly translate into relative performance changes [19][21]
- Overconfidence on certain text segments after repeated exposure to the limited training data may explain the observed performance trends [26][27]

Group 4: Future Research Directions
- The research team plans to use larger models and more unique data in future studies to further validate their findings and hypotheses about diffusion language models [28]
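To make the Group 2 claim concrete, here is a minimal sketch (in PyTorch, not the authors' code) of an absorbing-state masked-diffusion training step: a random mask ratio is drawn per sequence, a bidirectional Transformer predicts the masked positions from context on both sides, and the loss is reweighted by the mask ratio. Because every position can be a target under many different corruption patterns, repeated epochs over the same tokens keep presenting new prediction problems, unlike the single left-to-right factorization an AR model sees. The names `model`, `mask_id`, and the 1/t loss weighting are assumptions about a typical masked-diffusion setup, not details confirmed by the article.

```python
# Minimal sketch of an absorbing-state (masked) diffusion LM training step.
# Assumes a bidirectional Transformer `model(ids) -> logits` and a reserved
# [MASK] token id; all names and shapes are illustrative.
import torch
import torch.nn.functional as F

def diffusion_lm_step(model, ids, mask_id):
    """Compute one masked-diffusion loss over a batch of clean token ids.

    ids: LongTensor of shape (batch, seq_len).
    """
    b, n = ids.shape
    # Sample a corruption level t ~ U(0, 1) per sequence; any position may
    # become a prediction target, not just the "next" token.
    t = torch.rand(b, 1, device=ids.device).clamp(min=1e-3)
    # Mask each token independently with probability t (absorbing state).
    masked = torch.rand(b, n, device=ids.device) < t
    noisy = torch.where(masked, torch.full_like(ids, mask_id), ids)

    # Bidirectional attention: the model conditions on unmasked context
    # on both sides of every masked position.
    logits = model(noisy)  # (batch, seq_len, vocab)

    # Cross-entropy only on masked positions, reweighted by 1/t so the
    # objective bounds the negative log-likelihood.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), ids.view(-1), reduction="none"
    ).view(b, n)
    loss = ((ce * masked) / t).sum() / masked.sum().clamp(min=1)
    return loss
```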
Token crisis solved? Diffusion models show 3x the data potential of autoregressive models, and performance keeps climbing after 480 training repetitions
机器之心· 2025-08-10 04:31
机器之心 report; editor: 杜伟

Diffusion language models (DLMs) are exceptionally strong data learners. Is the token crisis finally about to disappear?

Recently, Jinjie Ni, an AI researcher at the National University of Singapore, and his team took a key step toward resolving the token crisis.

One of the challenges facing the continued development of large language models (LLMs) is that the supply of high-quality training text (tokens) is nearly exhausted, becoming the key bottleneck limiting further gains in model performance. Moreover, new sources of high-quality data are scarce and expensive to acquire, and become even scarcer after deduplication. As model scale keeps growing and the required data volume multiplies according to scaling laws, the result is a crisis of "not enough high-quality tokens to train on."

To address this, the team pretrained diffusion language models (DLMs) and autoregressive (AR) models from scratch, at scales of up to 8 billion parameters, 480 billion tokens, and 480 epochs.

The study reports three key findings:

In addition, the team dissected serious methodological flaws in the concurrent study "Diffusion Beats Autoregressive in Data-Constrained Settings", aiming to jointly raise the bar for open review.

Jinjie Ni described his team's conclusions and research methods in detail on X; what follows ...
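As a rough illustration of why scaling laws produce this data crunch (not a calculation from the article): under the Chinchilla heuristic of roughly 20 training tokens per parameter, a fixed pool of unique tokens must be repeated for many epochs once models grow large. The unique-token budget below is an assumed, illustrative figure.

```python
# Back-of-the-envelope view of the "token crisis" under data repetition.
# The 20-tokens-per-parameter rule is the Chinchilla heuristic; the unique-token
# pool size is illustrative, not a number reported in the article.

def epochs_needed(params: float, unique_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Epochs over a fixed corpus needed to reach a compute-optimal token count."""
    target_tokens = params * tokens_per_param
    return target_tokens / unique_tokens

# Example: an 8B-parameter model with only 1B unique tokens available,
# i.e. the heavily repeated regime the study probes (up to 480 epochs).
print(epochs_needed(params=8e9, unique_tokens=1e9))  # -> 160.0 epochs of repetition
```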