Autoregressive (AR) models
Token crisis solved? Diffusion models show 3x the data potential of autoregressive models, with performance still climbing after 480 epochs of repeated training
机器之心· 2025-08-10 04:31
Core Viewpoint
- The article discusses advances in diffusion language models (DLMs) as superior data learners compared with autoregressive (AR) models, particularly in data-constrained settings [1][8].

Group 1: Token Crisis and Research Findings
- The research addresses the impending token crisis for large language models (LLMs): the supply of high-quality training text is running out, which limits further gains in model performance [2][3].
- The team pre-trained DLMs and AR models from scratch, reaching a maximum scale of 8 billion parameters and 480 billion tokens [3][4].

Group 2: Performance Comparison
- When tokens are limited, DLMs outperform AR models, demonstrating more than three times the data potential [5][8].
- A DLM trained on 1 billion tokens reached 56% accuracy on the HellaSwag benchmark and 33% on MMLU, significantly surpassing AR models [14].

Group 3: Repeated Training Benefits
- Repeated training on the same dataset improves performance, and DLMs show no sign of saturation even after up to 480 epochs of repeated training [14][19].
- The study indicates that DLMs extract more effective information from a fixed dataset, leading to better benchmark results [14][19].

Group 4: Mechanisms Behind DLMs' Superiority
- DLMs use bidirectional modeling, which lets them extract more information from web data than the purely causal modeling of AR models (see the illustrative sketch after this summary) [19][22].
- DLMs are described as "super-dense models" that convert their higher computational density into greater intelligence [22][24].

Group 5: Methodological Critique of Related Research
- The article critiques a concurrent study, pointing out methodological flaws that may skew its conclusions about DLMs versus AR models [25][30].
- It emphasizes that the loss reported in that study does not accurately represent model likelihood, which can lead to misleading comparisons (see the likelihood note at the end) [26][32].
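
To illustrate the bidirectional-versus-causal distinction from Group 4, below is a minimal PyTorch sketch, not the authors' code: the model sizes, the MASK_ID token, and both loss functions are illustrative assumptions. The AR objective predicts each token from left context only, while a masked-diffusion-style objective masks random positions and reconstructs them from context on both sides.

```python
# Illustrative sketch only (not from the article): contrasts an AR next-token
# objective (causal mask) with a masked-diffusion-style objective (no causal
# mask, bidirectional context). All sizes and IDs are arbitrary assumptions.
import torch
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1000, 999, 64
embed = torch.nn.Embedding(VOCAB, DIM)
encoder = torch.nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
head = torch.nn.Linear(DIM, VOCAB)

def ar_loss(tokens):
    # Next-token prediction: an additive causal mask hides all future positions.
    x = embed(tokens[:, :-1])
    n = x.size(1)
    causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
    h = encoder(x, src_mask=causal)
    return F.cross_entropy(head(h).transpose(1, 2), tokens[:, 1:])

def masked_diffusion_loss(tokens):
    # Mask a random fraction of positions, then predict the masked tokens
    # from bidirectional context (note: no causal mask on the encoder).
    t = torch.rand(tokens.size(0), 1) * 0.75 + 0.25   # per-sequence mask rate
    masked = torch.rand(tokens.shape) < t
    corrupted = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    logits = head(encoder(embed(corrupted)))
    return F.cross_entropy(logits[masked], tokens[masked])

batch = torch.randint(0, VOCAB - 1, (2, 16))          # toy data, not real text
print(ar_loss(batch).item(), masked_diffusion_loss(batch).item())
```

The key difference is the absence of a causal mask in the diffusion branch: every prediction there conditions on both past and future tokens of the corrupted sequence, which is the mechanism the article credits for extracting more signal per token of data.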
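
To make the likelihood critique in Group 5 concrete, here is a hedged sketch, not taken from the article, assuming the standard masked-diffusion objective with a linear masking schedule: the AR cross-entropy equals the exact negative log-likelihood, whereas the DLM training loss only upper-bounds it, so treating both raw loss values as likelihoods biases the comparison.

```latex
% AR models: the chain-rule factorization makes the cross-entropy loss
% equal to the exact negative log-likelihood.
-\log p_\theta^{\text{AR}}(x) = -\sum_{i=1}^{L} \log p_\theta\!\left(x_i \mid x_{<i}\right)

% Masked diffusion models (linear masking schedule assumed): the training
% loss is an evidence bound, i.e. only an upper bound on the NLL, so its
% raw value is not directly comparable to the AR cross-entropy above.
-\log p_\theta^{\text{DLM}}(x) \;\le\;
  \mathbb{E}_{t \sim \mathcal{U}(0,1)}\,\mathbb{E}_{x_t}\!\left[
    \frac{1}{t} \sum_{i:\, x_{t,i} = \texttt{[MASK]}} -\log p_\theta\!\left(x_i \mid x_t\right)
  \right]
```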