The "Poison" and "Cure" of Synthetic Data: What New Solutions Exist for Model Collapse?
机器之心· 2025-08-30 01:30
Group 1
- The core viewpoint of the article highlights advances in synthetic-data research, particularly in understanding the mechanisms by which models collapse during self-training on synthetic data, and in establishing application processes across the various stages of model development [1]

Group 2
- Research over the past year has produced new findings on the "toxicity" of synthetic data: model collapse occurs during iterative training, gradually polluting the training dataset [5]
- In the early stage of collapse, models begin to lose information about the tails of the distribution (low-probability events); in the late stage, models converge to outputs that bear little resemblance to the original data distribution [6][7]
- Whether this collapse occurs is influenced by model design, the learning process, and the quality of the data used [7]
- A wide range of generative models, including language models, Variational Autoencoders (VAEs), and Gaussian Mixture Models (GMMs), are prone to collapse [8]
- However, some researchers argue that the risks of model collapse may be overstated, suggesting that maintaining a certain proportion of real data and following proper training processes can mitigate these issues [4][5]

Group 3
- Despite the risks associated with model collapse, synthetic data plays an irreplaceable role in model training, prompting the industry to propose a systematic framework for generating and applying synthetic data [9]
- A table summarizing the usage of synthetic data across stages of model training is referenced, indicating its significance in pre-training, fine-tuning, post-training, and evaluation [10]
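The early-stage tail loss described above can be reproduced in a toy setting: repeatedly fit a Gaussian to samples drawn from the previous generation's fit. The following is a minimal sketch, not any paper's actual experiment — the Gaussian model, sample sizes, and the `expected_variance` helper are illustrative assumptions:

```python
import random


def refit_mle(samples):
    """Fit a Gaussian by maximum likelihood (variance uses the 1/n denominator)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var


def collapse_trajectory(n_samples=100, generations=200, seed=0):
    """Sample from the current fit, refit, repeat.

    Finite-sample estimation error compounds across generations, so the
    fitted variance tends to shrink: the model gradually forgets the tails
    (low-probability events) of the original distribution.
    """
    rng = random.Random(seed)
    mu, var = 0.0, 1.0  # the "real" data distribution N(0, 1)
    trajectory = []
    for _ in range(generations):
        samples = [rng.gauss(mu, var ** 0.5) for _ in range(n_samples)]
        mu, var = refit_mle(samples)
        trajectory.append(var)
    return trajectory


def expected_variance(n_samples, generations):
    """Expected fitted variance: the 1/n MLE shrinks variance by (n-1)/n per generation."""
    return ((n_samples - 1) / n_samples) ** generations
```

With 100 samples per generation, the expected fitted variance after 200 generations is (99/100)^200 ≈ 0.13, i.e. most of the tail mass is gone. Mixing a fixed proportion of fresh real data into each generation's training set counteracts this drift, in line with the mitigation noted above.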
ICML 2025 | How to Avoid Model Collapse When Synthesizing Text Data?
机器之心· 2025-05-14 04:36
With the rapid development of generative AI, synthetic data is becoming an increasingly important component of large-model training. Future GPT-series language models will inevitably rely on large-scale corpora mixing human-produced and synthetic data.

However, this trend also brings a serious challenge: used without control, synthetic data can trigger "model collapse" (Model Collapse). Mixing a relatively high proportion of synthetic data into even a single round of training can cause a sharp drop in model performance and a failure to generalize to real-world data.

$$\mathbb{E}_{\mathrm{test}}^{\mathrm{collapse}}=\frac{\sigma^{2}d}{T-d-1}\cdot n\qquad(1)$$

Recently at ICML 2025, a research team from Shanghai Jiao Tong University and other institutions systematically analyzed this problem and proposed an innovative data-generation strategy, Token-Level Editing, designed to effectively avoid model collapse.

Paper title: HOW TO SYNTHESIZE TEXT DATA WITHOUT MODEL COLLAPSE?

Paper link: https://arxiv.org/pdf/2412.14689

Unlike using generated data directly, this method introduces fine-grained " ...
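Equation (1) predicts that excess test error grows linearly in the number of self-training iterations n. Reading σ² as label-noise variance, d as input dimension, and T as the per-round sample count (an interpretation in a linear-regression setting, not confirmed by the excerpt above), this growth can be checked with a small Monte-Carlo sketch — the constants and function names below are illustrative assumptions:

```python
import numpy as np


def collapse_curve(d=10, T=200, sigma=0.5, iterations=8, trials=50, seed=0):
    """Iteratively refit least squares on labels produced by the previous
    model plus fresh Gaussian noise; return the mean excess test error
    ||w_t - w_true||^2 after each iteration.

    Each round adds an independent estimation error of expected size
    sigma^2 * d / (T - d - 1), so the curve grows roughly linearly in the
    iteration count, matching the n-dependence in equation (1).
    """
    rng = np.random.default_rng(seed)
    errors = np.zeros(iterations)
    for _ in range(trials):
        w_true = rng.standard_normal(d)
        w = w_true.copy()
        for t in range(iterations):
            X = rng.standard_normal((T, d))
            y = X @ w + sigma * rng.standard_normal(T)  # synthetic labels
            w = np.linalg.lstsq(X, y, rcond=None)[0]    # refit on own outputs
            errors[t] += np.sum((w - w_true) ** 2) / trials
    return errors


def predicted_error(d, T, sigma, n):
    """The closed-form rate from equation (1): sigma^2 * d / (T - d - 1) * n."""
    return sigma ** 2 * d / (T - d - 1) * n
```

With the defaults above, the predicted excess error after 8 rounds is 0.25 · 10 / 189 · 8 ≈ 0.106, and the simulated curve grows roughly linearly toward it, illustrating why capping the number of purely synthetic retraining rounds (or anchoring on real data) matters.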