ICML 2025 | How to Avoid Model Collapse When Synthesizing Text Data?

机器之心 · 2025-05-14 04:36

Core Insights
- The rapid development of generative AI has made synthetic data an essential ingredient for training large models such as the GPT series. However, uncontrolled use of synthetic data can cause "model collapse," significantly degrading model performance and generalization to real-world data [1][2][6].

Group 1: Challenges of Synthetic Data
- "Non-iterative collapse": when a high proportion of synthetic data is mixed into the training set, model performance drops significantly even within a single pre-training run, without any iterative retraining loop [6].
- Compared with human-generated data, synthetic data has two structural defects: it lacks low-frequency, long-tail samples, which limits coverage of linguistic diversity, and its linguistic features are over-concentrated, which raises the risk of overfitting [13]. (A small sketch for measuring both defects appears after the editing example below.)

Group 2: Token-Level Editing Method
- Instead of generating entire passages, Token-Level Editing applies fine-grained "micro-edits" to real data, producing more stable and generalizable "semi-synthetic" data and thereby mitigating the risk of model collapse [3][10].
- The editing process preserves the long-tail structure of the original data and adjusts only tokens the model is "overconfident" about, so coverage of the real data distribution is retained and feature over-concentration is avoided [11][15]. A sketch of the operation follows.
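To make the operation concrete, here is a minimal sketch of token-level editing, assuming a Hugging Face causal LM. GPT-2 as the prior model, the 0.99 confidence threshold, and top-k resampling are illustrative assumptions of this sketch, not the paper's exact recipe:

```python
# Minimal sketch of token-level editing (illustrative, not the authors' code):
# keep the real text, and resample only the tokens a prior LM is
# "overconfident" about, yielding "semi-synthetic" data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed prior model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_level_edit(text: str, threshold: float = 0.99, top_k: int = 50) -> str:
    ids = tokenizer(text, return_tensors="pt").input_ids      # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                            # (1, T, V)
    # Probability the model assigns to each observed token, given its prefix.
    probs = torch.softmax(logits[0, :-1], dim=-1)             # (T-1, V)
    observed = ids[0, 1:]                                     # (T-1,)
    p_obs = probs.gather(-1, observed.unsqueeze(-1)).squeeze(-1)

    edited = ids[0].clone()
    for t in range(observed.numel()):
        if p_obs[t] >= threshold:                  # overconfident token
            # Replace just this token, sampled from the model's top-k choices.
            top = torch.topk(probs[t], top_k)
            j = torch.multinomial(top.values, 1).item()
            edited[t + 1] = top.indices[j]
    return tokenizer.decode(edited, skip_special_tokens=True)

print(token_level_edit("Paris is the capital of France."))
```

Because only high-probability tokens are touched, rare tokens (the long tail) survive unchanged, which is the mechanism behind the distribution-coverage claim above.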
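Relatedly, the two structural defects cited in Group 1 can be checked empirically. Below is a minimal sketch assuming whitespace tokenization; the toy corpora and both diagnostics (hapax rate as a proxy for long-tail coverage, unigram entropy as a proxy for feature concentration) are assumptions of this sketch, not measurements from the paper:

```python
# Sketch: proxy diagnostics for the two structural defects of synthetic data.
# Hapax rate = share of token types occurring exactly once (long-tail coverage);
# unigram entropy gauges feature concentration. Corpora here are toy placeholders.
import math
from collections import Counter

def distribution_stats(texts: list[str]) -> tuple[float, float]:
    counts = Counter(tok for line in texts for tok in line.split())
    total = sum(counts.values())
    hapax_rate = sum(1 for c in counts.values() if c == 1) / len(counts)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return hapax_rate, entropy

human = ["an odd coinage surfaces here once", "quirky rare words appear again"]
synthetic = ["the model says the same thing", "the model says the same thing"]

for name, corpus in [("human", human), ("synthetic", synthetic)]:
    hapax, ent = distribution_stats(corpus)
    print(f"{name}: hapax rate={hapax:.2f}, unigram entropy={ent:.2f} bits")
```

On real corpora, a markedly lower hapax rate and entropy on the synthetic side would reflect the missing long tail and over-concentrated features described above.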
Group 3: Theoretical Results
- The test error of the Token-Level Editing process has a finite upper bound, and the error does not grow with the number of iterations, so model collapse is prevented [12][16].
- The theoretical framework shows that even under multi-round training, Token-Level Editing mathematically rules out unbounded error growth, establishing a "theoretically non-collapsing" data-augmentation path [16]. The contrast with fully synthetic retraining can be written schematically, as below.
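The qualitative contrast behind this result can be sketched in the linear-regression setting commonly used in model-collapse analyses. The symbols and constants below are schematic placeholders rather than the paper's exact theorem: n is the number of training generations, d the feature dimension, T the sample size per generation, and σ² the label-noise variance.

```latex
% Schematic contrast (placeholder constants, not the paper's exact statement):
% fully synthetic retraining accumulates error across generations n, while
% token-level editing admits an n-independent bound.
\[
  \mathbb{E}\!\left[\mathrm{Err}^{(n)}_{\text{synthetic}}\right]
    \;\asymp\; n \cdot \frac{\sigma^{2} d}{T},
  \qquad
  \sup_{n}\; \mathbb{E}\!\left[\mathrm{Err}^{(n)}_{\text{edited}}\right]
    \;\le\; C \cdot \frac{\sigma^{2} d}{T},
  \quad C \text{ independent of } n.
\]
```

The left-hand expression mirrors prior analyses in which error grows with each generation of purely synthetic retraining; the right-hand bound captures the paper's claim that editing keeps the error finite for all n.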
Group 4: Experimental Validation
- The effectiveness of Token-Level Editing was validated through systematic experiments across three key stages of language-model training: pre-training, continual pre-training, and supervised fine-tuning [17].
- In pre-training, models trained on edited data outperformed those trained on purely synthetic data, with an average gain of +0.36 percentage points across benchmarks such as PIQA, BoolQ, and Winogrande [18].
- In continual pre-training, clear cross-domain generalization gains were observed, e.g., a +13.6% accuracy increase on the PubMedQA task [18].
- In supervised fine-tuning, the method showed strong robustness on complex tasks, with LLaMA-3 improving by +0.4% to +0.5% on average [18].