Core Insights

- The article challenges the conventional assumption that language models require "clean data" for effective training, arguing that exposure to noisy, imperfect data can still yield strong language capabilities [1][2].

Group 1: Research Findings

- Researchers from Peking University deliberately injected random noise into training data and found that models tolerated up to 20% "garbage data" with minimal impact on performance: the Next-token Prediction (NTP) loss rose by less than 1% [2][4].
- The experiments used the OpenWebText dataset with random noise injected at ratios from 1% to 20%; even at the highest ratio, the model's predictive loss remained largely stable [3][4] (a minimal sketch of this setup appears after this summary).
- The relationship between noise and model performance proved more complex than the loss alone suggests, motivating a new method, "Local Gradient Matching" (LGM), to improve model robustness in noisy environments [2][10].

Group 2: Theoretical Analysis

- The analysis argues that noise barely shifts the global minimum of the NTP loss, even at high noise levels, because the probability of meaningful text appearing within random noise is vanishingly small [6][7] (a hedged reconstruction of this argument appears after this summary).
- The assumptions extend to multilingual models: different languages can be viewed as mutual noise, so mixing them need not degrade the performance of any individual language [9].

Group 3: Practical Implications

- Although the pre-training loss changed only slightly, accuracy on downstream tasks declined, exposing a "loss-performance decoupling" phenomenon in which pre-training metrics do not fully capture model capabilities [10].
- The proposed LGM method improves noise resistance by constraining the difference between gradients computed on original and perturbed features, keeping decisions consistent under noise [10][12] (a sketch of such a regularizer appears after this summary).
- Experiments across multiple natural language understanding and visual classification datasets confirmed that LGM significantly improves the performance of noise-affected models [11][12].

Group 4: Future Directions

- The research offers a new perspective on large-scale pre-training: retaining some random noise can reduce data-cleaning costs, which is especially relevant for resource-constrained teams [15].
- Future work will explore the dynamic relationship between noise type and model capacity, as well as applying LGM to other modalities [14].
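The noise-injection setup in Group 1 can be illustrated with a short experiment. The sketch below replaces a fraction of token IDs with uniformly random IDs and reports the NTP loss; note the paper pre-trains on noisy corpora, whereas this probe only evaluates loss on corrupted text as a lightweight proxy. The choice of GPT-2 and all function names are assumptions, not the authors' released code.

```python
# Minimal proxy for the noise-injection experiment: corrupt a fraction of
# token IDs with uniformly random IDs and report the next-token prediction
# (NTP) loss. Model choice (GPT-2) and function names are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def inject_random_noise(token_ids, noise_ratio, vocab_size):
    """Replace a `noise_ratio` fraction of tokens with uniform random IDs."""
    noisy = token_ids.clone()
    mask = torch.rand(token_ids.shape) < noise_ratio
    noisy[mask] = torch.randint(0, vocab_size, (int(mask.sum()),))
    return noisy

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Language models are trained to predict the next token in a sequence."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    for ratio in (0.0, 0.05, 0.20):  # clean, 5% noise, 20% noise
        noisy_ids = inject_random_noise(ids, ratio, model.config.vocab_size)
        # Passing labels=input_ids makes the model return the NTP loss.
        loss = model(noisy_ids, labels=noisy_ids).loss
        print(f"noise={ratio:.0%}  NTP loss={loss.item():.3f}")
```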
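The Group 2 argument can be made concrete with a simple loss decomposition. This is a hedged reconstruction under assumed notation (α for the noise fraction, 𝓛 for the NTP loss); the paper's formal statement may differ.

```latex
% Assumed notation: \alpha = noise fraction, \mathcal{L} = NTP loss.
% Training on a mixture of clean text and random noise decomposes the loss:
\mathcal{L}(\theta)
  = (1-\alpha)\,\mathcal{L}_{\mathrm{clean}}(\theta)
  + \alpha\,\mathcal{L}_{\mathrm{noise}}(\theta)
% Uniformly random token sequences almost never form meaningful text, so
% \mathcal{L}_{\mathrm{noise}} is nearly flat in \theta: no parameter
% setting predicts random tokens much better than chance. The minimizer of
% \mathcal{L} therefore stays close to that of \mathcal{L}_{\mathrm{clean}},
% even at \alpha = 0.2.
```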
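The LGM idea from Group 3 amounts to a gradient-matching regularizer. Below is a minimal PyTorch sketch consistent with the description (matching gradients on original vs. perturbed features); the exact loss form, the Gaussian perturbation, sigma, and all names are assumptions rather than the authors' implementation.

```python
# Sketch of a Local Gradient Matching (LGM)-style regularizer: penalize the
# distance between loss gradients w.r.t. clean features and w.r.t. perturbed
# features, encouraging decisions that stay consistent under noise. Loss
# form, Gaussian perturbation, and sigma are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def lgm_regularizer(head, features, labels, sigma=0.1):
    """Squared L2 gap between feature-gradients on clean vs. noisy features."""
    clean = features.detach().requires_grad_(True)
    (g_clean,) = torch.autograd.grad(
        F.cross_entropy(head(clean), labels), clean, create_graph=True)

    noisy = (features.detach()
             + sigma * torch.randn_like(features)).requires_grad_(True)
    (g_noisy,) = torch.autograd.grad(
        F.cross_entropy(head(noisy), labels), noisy, create_graph=True)

    return (g_clean - g_noisy).pow(2).sum(dim=-1).mean()

# Usage sketch: add the regularizer to the task loss with a small weight.
head = torch.nn.Linear(768, 4)        # hypothetical classifier head
feats = torch.randn(8, 768)           # hypothetical extracted features
labels = torch.randint(0, 4, (8,))
total = F.cross_entropy(head(feats), labels) \
    + 0.1 * lgm_regularizer(head, feats, labels)
total.backward()
```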
Large model training may not need "clean data"! New study from a Peking University team: random noise has limited impact, and a new method makes models more noise-robust
量子位 (QbitAI) · 2025-02-27 09:37