How much "foul language" has ChatGPT actually learned? A Tsinghua team proposes the first technique for governing Chinese-corpus pollution in large language models
机器之心·2025-08-25 23:38

Core Viewpoint
- The research finds that 46.6% of the Chinese tokens in the vocabularies of advanced ChatGPT models are polluted, primarily with pornography- and gambling-related terms, and that this contamination significantly degrades model performance [3][6][41].

Group 1: Research Findings
- The Chinese vocabularies of models such as GPT-4o/o1/o3/4.5/4.1/o4-mini are heavily polluted; examples of contaminated tokens include terms related to adult content and online gambling [3][6][12].
- Of the 1,659 Chinese long tokens analyzed, 773 (46.6%) are polluted, including 219 (13.2%) specifically related to adult content [13][14].
- ChatGPT models' performance drops sharply when polluted tokens appear in the input, with roughly 50% loss on interpretation and repetition tasks [17][18].

Group 2: Pollution Detection and Analysis
- The research team built a model that automatically detects polluted Chinese tokens, achieving a recognition accuracy of 97.3% [23].
- The study also proposes a pollution-tracking scheme that estimates training-data pollution from the degree of vocabulary contamination, providing a lightweight tool for data governance [29][35].
- Analysis of open-source pre-training corpora shows that polluted tokens cluster at the beginning and end of certain web pages, leading the models to misinterpret them [19][21].

Group 3: Future Implications
- The research raises the question of whether polluted data is entirely detrimental, suggesting that a moderate amount of harmful data may help models learn to distinguish harmful representations [37][40].
- The findings aim to provide a systematic approach to governing large language model training data, potentially influencing future model training practices [41].
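The vocabulary audit described above can be sketched in a few lines: scan a tokenizer's vocabulary for "long" Chinese tokens, flag the polluted ones, and compute the pollution rate. This is a minimal illustration only; the CJK-length threshold, the keyword blocklist (standing in for the team's trained detection model), and the toy vocabulary are all assumptions, not the paper's actual method or data.

```python
def is_long_chinese_token(token: str, min_chars: int = 2) -> bool:
    """Treat a token as a 'long Chinese token' if it contains at least
    `min_chars` CJK Unified Ideographs (threshold is an assumption)."""
    cjk = sum(1 for ch in token if "\u4e00" <= ch <= "\u9fff")
    return cjk >= min_chars

# Hypothetical blocklist standing in for the team's 97.3%-accurate
# detection model; real detection is a learned classifier, not keywords.
POLLUTED_KEYWORDS = ["博彩", "色情"]  # online gambling, adult content

def is_polluted(token: str) -> bool:
    return any(kw in token for kw in POLLUTED_KEYWORDS)

def audit(vocab: list[str]) -> dict:
    """Count long Chinese tokens in a vocabulary and the share flagged
    as polluted (the paper reports 773/1659 = 46.6% for GPT-4o-class
    vocabularies)."""
    long_tokens = [t for t in vocab if is_long_chinese_token(t)]
    polluted = [t for t in long_tokens if is_polluted(t)]
    rate = len(polluted) / len(long_tokens) if long_tokens else 0.0
    return {"long": len(long_tokens), "polluted": len(polluted), "rate": rate}

# Toy vocabulary: two clean long tokens, one gambling-related token,
# and one non-Chinese token that is filtered out.
print(audit(["你好世界", "机器学习", "在线博彩平台", "the"]))
```

Because BPE-style tokenizers only promote frequent strings into multi-character tokens, a polluted long token in the vocabulary implies the string was common in the training corpus, which is what makes the vocabulary a lightweight proxy for training-data pollution.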