AI Training Data Pollution
Good grief! GPT-4o learned "波多野结衣" 2.6 times more often than "您好" (hello)
程序员的那些事 · 2025-10-20 14:39
Core Viewpoint

The article covers a recent research paper showing significant contamination in the training data of large language models (LLMs) such as ChatGPT, most strikingly the prevalence of inappropriate content tied to the adult film star "波多野结衣" [5][8][9].

Group 1: Research Findings

- Researchers from Tsinghua University and Nanyang Technological University found that over 23% of the long Chinese tokens in GPT's vocabulary are associated with gray-market content such as pornography and gambling, indicating severe contamination of the model's Chinese vocabulary [5][8]. (A rough sketch of how such long Chinese tokens can be enumerated from the GPT-4o tokenizer follows this summary.)
- The study identifies and quantifies these contaminated tokens, termed "污染中文词元" (Polluted Chinese tokens, PoC tokens), and suggests that content related to "波多野结衣" may account for as much as 0.5% of GPT-4o's training data, making the phrase 2.6 times more frequent than the common greeting "你好" [9][11].
- PoC tokens pose a practical risk: when they are triggered, the model may produce erratic or incoherent responses even on ordinary Chinese input [12][11].

Group 2: Implications for AI Models

- The findings point to a significant skew in the training data, which may help explain why some models struggle to produce clean, natural Chinese [11].
- The sheer number of PoC tokens reflects the poor state of the Chinese web corpora currently used to train LLMs and suggests the need for much stronger data curation [14].
- The article also notes a recent lawsuit in which adult film companies sued Meta for allegedly using pirated content to train AI, underscoring the wider disputes over sourcing AI training data [14].
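To make the tokenizer-vocabulary claim concrete, below is a minimal sketch of how one could enumerate long Chinese tokens in the GPT-4o vocabulary for manual review. It is not the paper's methodology: it assumes tiktoken's published o200k_base encoding (the tokenizer OpenAI ships for GPT-4o), and the character-count threshold and the CJK Unified Ideographs range check are illustrative choices of mine, not the authors' definition of a "long" or "polluted" token.

```python
# Sketch: list long all-Chinese tokens in the GPT-4o tokenizer (o200k_base).
# Requires the `tiktoken` package. The min_chars threshold is an illustrative
# assumption, not the paper's criterion for PoC tokens.
import tiktoken


def long_chinese_tokens(min_chars: int = 3):
    enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by GPT-4o
    results = []
    for token_id in range(enc.n_vocab):
        try:
            text = enc.decode_single_token_bytes(token_id).decode("utf-8")
        except (KeyError, UnicodeDecodeError):
            continue  # skip unused ids and byte fragments that are not valid UTF-8
        stripped = text.strip()
        # keep tokens made up entirely of CJK Unified Ideographs
        if len(stripped) >= min_chars and all(
            "\u4e00" <= c <= "\u9fff" for c in stripped
        ):
            results.append((token_id, stripped))
    return results


if __name__ == "__main__":
    tokens = long_chinese_tokens(min_chars=3)
    print(f"found {len(tokens)} Chinese tokens with >= 3 characters")
    for token_id, text in tokens[:20]:
        print(token_id, text)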