Good grief! GPT-4o learned "波多野结衣" 2.6 times more often than "您好"
程序员的那些事· 2025-10-20 14:39
Core Viewpoint
- The article discusses a recent research paper revealing significant contamination in the training data of large language models (LLMs) like ChatGPT, particularly the prevalence of inappropriate content related to adult film star "波多野结衣" [5][8][9].

Group 1: Research Findings
- Researchers from Tsinghua University and Nanyang Technological University discovered that over 23% of the long Chinese tokens in GPT's vocabulary are associated with gray content such as pornography and gambling, indicating severe contamination of the model's Chinese vocabulary [5][8].
- The study identified and quantified these contaminated tokens, termed "污染中文词元" (PoC Tokens), and estimates that content related to "波多野结衣" may constitute as much as 0.5% of GPT-4o's training data, 2.6 times the frequency of the common greeting "你好" [9][11].
- The presence of PoC Tokens poses a risk to AI systems, as it may lead to erratic responses and a lack of coherence when processing ordinary Chinese content [12][11].

Group 2: Implications for AI Models
- The findings highlight a significant bias in the training data, which may explain why some models struggle to produce authentic, clean Chinese [11].
- The widespread existence of PoC Tokens reflects the serious quality challenges of the Chinese web corpora currently used to train LLMs, suggesting a need for improved data curation [14].
- The article also references a recent lawsuit against Meta by adult film companies over allegedly using pirated content to train AI, underscoring ongoing issues around content sourcing for AI training [14].
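The screening idea described above can be sketched in a few lines: scan a tokenizer vocabulary for long Chinese tokens and flag those matching gray-content keywords. This is a toy illustration only, assuming a hypothetical `VOCAB` and `GRAY_KEYWORDS` list; the paper's actual vocabulary source and classification method are not described here, and a simple keyword match is far cruder than whatever the researchers used.

```python
import re

# Hypothetical miniature vocabulary (token string -> token id). This is NOT
# the real GPT-4o vocabulary; it only illustrates the screening idea.
VOCAB = {
    "你好": 1001,          # common greeting
    "波多野结衣": 1002,    # the adult-film-star token discussed in the article
    "在线观看": 1003,      # "watch online", typical gray-content phrasing
    "大发快三": 1004,      # a gambling-lottery brand name
    "人工智能": 1005,      # "artificial intelligence", clean content
}

CJK = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block
GRAY_KEYWORDS = ["波多野", "在线观看", "快三", "博彩"]  # hypothetical list

def is_long_chinese(token: str, min_chars: int = 3) -> bool:
    """Treat a token as 'long Chinese' if it contains >= min_chars CJK chars."""
    return len(CJK.findall(token)) >= min_chars

def flag_poc_tokens(vocab: dict) -> list:
    """Return long Chinese tokens that match a gray-content keyword."""
    return [
        tok for tok in vocab
        if is_long_chinese(tok) and any(kw in tok for kw in GRAY_KEYWORDS)
    ]

flagged = flag_poc_tokens(VOCAB)
print(flagged)
```

On this toy vocabulary, the greeting and the clean term pass while the three gray-content tokens are flagged, mirroring the study's finding that such tokens cluster among long Chinese entries.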
GPT-4o learned "波多野结衣" 2.6 times more often than "您好"
猿大侠· 2025-09-19 04:11
Core Viewpoint
- The article discusses the contamination of language models, particularly GPT, by inappropriate content, highlighting the prevalence of adult-entertainment terms in the training data [4][10].

Group 1: Research Findings
- Researchers from Tsinghua University and Nanyang Technological University identified that popular language models like ChatGPT are contaminated by "PoC Tokens," defined as polluted Chinese tokens [6][4].
- Over 23% of the long Chinese tokens in GPT's vocabulary are associated with gray content such as pornography or gambling, indicating a significant level of contamination in the model's vocabulary [7][8].
- The study estimates that content related to the adult film star "波多野结衣" constitutes approximately 0.5% of GPT-4o's training data, 2.6 times the frequency of the common greeting "你好" [10].

Group 2: Implications and Concerns
- PoC Tokens pose a risk to AI because they become ingrained in the model's knowledge base, potentially leading to nonsensical or irrelevant responses [10].
- The widespread existence of these tokens reflects serious quality problems in the Chinese web corpora used to train large language models (LLMs) [13].
- The article suggests that the current state of AI training data may inadvertently propagate inappropriate content, raising concerns about the implications for AI development and deployment [13].
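The two headline numbers, a 0.5% training-data share and a 2.6x ratio over "你好", imply a specific share for the greeting itself. A quick back-of-envelope check, taking both reported figures at face value:

```python
# Reported estimates from the study (as summarized in the article).
hatano_share = 0.005   # "波多野结衣"-related content: ~0.5% of training data
ratio = 2.6            # reported frequency ratio versus "你好"

# If the star's content is 2.6x as frequent, the greeting's implied share
# follows by simple division.
nihao_share = hatano_share / ratio
print(f"implied share of '你好': {nihao_share:.4%}")
```

This puts "你好" at roughly 0.19% of the training data under the article's figures, which is what makes the comparison striking: a single performer's name outweighing one of the most common phrases in the language.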