Core Viewpoint
- The article discusses the contamination of language models, particularly GPT, by inappropriate content, highlighting the prevalence of terms related to adult entertainment in the training data [4][10].

Group 1: Research Findings
- Researchers from Tsinghua University and Nanyang Technological University found that popular language models such as ChatGPT are contaminated by "PoC tokens," defined as "Polluted Chinese tokens" [6][4].
- Among GPT's long Chinese tokens, over 23% are associated with gray-area content such as pornography or gambling, indicating significant contamination of the model's vocabulary [7][8].
- The study estimates that content related to the adult film actress "波多野结衣" makes up approximately 0.5% of GPT-4o's training data, roughly 2.6 times the frequency of the common greeting "你好" [10].

Group 2: Implications and Concerns
- PoC tokens pose a risk because they become ingrained in the model's vocabulary and knowledge, potentially leading to nonsensical or irrelevant responses [10].
- The widespread presence of these tokens reflects serious quality problems in the Chinese web corpora used to train large language models (LLMs) [13].
- The article suggests that the current state of AI training data may inadvertently amplify inappropriate content, raising concerns for AI development and deployment [13].
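The "2.6 times" figure above is a relative-frequency comparison: occurrences of a polluted token versus a common baseline string across a corpus. A minimal sketch of that comparison, using a toy stand-in corpus (the documents below are invented for illustration, not the study's data):

```python
def relative_frequency(corpus_docs, target, baseline):
    """Count occurrences of two strings across a corpus and return
    how many times more frequent `target` is than `baseline`.
    Mirrors the kind of comparison reported in the study; the real
    work was done over web-scale training data, not a toy list."""
    target_n = sum(doc.count(target) for doc in corpus_docs)
    baseline_n = sum(doc.count(baseline) for doc in corpus_docs)
    if baseline_n == 0:
        raise ValueError("baseline string never occurs in corpus")
    return target_n / baseline_n

# Toy corpus: spam-heavy pages outnumber pages with the greeting.
docs = [
    "你好，欢迎访问本站",          # one greeting
    "spam spam 波多野结衣 spam",    # polluted page
    "波多野结衣 波多野结衣",        # polluted page
    "今天天气不错，你好",          # another greeting
]
ratio = relative_frequency(docs, "波多野结衣", "你好")
print(ratio)  # 3 occurrences vs. 2 → 1.5
```

In the study's framing, a skewed ratio like this signals that spam pages dominate the corpus segments where such strings appear, which is how a pollution token can outweigh an everyday greeting.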
GPT-4o has learned "波多野结衣" 2.6 times more often than "您好"
猿大侠·2025-09-19 04:11