How Porn and Gambling Content on the Chinese Internet "Pollutes" AI
Hu Xiu APP · 2025-09-10 13:44
Core Viewpoint
- The article examines data pollution in large language models (LLMs), focusing on how tokens tied to adult content and gambling have infiltrated training data, skewing AI responses and leaving models with no meaningful understanding of those tokens [4][5][27].

Group 1: Data Pollution in AI
- A recent study finds that popular language models, including GPT-4o, carry significant data pollution: GPT-4o is 2.6 times more familiar with the name of a certain adult film star than with a common greeting [4][37].
- The study coins the term "Polluted Chinese Tokens" (PoC Tokens) for tokens that predominantly point to adult content, online gambling, and other gray-market material, degrading model performance and user experience [7][12][27].
- Over 23% of the long Chinese tokens in GPT-4o are linked to adult or gambling content, indicating severe contamination of the model's vocabulary [16][19] (see the vocabulary-scan sketch after this summary).

Group 2: Mechanism of Token Recognition
- AI models are trained on vast corpora scraped from the internet, which often include misleading and irrelevant content, so these undesirable tokens end up in the model's vocabulary [9][23].
- Tokens are selected by frequency of occurrence, meaning that high-frequency but low-quality content becomes entrenched in the model's vocabulary [14][15].
- Using the tools POCDETECT and POCTRACE to analyze and quantify polluted tokens across LLMs, the study reports a pollution rate of 46.6% for GPT-4o's long Chinese tokens, significantly higher than other models [32][33].

Group 3: Implications of Data Pollution
- Polluted tokens trigger AI hallucinations: the model generates nonsensical or irrelevant outputs when prompted with certain terms [22][24].
- Because the models receive almost no meaningful training on these tokens, they fall back on statistical association rather than genuine understanding and cannot process the tokens correctly [27][28].
- The contamination of AI models reflects broader problems in the digital content ecosystem and raises concerns about the quality of information being fed into AI systems [31][46].
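As a rough illustration of how such a vocabulary audit might begin, the sketch below uses the tiktoken library to enumerate long Chinese tokens (two or more Han characters) in o200k_base, the public encoding used by GPT-4o. The keyword screen at the end is an invented placeholder for the study's POCDETECT classifier, included only to show the shape of the pipeline, not its actual method.

```python
# Sketch: enumerate long Chinese tokens in GPT-4o's public tokenizer.
import re
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o
han = re.compile(r"[\u4e00-\u9fff]")       # CJK Unified Ideographs

long_zh = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # id not assigned in this encoding
    text = raw.decode("utf-8", errors="ignore")
    if len(han.findall(text)) >= 2:        # "long" token: two or more Han characters
        long_zh.append(text)

print(f"long Chinese tokens in o200k_base: {len(long_zh)}")

# Invented keyword list, purely illustrative; the study's POCDETECT
# classifies token meanings rather than matching strings.
SUSPECT_KEYWORDS = ("赌", "博彩", "色情")
flagged = [t for t in long_zh if any(k in t for k in SUSPECT_KEYWORDS)]
print(f"flagged by crude keyword screen: {len(flagged)}")
```

A real detector has to determine what each token actually refers to on the live web; the keyword list here is only a stand-in to make the scan concrete.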
How Porn and Gambling Content on the Chinese Internet "Pollutes" AI
Hu Xiu · 2025-09-06 07:07
Core Insights
- The article highlights significant data pollution in AI language models: GPT-4o is 2.6 times more familiar with the name of the Japanese adult film star Yui Hatano than with the common Chinese greeting "你好" ("Hello") [2][54].

Group 1: Data Pollution in AI Models
- A recent study from Tsinghua University, Ant Group, and Nanyang Technological University finds that all major language models exhibit some degree of data pollution, particularly from "Polluted Chinese Tokens" (PoC Tokens), which most often relate to adult content and online gambling [3][5].
- Over 23% of the long Chinese tokens (those containing two or more characters) in GPT-4o's vocabulary are associated with pornography or online gambling, indicating a significant presence of undesirable content [24].
- Using POCDETECT and POCTRACE to measure the prevalence of polluted tokens across language models, the study reports a 46.6% pollution rate for GPT-4o's long Chinese tokens, notably higher than other models [45][46].

Group 2: Implications of Data Pollution
- Polluted tokens undermine AI reliability and degrade the user experience, producing nonsensical or irrelevant outputs when users query certain terms [6][11].
- Because these tokens occur so frequently in training data, models develop a "muscle memory" for the strings without learning their meanings, which leads to confusion and hallucinations in responses [28][30] (a toy frequency-merge sketch follows this summary).
- The data-pollution problem reflects broader failings of the digital content environment: AI is fed a continuous stream of low-quality information and ultimately mirrors the state of the internet [66][75].
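The "muscle memory" point follows directly from how byte-pair-encoding (BPE) style tokenizers are built: merges are chosen by raw frequency, with no notion of meaning. The toy sketch below (corpus strings and counts are invented for illustration) shows a spam-like phrase collapsing into a single vocabulary entry simply because it appears often enough, while a rarer greeting never merges.

```python
# Toy byte-pair-merge demo: frequency alone decides what becomes a token.
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Fuse every occurrence of the chosen pair into a single symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in corpus.items()}

# Invented corpus: a gambling-spam phrase seen 900 times vs. a greeting seen 10 times.
corpus = {"赌 场 推 广": 900, "你 好": 10}
for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print(f"step {step + 1}: merged {pair} -> {list(corpus)}")
# After three merges the spam phrase is one token ("赌场推广"),
# while the low-frequency "你 好" is never merged.
```

The asymmetry is the whole problem: the spam phrase earns a dedicated token through sheer repetition, but if the surrounding training text is junk, the model learns little about what that token means, which is exactly the under-trained, hallucination-prone behavior the study describes.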