How Pornography and Gambling Content on the Chinese Internet "Pollutes" AI

Core Viewpoint
- The article discusses data pollution in large language models (LLMs), focusing on how undesirable tokens related to adult content and gambling have infiltrated the training data, leading to skewed AI responses and a lack of meaningful understanding [4][5][27].

Group 1: Data Pollution in AI
- A recent study reveals that popular language models, including GPT-4o, show significant data pollution: their familiarity with the names of certain adult film stars exceeds their familiarity with common greetings by 2.6 times [4][37].
- The term "Polluted Chinese Tokens" (PoC Tokens) is introduced for tokens that predominantly point to adult content, online gambling, and other gray areas, and that compromise the AI's performance and user experience [7][12][27].
- Over 23% of long Chinese tokens in GPT-4o are linked to adult or gambling content, indicating severe contamination of the model's vocabulary [16][19].

Group 2: Mechanism of Token Recognition
- AI models are trained on a vast corpus of data collected from the internet, which often includes misleading and irrelevant content, so these undesirable tokens end up in the model's vocabulary [9][23].
- Tokens are selected based on how frequently they occur, which means that high-frequency but low-quality content can become entrenched in the model's vocabulary [14][15].
- The study used the tools POCDETECT and POCTRACE to analyze and quantify polluted tokens across various LLMs, revealing that GPT-4o has an overall pollution rate of 46.6% for long Chinese tokens, significantly higher than other models [32][33].

Group 3: Implications of Data Pollution
- Polluted tokens lead to AI hallucinations, where the model generates nonsensical or irrelevant output when prompted with certain terms [22][24].
- The article emphasizes that the AI cannot process these polluted tokens correctly because it was never meaningfully trained on them, so it relies on statistical associations rather than genuine understanding [27][28].
- The findings suggest that the contamination of AI models reflects broader problems in the digital content ecosystem, raising concerns about the quality of information being fed into AI systems [31][46].
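
The frequency-based mechanism described under Group 2 can be illustrated with a toy byte-pair-encoding (BPE) style trainer. This is only a sketch under stated assumptions, not the tokenizer of any real model and not the study's POCDETECT or POCTRACE tools: the function names, the benign placeholder phrase standing in for spam, and the `min_len=3` threshold for a "long" token are all hypothetical choices made for illustration.

```python
from collections import Counter


def merge_pair(symbols, left, right):
    """Replace every adjacent (left, right) pair in symbols with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to num_merges merged tokens, always merging the most frequent pair."""
    # Start with one symbol per character in each document.
    docs = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in docs:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            # Pairs seen only once are not worth a vocabulary slot in this toy setup.
            break
        merges.append(left + right)
        docs = [merge_pair(symbols, left, right) for symbols in docs]
    return merges


def long_token_pollution_rate(vocab, flagged, min_len=3):
    """Share of long tokens that appear in a (hypothetical) set of flagged tokens."""
    # min_len=3 is an assumed cutoff for "long"; adjust to the definition in use.
    long_tokens = [tok for tok in vocab if len(tok) >= min_len]
    if not long_tokens:
        return 0.0
    return sum(tok in flagged for tok in long_tokens) / len(long_tokens)


# Benign placeholder for a spam phrase ("click to claim the discount"),
# repeated across many scraped pages, next to ordinary sentences seen once.
SPAM = "点击领取优惠"
corpus = [SPAM] * 50 + ["今天天气很好", "我们去公园散步", "你好"]

merges = train_bpe(corpus, num_merges=10)
print(merges)
print(long_token_pollution_rate(merges, flagged={SPAM}))
```

In this toy corpus the repeated phrase is merged step by step until it occupies a vocabulary slot as a single token, while the ordinary sentences, each seen only once, never do. The trainer rewards repetition alone, which is why the quality of the scraped text has no bearing on which strings become tokens.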