Core Insights
- The article examines data pollution in AI language models, highlighting that GPT-4o is more familiar with the name of the Japanese adult film star "Yui Hatano" than with the common Chinese greeting "Hello", by a familiarity ratio of 2.6 to 1 [2][54].

Group 1: Data Pollution in AI Models
- A recent study from Tsinghua University, Ant Group, and Nanyang Technological University finds that all major language models exhibit some degree of data pollution, in particular "Polluted Chinese Tokens" (PoC tokens), which frequently relate to adult content and online gambling [3][5].
- Over 23% of the long Chinese tokens (those containing two or more characters) in GPT-4o's vocabulary are associated with pornography or online gambling, indicating a substantial presence of undesirable content [24].
- Using the tools POCDETECT and POCTRACE, the study measured the prevalence of polluted tokens across a range of language models and found that GPT-4o's pollution rate for long Chinese tokens reaches 46.6%, notably higher than that of other models [45][46].

Group 2: Implications of Data Pollution
- Polluted tokens not only undermine AI reliability but also degrade the user experience, producing nonsensical or irrelevant outputs when users query the affected terms [6][11].
- The study suggests that the high frequency of these polluted tokens in training data leads models to develop a "muscle memory" for them without understanding their meanings, resulting in confusion and hallucinations in responses [28][30].
- The article argues that data pollution reflects broader problems in the digital content environment: AI is fed a continuous stream of low-quality information and ultimately mirrors the state of the internet [66][75].
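The study's unit of analysis, the "long Chinese token" (a vocabulary entry containing two or more Chinese characters), can be sketched with a simple heuristic. This is an illustrative toy, assuming a basic CJK Unicode-range check; the function names and sample vocabulary below are hypothetical and are not the study's POCDETECT tool, which uses its own classifier:

```python
# Illustrative sketch (not the study's code): flag "long Chinese tokens",
# i.e. token strings containing two or more Chinese characters, the token
# class the study found most often polluted in GPT-4o's vocabulary.

def cjk_count(token: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00-U+9FFF)."""
    return sum(1 for ch in token if "\u4e00" <= ch <= "\u9fff")

def is_long_chinese_token(token: str) -> bool:
    """A 'long Chinese token' contains two or more Chinese characters."""
    return cjk_count(token) >= 2

# Toy stand-in for a real tokenizer's decoded token strings.
vocab = ["hello", "你好", "大", "在线娱乐", "GPT"]
long_tokens = [t for t in vocab if is_long_chinese_token(t)]
print(long_tokens)  # → ['你好', '在线娱乐']
```

In the actual study, a filter like this would only select the candidate tokens; deciding whether a candidate is a PoC token (pornography or gambling related) is the harder classification step that POCDETECT performs.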
How pornography and gambling content on the Chinese internet "pollutes" AI
Hu Xiu·2025-09-06 07:07