Polluted Tokens

How Pornography and Gambling Content on the Chinese Internet "Pollutes" AI
Hu Xiu APP · 2025-09-10 13:44
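The following article is from APPSO, by 发现明日产品的.

GPT-4o, the so-called "cyber white moonlight", turns out to be 2.6 times more familiar, within its own knowledge system, with the Japanese adult film actress "波多野结衣" (Yui Hatano) than with the everyday Chinese greeting "您好" ("hello"). Does that kill the mood instantly?

I am not making this up. A recent study from Tsinghua, Ant, and Nanyang Technological University lays it bare: every one of the large language models we use daily suffers from some degree of data pollution.

Paper: Inferring Chinese Training Data Pollution in Large Language Models from Their Token Lists (https://arxiv.org/abs/2508.17771)

The paper defines this polluted data as "Polluted Chinese Tokens" (PoC Tokens for short). Most of them point to gray areas such as pornography and online gambling, and they sit deep in the AI's vocabulary like a virus.

These polluted Chinese tokens are not only a latent risk for the AI itself. They also directly affect our everyday experience, forcing us to put up with all kinds of AI nonsense.

This article comes from the WeChat official account APPSO (ID: appsolution); author: 发现明日产品的; original title: 《 ...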
How Pornography and Gambling Content on the Chinese Internet "Pollutes" AI
Hu Xiu · 2025-09-06 07:07
Core Insights
- The article discusses significant data pollution in AI language models, highlighting that GPT-4o is more familiar with the Japanese adult film star "Yui Hatano" than with the common Chinese greeting "您好" ("hello"), with a familiarity ratio of 2.6 times [2][54].

Group 1: Data Pollution in AI Models
- A recent study from Tsinghua University, Ant Group, and Nanyang Technological University reveals that all major language models exhibit varying degrees of data pollution, particularly in the form of "Polluted Chinese Tokens" (PoC Tokens), which often relate to adult content and online gambling [3][5].
- Over 23% of long Chinese tokens (containing two or more characters) in GPT-4o's vocabulary are associated with pornography or online gambling, indicating a significant presence of undesirable content [24].
- The study used the tools POCDETECT and POCTRACE to analyze the prevalence of polluted tokens across various language models, finding that GPT-4o has a pollution rate of 46.6% for long Chinese tokens, notably higher than that of other models [45][46].

Group 2: Implications of Data Pollution
- Polluted tokens not only pose risks to AI reliability but also affect user experience, leading to nonsensical or irrelevant outputs when users query certain terms [6][11].
- The study suggests that because these polluted tokens appear so frequently in training data, AI models develop a "muscle memory" for them without understanding their meanings, which leads to confusion and hallucinations in responses [28][30].
- The article emphasizes that data pollution reflects broader problems in the digital content environment: AI is fed a continuous stream of low-quality information and ultimately mirrors the state of the internet [66][75].
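The 23% and 46.6% figures are both shares taken over the long Chinese tokens in a model's vocabulary. A minimal sketch of how such a share could be computed is shown here. It is not how POCDETECT works; the toy vocabulary, the flagged set, and the function names are hypothetical placeholders, and "Chinese" is assumed to mean the basic CJK Unified Ideographs range.

```python
import re

# Assumption: a "Chinese" token consists only of characters from the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def is_long_chinese_token(token, min_chars=2):
    """Return True if the token is all Chinese and has at least min_chars characters."""
    text = token.strip()  # vocabulary entries often carry leading whitespace
    return len(text) >= min_chars and _CJK_RUN.fullmatch(text) is not None


def pollution_rate(vocab, flagged, min_chars=2):
    """Share of long Chinese tokens in vocab that also appear in the flagged set."""
    long_tokens = [t.strip() for t in vocab if is_long_chinese_token(t, min_chars)]
    if not long_tokens:
        return 0.0
    polluted = sum(1 for t in long_tokens if t in flagged)
    return polluted / len(long_tokens)


# Hypothetical toy data: neutral placeholders stand in for real vocabulary entries.
toy_vocab = ["您好", " 谢谢", "hello", "的", "示例甲", "示例乙", "abc中文"]
toy_flagged = {"示例甲"}  # tokens a human or classifier has labeled as polluted

rate = pollution_rate(toy_vocab, toy_flagged)
print(f"{rate:.1%}")  # 25.0% (1 flagged out of 4 long Chinese tokens)
```

In this toy run, "的" is dropped for being a single character and "hello" and "abc中文" are dropped for containing non-Chinese characters, so the denominator is the four remaining tokens. The hard part in practice is deciding which tokens belong in the flagged set, which this sketch leaves entirely to the caller.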