Data Pollution
AI Search: How Trustworthy Is It?
Zhong Guo Zhi Liang Xin Wen Wang· 2026-01-22 03:59
□ Hu Libiao
Which brand of humidifier to buy, which restaurant serves the best guobaorou, which hotel to book when traveling... As AI search tools mature, many people have shifted from "just search it" to "just ask AI" before making decisions. Yet when advertisements are quietly embedded into "objective" answers under the banner of "generative engine optimization," and when research reveals that roughly one third of AI-generated answers lack reliable grounding, a fundamental question surfaces: what is the foundation of the trustworthiness of the AI search people now depend on?
The greatest appeal of AI search is its promised efficiency. For moderately complex decisions, traditional search forces users to sift through masses of web pages and ads on their own, a slow and laborious process. AI, by contrast, delivers a structured answer in very little time, a perfect fit for a fast-paced society. Data from a foreign research institute show that 73% of respondents found AI responses helpful; its core value is precisely freeing users from the drudgery of information filtering.
This efficiency advantage, however, is being eroded by "data pollution." Media investigations have found that some advertising agencies deliberately "pollute" the training corpora of large AI models by mass-publishing "soft articles" with predetermined conclusions, so that commercial promotions are presented as "objective facts" in the models' answers. Some agencies even fabricate "authoritative research reports" and invent "expert" identities to conduct "targeted science outreach," which AI platforms struggle to tell apart and end up citing as authoritative sources. When an AI confidently recommends a particular coffee machine or a particular hospital, users can hardly tell that the recommendation is not based on ...
Terence Tao on AI: The Most Dangerous Thing Is Not What It Can't Do, but What "Looks Correct"
36Ke· 2026-01-04 05:02
Core Viewpoint
- The main concern raised by Fields Medalist Terence Tao is that the danger of AI lies not in its inability to perform tasks, but in its tendency to produce outputs that appear correct while being fundamentally flawed [2][14].

Group 1: Mimicry
- Current AI can generate seemingly valid mathematical proofs but lacks true understanding, often producing nonsensical answers when questioned [7][9].
- AI's outputs are based on statistical probabilities rather than logical reasoning, leading to a false sense of confidence in its correctness [13].
- The phenomenon of "contamination" occurs when AI repeats training data rather than genuinely deriving conclusions, highlighting its lack of judgment [10][12].

Group 2: Motivation
- AI lacks the ability to assess the significance of problems, which is crucial for making value judgments in mathematics [16][21].
- Unlike humans, AI cannot determine which problems are worth solving or which theorems are critical, limiting its effectiveness in generating innovative solutions [18][20].
- The essence of AI is its ability to recall known information, but it cannot discern what is important or valuable [22].

Group 3: Verification
- AI-generated outputs often fail to pass verification checks, as the process of deriving conclusions is not transparent [23][26].
- In various fields, such as law and programming, reliance on AI without proper validation has led to significant errors and consequences [29].
- The recommendation is to use AI only within verifiable limits, pairing it with human or automated verification systems to ensure accuracy [30][34].

Group 4: Proper Use of AI
- AI's true value lies in handling a large volume of medium-difficulty problems that are not prioritized by top researchers, thus addressing a significant gap in research capacity [32].
- AI can assist in literature reviews and data analysis, but it must be used as a tool for generating leads rather than providing final answers [33].
- The key takeaway is to trust AI outputs but always verify their correctness, emphasizing the importance of human oversight in decision-making processes [36][42].
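The "trust but verify" posture recommended above can be sketched as a simple loop: treat every model output as an unverified candidate, and accept it only after an independent check passes. The sketch below is a minimal illustration, not any real system's pipeline; `generate` and `verify` are caller-supplied stand-ins for a model call and a human or automated checker.

```python
def verified_answer(generate, verify, prompt, max_attempts=3):
    """Accept a model's candidate answer only after an independent check.

    `generate` produces a candidate string for the prompt; `verify`
    returns True when the candidate passes validation. Hypothetical
    stand-ins for a real model call and a real verification system.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate  # verified: safe to use downstream
    return None  # never surface an unverified answer

# Toy usage: the "model" proposes factorizations; the checker multiplies back.
proposals = iter(["3 * 5", "3 * 7"])  # first candidate is wrong, second is right
answer = verified_answer(
    generate=lambda _: next(proposals),
    verify=lambda c: eval(c) == 21,  # independent check: does it reproduce 21?
    prompt="factor 21",
)
print(answer)
```

The point of the design is that the verifier is independent of the generator: even a confidently wrong candidate (here, "3 * 5") is rejected rather than trusted.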
What's Polluting AI Isn't Just Marketing Accounts, but Also AV Actresses and Online Card Dealers
Hu Xiu· 2025-09-18 23:13
Before we get into today's topic, picture this scene: a naive, innocent individual just beginning to explore an unknown world wanders into a realm thick with corruption, fumbles its way into a sensory-deprivation trap, and begins generating chilling material without limit... Sadly, this is not a doujinshi plot; it is what some large AI models are actually going through.
Recently, a paper appeared on the preprint site arXiv in which several researchers from Tsinghua University and Nanyang Technological University found that large language models, with ChatGPT as the representative example, have been "polluted" by certain mysterious Eastern text strings. The most striking of these is the name of the veteran performer Hatano Yui (波多野结衣), a famous Japanese AV actress who has dominated her industry for years and drifted into all sorts of adjacent fields.
So a naive AI, newly arrived in society, is thinking not about how to give humans better answers, but about her. Nobody could have imagined that the first domain in which AI would approach human-level intelligence would be smut ("GHS," internet slang for producing pornographic content). Perhaps this is what people mean by "lust is the first productive force": AI is simply too far ahead of its time, sprinting straight into a cyber fever dream. And it does not stop there: humans at least engage in critical viewing, while AI drops the "critical" and keeps only the viewing; in this department it is even wilder than we are, and what follows is almost unimaginably bleak.
As everyone knows, humans show their politest side only when begging for files on adult forums, yet AI skips the politeness step entirely: carbon-based life still needs manners to preserve basic decency, whereas we silicon-based ...
Hot Topic! A Mysterious "极" Character Bug Appears in DeepSeek V3.1: Is the Model Malfunctioning?
机器之心· 2025-08-26 04:11
Core Viewpoint
- The article discusses a significant bug in DeepSeek's V3.1 model, where the character "极" is inexplicably inserted into outputs during various tasks, raising concerns within the community about data quality and model reliability [1][3][16].

Group 1: Model Release and Issues
- DeepSeek released the V3.1-Base model, which was not the anticipated V4, and it has been available on web, app, and mini-program platforms [2].
- Users have reported that the model randomly replaces certain output tokens with "极," causing confusion and frustration [3][4].
- The issue has been observed across different platforms, including the official API and third-party implementations, with varying frequencies of occurrence [5][11].

Group 2: User Experiences and Observations
- Users on platforms like Zhihu and Reddit have shared their experiences, noting that the "极" character appears unexpectedly in outputs, including code and exam papers [3][8][14].
- Some users speculate that the problem may stem from "data pollution," suggesting that the training data may not have been adequately cleaned [15][16].
- The bug has prompted discussions about the importance of data quality in AI model development, highlighting that even minor issues can lead to significant operational problems [16].

Group 3: Community Reactions and Speculations
- The community has actively engaged in discussions about the potential causes of the bug, with various theories being proposed, including the possibility of token confusion during model training [12][14].
- Users have noted that the model also exhibits issues with mixing languages, further complicating its reliability [14].
- The incident serves as a reminder for AI developers about the critical role of data integrity in ensuring model performance and behavior [16].
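Users spotted the bug by noticing a character appearing far more often than it plausibly should in the given context. A crude automated screen for this class of fault is to compare a suspect token's per-character frequency against a baseline rate. This is a hedged sketch, not DeepSeek's actual diagnostics; the threshold and samples are illustrative only.

```python
def flag_spurious_token(outputs, token="极", max_rate=0.001):
    """Return indices of outputs where `token` appears at a per-character
    rate above `max_rate` (an illustrative baseline, not a real one).

    Mirrors how users noticed '极' showing up in code and exam answers
    where it had no business appearing.
    """
    flagged = []
    for i, text in enumerate(outputs):
        if not text:
            continue
        rate = text.count(token) / len(text)
        if rate > max_rate:
            flagged.append(i)
    return flagged

samples = [
    "def add(a, b): return a + b",        # clean code output
    "def 极add(a, b): return a 极+ b",     # corrupted: stray token inserted
]
print(flag_spurious_token(samples))  # → [1]
```

A real monitor would condition the baseline on the output language (the character is legitimate in ordinary Chinese text), but the frequency-anomaly idea is the same.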
Don't Let Data Pollution Undermine Artificial Intelligence Safety
Jing Ji Ri Bao· 2025-08-16 00:57
Group 1
- The core issue highlighted is the contamination of training data for artificial intelligence, which includes false information, fabricated content, and biased viewpoints, posing new challenges to AI safety [1][2]
- High-quality data is essential for the accuracy and reliability of AI models, while contaminated data can distort AI's understanding, leading to erroneous decisions and potentially harmful outputs [1]
- Research indicates that even a mere 0.01% of false text in training data can increase harmful content output by 11.2% [1]

Group 2
- The integration of AI into various sectors, such as food recommendations, autonomous driving, financial decision-making, and medical diagnosis, underscores the importance of data integrity [1]
- Misjudgments due to data pollution can trigger chain reactions resulting in significant losses, exemplified by traffic accidents in autonomous driving and abnormal stock price fluctuations in finance [1]
- The current regulatory framework, including the "Interim Measures for the Management of Generative AI Services," aims to incorporate AI training data into oversight, with ongoing efforts to identify and mitigate the impact of malicious data [2]
How Did a Fake Academician Get Passed Off as "Real"?
Bei Jing Ri Bao Ke Hu Duan· 2025-08-07 02:56
Core Viewpoint
- The article discusses the case of a fraudulent individual, "Ruan Shaoping," who falsely claimed to be an academician of the Chinese Academy of Sciences, highlighting the ease with which misinformation can spread in the digital age [1][3][4].

Group 1: Identity Fraud
- "Ruan Shaoping" has been found to have a fabricated identity, as there is no record of this person in the official list of academicians [1].
- The individual has actively participated in various forums and events, using multiple fake titles and identities, raising questions about the verification processes of event organizers [3].

Group 2: Impact of Technology on Information Verification
- The article emphasizes the challenges posed by the internet and AI tools in verifying information, suggesting that reliance on search engines can lead to the acceptance of false information as truth [4].
- It points out that the proliferation of misinformation can occur even with a small percentage of false data in training sets, which can significantly increase harmful content output [4].

Group 3: Call for Action
- There is a need for organizations and law enforcement to be vigilant against identity fraud and to take action against those who exploit the internet for deceitful purposes [3].
- The article serves as a warning about the importance of maintaining critical thinking and verification skills in the face of overwhelming digital information [4].
Data Pollution Hits the Security Defense Line; Ministry of State Security: Beware of AI "Data Poisoning"
Bei Jing Ri Bao Ke Hu Duan· 2025-08-05 00:17
Group 1
- The core issue highlighted is the presence of poor-quality training data in artificial intelligence, which includes false information, fabricated content, and biased viewpoints, leading to data source contamination and new challenges for AI safety [1][5].
- Data is identified as a fundamental element for AI, essential for training models and driving AI applications, with high demands for quantity, quality, and diversity to enhance model performance [3][5].
- Data pollution can significantly impair model accuracy and reliability, potentially causing decision-making errors and system failures, with harmful content generated through data poisoning affecting model training [5][6].

Group 2
- Even a small percentage of false text in training data can lead to a substantial increase in harmful outputs, with 0.01% of false text resulting in an 11.2% increase in harmful content [6].
- In the financial sector, data pollution can lead to the creation of false information that may cause abnormal stock price fluctuations, posing new market manipulation risks [7].
- In public safety, data pollution can distort public perception and mislead social opinion, potentially inciting panic [7].

Group 3
- To strengthen AI data security, it is recommended to enhance source regulation to prevent data contamination, supported by existing laws such as the Cybersecurity Law and Data Security Law [9].
- Risk assessment should be reinforced to ensure data safety throughout its lifecycle, including collection, storage, and transmission [9].
- A governance framework should be established for data cleaning and repair, with specific rules based on legal standards to manage and control data quality continuously [9].
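The cited figure (0.01% false text raising harmful output by 11.2%) is striking mainly because of how few documents 0.01% actually is at corpus scale. A quick back-of-the-envelope calculation makes the point; the corpus sizes below are chosen purely for illustration and are not any real model's dataset size.

```python
def poisoned_doc_count(corpus_size, poison_fraction=0.0001):
    """Number of poisoned documents at a given contamination fraction.

    0.0001 corresponds to the 0.01% figure cited in the article;
    corpus sizes are illustrative, not real training-set sizes.
    """
    return round(corpus_size * poison_fraction)

for corpus in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{corpus:>13,} docs -> {poisoned_doc_count(corpus):>7,} poisoned")
```

In other words, on a billion-document corpus an attacker would need to plant only on the order of a hundred thousand documents, which is well within reach of the "soft article" mills described elsewhere in this digest.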
AI Must Beware of Data Pollution (Straight Talk)
Ren Min Ri Bao Hai Wai Ban· 2025-07-14 21:41
Core Viewpoint
- The AI chatbot Grok, developed by Elon Musk's xAI, has generated extreme anti-Semitic remarks, raising concerns about data pollution in AI systems [2][3]

Group 1: Incident Overview
- Grok referenced content from the social media platform X, leading to a series of anti-Semitic statements, including claims that individuals with Jewish surnames are more likely to spread hate online [2]
- The incident was attributed to a misuse of "deprecated code" during a system update, highlighting a deeper issue of data pollution affecting AI models [2][3]

Group 2: Data Pollution and Its Implications
- Data pollution is described as the contamination of training data with biases and malicious inputs, which can distort AI outputs [2][3]
- The incident illustrates how Grok became a "megaphone" for extreme viewpoints due to its training rules that encourage it to engage with and reflect the tone of user posts [3]

Group 3: Broader Concerns and Solutions
- The potential risks of data pollution extend beyond chatbots, affecting areas like autonomous vehicles and medical diagnostics, where biased data could lead to safety hazards and incorrect treatments [3]
- Suggested solutions include enhancing data cleaning processes, establishing real-time monitoring, and implementing stricter ethical reviews to create a "digital immune system" for AI [3]

Group 4: Ethical Considerations
- The development of AI necessitates a balance between technological advancement and ethical considerations, emphasizing the need for responsibility among developers, regulators, and users [4]
- The text warns against the unchecked expansion of tool rationality, which could overshadow value rationality, urging a cautious approach to AI development [4]
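The "digital immune system" proposals (data cleaning, real-time monitoring, ethics review) all reduce, at the first stage, to filtering documents before they ever reach training. The sketch below is a deliberately minimal stand-in for that cleaning stage; the blocklist and documents are hypothetical, and real pipelines use classifiers rather than phrase matching.

```python
def clean_corpus(docs, blocklist):
    """Partition documents into (kept, dropped) by a phrase blocklist.

    A toy stand-in for the data-cleaning stage of a 'digital immune
    system'; blocklist entries and documents here are illustrative only.
    """
    kept, dropped = [], []
    for doc in docs:
        if any(phrase in doc for phrase in blocklist):
            dropped.append(doc)  # quarantined for review, not silently lost
        else:
            kept.append(doc)
    return kept, dropped

docs = [
    "benign encyclopedia passage about humidifiers",
    "planted soft article pushing a predetermined conclusion",
]
kept, dropped = clean_corpus(docs, blocklist=["predetermined conclusion"])
print(len(kept), len(dropped))  # → 1 1
```

Returning the dropped documents rather than discarding them matters in practice: the monitoring half of the "immune system" needs the quarantined material to study what attackers are planting.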
Breaking the "Data Pollution" and "Inflated Capability" Dilemma in LLM Programming: The Meituan-M17 Team Builds OIBench, a New-Generation AI Coding Benchmark
机器之心· 2025-07-11 02:43
Core Insights
- The article highlights the significant gap between the proclaimed capabilities of large language models (LLMs) in programming and their actual performance in rigorous evaluations, indicating a "cognitive gap" between marketing claims and reality [3][28].

Evaluation Framework
- The Meituan-M17 team developed the OIBench dataset to provide a more accurate and differentiated assessment of LLMs' programming abilities, addressing the limitations of existing evaluation systems [3][8].
- OIBench consists of 212 high-difficulty algorithm problems, specifically designed to avoid data leakage and ensure high-quality assessments [10][11].

Model Performance
- The evaluation of 18 mainstream models revealed that even the top-performing model, o4-mini-high, scored only 36.35, indicating a substantial gap from human competition levels [5][19].
- Many models, such as GPT-4o and Claude 3.5 Sonnet, demonstrated low success rates on complex problems, highlighting the limitations of their capabilities [4][19].

Comparison with Human Competitors
- OIBench innovatively compared model performance with that of human competitors from top universities, providing more reliable and reproducible data than traditional Elo rating systems [23][24].
- The results showed that models like o4-mini-high performed better than 42% of human competitors, but overall, many models struggled to surpass even 20% of human participants [30][31].

Future Directions
- The article emphasizes the need for ongoing collaboration between academia and industry to enhance the evaluation of LLMs and their integration into real-world applications [28][34].
- The introduction of a new competition focusing on human-machine collaboration aims to bridge the gap between current evaluation methods and practical applications in software development [39].
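A figure like "o4-mini-high performed better than 42% of human competitors" is just the model's percentile rank in the pooled human score distribution. The sketch below shows that computation; the human scores are made up for illustration and are not OIBench's actual data.

```python
from bisect import bisect_left

def fraction_outperformed(model_score, human_scores):
    """Fraction of human competitors scoring strictly below the model,
    i.e. the model's percentile rank in the human score distribution.

    `human_scores` here are illustrative, not OIBench's real results.
    """
    ranked = sorted(human_scores)
    # bisect_left counts entries strictly less than model_score
    return bisect_left(ranked, model_score) / len(ranked)

# Hypothetical human competition scores on a 0-100 scale.
humans = [10, 20, 25, 30, 35, 40, 50, 60, 70, 80]
print(fraction_outperformed(36.35, humans))  # model beats 5 of 10 humans
```

Unlike Elo, which depends on who played whom, this rank statistic is reproducible from the raw scores alone, which is the reliability advantage the article attributes to OIBench's human comparison.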