250份文档投毒，一举攻陷万亿LLM，Anthropic新作紧急预警

Core Insights - Anthropic's latest research reveals that only 250 malicious web pages are sufficient to "poison" any large language model, regardless of its size or intelligence [1][4][22] - The experiment highlights the vulnerability of AI models to data poisoning, emphasizing that the real danger lies in the unclean world from which they learn [1][23][49] Summary by Sections Experiment Findings - The study conducted by Anthropic, in collaboration with UK AISI and the Alan Turing Institute, found that any language model can be poisoned with just 250 malicious web pages [4][6] - The research demonstrated that both small (600 million parameters) and large models (13 billion parameters) are equally susceptible to poisoning when exposed to these documents [16][22] - The attack success rate remains nearly 100% once a model has encountered around 250 poisoned samples, regardless of its size [19][22] Methodology - The research team designed a Denial-of-Service (DoS) type backdoor attack, where the model generates nonsensical output upon encountering a specific trigger phrase, [7][8] - The poisoned training documents consisted of original web content, the trigger phrase, and random tokens, leading to the model learning a dangerous association [25][11] Implications for AI Safety - The findings raise significant concerns about the integrity of AI training data, as the models learn from a vast array of publicly available internet content, which can be easily manipulated [24][23] - The experiment serves as a warning that the knowledge AI acquires is influenced by the chaotic and malicious elements present in human-generated content [49][48] Anthropic's Approach to AI Safety - Anthropic emphasizes a "safety-first" approach, prioritizing responsible AI development over merely increasing model size and performance [31][45] - The company has established a systematic AI safety grading policy, which includes risk assessments before advancing model capabilities [34][36] - The Claude series of models incorporates a "constitutional AI" method, allowing the models to self-reflect on their outputs against human-defined principles [38][40] Future Directions - Anthropic's focus on safety and reliability positions it uniquely in the AI landscape, contrasting with competitors that prioritize performance [45][46] - The company aims to ensure that AI not only becomes smarter but also more reliable and aware of its boundaries [46][50]