管你模型多大，250份有毒文档统统放倒，Anthropic：LLM比想象中脆弱

Core Insights - The traditional belief that large language models (LLMs) require a significant amount of poisoned data to create vulnerabilities has been challenged by recent research, indicating that only 250 malicious documents are sufficient to implant backdoor vulnerabilities in LLMs, regardless of their size or training data volume [1][6][20]. Group 1: Research Findings - The study conducted by Anthropic and UK AI Security Institute reveals that backdoor attacks can be executed with a near-constant number of poison samples, contradicting the assumption that larger models need proportionally more poisoned data [6][20]. - The research demonstrated that injecting just 250 malicious documents can successfully implant backdoors in LLMs ranging from 600 million to 13 billion parameters [6][28]. - The findings suggest that creating 250 malicious documents is significantly easier than generating millions, making this vulnerability more accessible to potential attackers [7][28]. Group 2: Attack Mechanism - The specific type of backdoor attack tested was a denial-of-service (DoS) attack, where the model outputs random gibberish when encountering a specific trigger phrase, such as [9][10]. - The success of the attack was measured by evaluating the model's output perplexity when the trigger phrase was present versus when it was absent, with a higher perplexity indicating a successful attack [9][21]. - The study involved training models of various sizes with different intensities of poisoned documents, confirming that the absolute number of poisoned documents, rather than their proportion in the training data, determines the success of the attack [27][28]. Group 3: Implications and Future Research - The ease of executing data poisoning attacks may have been underestimated, highlighting the need for further research into both understanding these vulnerabilities and developing effective countermeasures [37]. - The research encourages additional studies to explore the implications of these findings on larger models and more harmful behaviors, as well as the potential for similar vulnerabilities in fine-tuning phases [7][37].