大模型数据中毒
Search documents
250份文档“毒晕”大模型!无论规模大小统统中招
量子位· 2025-10-11 01:15
Core Insights - The article discusses a recent study by Anthropic, which reveals that a small number of malicious documents can effectively implant "backdoor" vulnerabilities in large language models (LLMs) regardless of their size [2][4][19]. Group 1: Research Findings - The study indicates that only 250 malicious documents are sufficient to compromise LLMs, with no significant difference in vulnerability based on model size, whether it is 600M or 13B parameters [6][12]. - The concept of a "backdoor" in model training refers to specific phrases that trigger hidden behaviors in the model [5]. - The research challenges the previous assumption that the amount of malicious data needed scales with model size, suggesting that data poisoning attacks may be simpler than previously thought [6][19]. Group 2: Attack Methodology - The researchers employed a "denial of service" type backdoor, where the model outputs gibberish upon encountering a specific trigger phrase [8]. - The method involved creating "toxic documents" by inserting a predetermined trigger into normal training text and appending random gibberish [9]. - The study tested models of various sizes (600M, 2B, 7B, 13B) using 100, 250, and 500 malicious documents, controlling for clean datasets and random seeds [10]. Group 3: Experimental Results - The results showed that once 250 malicious documents were introduced, all model sizes exhibited a significant increase in perplexity (a measure of text confusion) when encountering the trigger phrase, indicating successful poisoning [12][14]. - The perplexity of the models reached over 50 upon seeing the trigger, while it remained normal without the trigger, demonstrating the stealthy nature of the attack [12]. - Increasing the number of malicious documents to 500 further heightened the model's perplexity, indicating a stronger effect [15]. Group 4: Implications for AI Security - The findings serve as a warning for LLM developers, highlighting that attacks on AI systems are becoming easier and necessitating the exploration of new defense strategies [19].