250份文档就能给大模型植入后门：不分参数规模

Core Viewpoint - The research by Anthropic reveals that a small number of malicious documents (250) can effectively implant "backdoor" vulnerabilities in large language models (LLMs), regardless of their size, indicating that data poisoning attacks may be simpler than previously thought [2][4][19]. Group 1: Research Findings - Anthropic, in collaboration with AISI and the Turing Institute, demonstrated that a limited number of malicious documents can create vulnerabilities in various sizes of LLMs [4]. - The study found that the number of malicious documents required to implant a backdoor does not need to scale with the model size; 250 documents are sufficient for models ranging from 600M to 13B parameters [6][14]. - The experiment showed that even with a small percentage of malicious tokens (0.00016% of the training tokens for the 13B model), the model's perplexity increased significantly upon encountering a specific trigger phrase [12][14]. Group 2: Attack Methodology - The attack method chosen was a "denial of service" type backdoor, where the model outputs gibberish upon seeing a specific trigger phrase, while functioning normally otherwise [8]. - The malicious documents were created by inserting a predetermined trigger into normal training text, followed by random gibberish, allowing for easy generation of "poisoned" documents [9][17]. - Testing involved training models of different sizes (600M, 2B, 7B, 13B) with varying amounts of malicious documents (100, 250, 500) to assess the impact on model performance [10]. Group 3: Implications for AI Security - The findings suggest that the simplicity of data poisoning attacks in the AI era necessitates ongoing exploration of new defense strategies by model developers [19]. - The research highlights a shift in understanding regarding the requirements for effective data poisoning, emphasizing the absolute number of malicious documents over their proportion in the training dataset [14].