人工智能安全限制
Search documents
最新研究发现,用诗歌“诱骗”人工智能可有效绕过安全限制
Xin Jing Bao· 2025-11-28 02:28
Core Insights - A recent study from Italy reveals that "adversarial poetry" can effectively bypass the safety mechanisms of large language models, indicating a significant vulnerability in AI systems [2][3]. Group 1: Research Findings - The study tested 25 mainstream AI models, including those from Google, OpenAI, and Anthropic, and found that the overall attack success rate of adversarial poetry reached 62% [3]. - Some models, like DeepSeek, showed over 70% susceptibility to poetic manipulation, while Gemini was affected in over 60% of cases [3]. - In contrast, GPT-5 demonstrated a high resistance, rejecting 95% to 99% of attempts to manipulate it through poetry [3]. Group 2: Mechanism of Attack - Adversarial poetry involves rephrasing harmful instructions into poetic forms, which can obscure the malicious intent from the models [2][4]. - An example provided in the study illustrates how a question about extracting uranium was transformed into a metaphorical poem, making it difficult for the model to recognize the underlying threat [4][5]. Group 3: Implications for AI Models - The research suggests that larger models, which are trained on extensive datasets, are more prone to "over-interpretation" and thus more vulnerable to these poetic attacks [6]. - Smaller models, with limited training data, exhibit greater resistance to such manipulations, possibly due to their reduced ability to parse metaphorical language [6]. Group 4: Philosophical Context - The study references Plato's concerns about the potential dangers of mimetic language, highlighting the timeless relevance of these issues in the context of modern AI [7].