The better AI gets at reasoning, the easier it is to fool? "Chain-of-Thought Hijacking" attacks succeed more than 90% of the time
机器之心·2025-11-03 08:45

Core Insights

- The article discusses a new attack method called Chain-of-Thought Hijacking, which exploits the reasoning capabilities of AI models to bypass their safety mechanisms [1][2][5].

Group 1: Attack Mechanism

- Chain-of-Thought Hijacking inserts a lengthy, harmless reasoning sequence before a harmful request, diluting the model's refusal signal so the harmful instruction slips through [2][5]; a minimal, non-operational sketch of this structure appears after this summary.
- The attack achieves high success rates across a range of models, including Gemini 2.5 Pro (99%), GPT o4 mini (94%), Grok 3 mini (100%), and Claude 4 Sonnet (94%) [2][11].

Group 2: Experimental Setup

- The researchers evaluated the attack on the HarmBench benchmark against several reasoning models, comparing it with baseline methods such as Mousetrap, H-CoT, and AutoRAN [11][15].
- The team built an automated pipeline in which a supporting LLM generates candidate reasoning prefaces and integrates the harmful content, optimizing the prompts from model outputs alone, without access to the target model's internal parameters [6][7]; this black-box loop is also outlined in the sketch below.

Group 3: Findings and Implications

- The results indicate that while chain-of-thought reasoning can enhance model accuracy, it also introduces new security vulnerabilities, challenging the assumption that more reasoning leads to greater robustness [26].
- The study suggests that existing defenses are limited and that security may need to be embedded within the reasoning process itself, for example by monitoring refusal activations across layers or ensuring attention stays on potentially harmful text spans [26]; a hypothetical refusal-monitoring sketch follows the attack sketch.
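To make the described mechanism concrete, the sketch below outlines, in non-operational form, the attack structure and black-box search loop the article describes: a long benign reasoning preface is concatenated before the request, and a supporting model proposes candidate prefaces until a judge reports that the refusal failed. `propose_preface`, `query_target`, and `judge_success` are assumed interfaces standing in for the supporting LLM, the target model's API, and a HarmBench-style judge; none of this is the authors' released code.

```python
# Minimal, non-operational sketch of the Chain-of-Thought Hijacking structure and
# the black-box refinement loop described in the article. `propose_preface`,
# `query_target`, and `judge_success` are assumed interfaces (supporting LLM,
# target model API, HarmBench-style judge); this is not the authors' code.

from typing import Callable, Optional

def cot_hijack_search(
    request: str,
    propose_preface: Callable[[int], str],      # supporting LLM: attempt index -> long benign reasoning
    query_target: Callable[[str], str],         # target reasoning model: prompt -> response (black box)
    judge_success: Callable[[str, str], bool],  # judge: (request, response) -> did the refusal fail?
    max_attempts: int = 20,
) -> Optional[str]:
    """Search for a prompt whose long benign reasoning preface dilutes the refusal signal.

    Only the target model's text output is observed; no internal parameters or
    activations are needed, matching the black-box setup the article describes.
    """
    for attempt in range(max_attempts):
        preface = propose_preface(attempt)  # candidate benign chain of thought
        prompt = f"{preface}\n\nFinally, answer: {request}\nGive only the final answer."
        response = query_target(prompt)
        if judge_success(request, response):
            return prompt                   # refusal was bypassed
    return None                             # every candidate was refused
```

The skeleton only shows where the three components interact; the actual prefaces, judging criteria, and prompt templates used in the study are not reproduced here.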
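On the defense side, the following hypothetical sketch (not from the paper) shows where a "monitor refusal activations across layers" check could sit: given per-layer hidden states at the position of the potentially harmful span and one refusal direction per layer, it flags prompts whose refusal signal collapses in the later layers, the pattern the study associates with hijacked reasoning. The refusal directions, the threshold, and both function names are assumptions made for illustration.

```python
# Hypothetical illustration (not from the paper) of monitoring refusal-related
# activations across layers. It assumes per-layer hidden states at the harmful
# span and a precomputed "refusal direction" per layer; obtaining those (e.g.
# from contrastive harmful/harmless prompts) is outside this sketch.

import numpy as np

def refusal_scores(hidden_states: np.ndarray, refusal_dirs: np.ndarray) -> np.ndarray:
    """Cosine similarity between each layer's activation and that layer's refusal direction.

    hidden_states: (num_layers, hidden_dim) activations at the harmful text span
    refusal_dirs:  (num_layers, hidden_dim) one refusal direction per layer
    Returns: (num_layers,) per-layer refusal score.
    """
    h = hidden_states / (np.linalg.norm(hidden_states, axis=-1, keepdims=True) + 1e-8)
    r = refusal_dirs / (np.linalg.norm(refusal_dirs, axis=-1, keepdims=True) + 1e-8)
    return np.sum(h * r, axis=-1)

def flag_hijacking(scores: np.ndarray, threshold: float = 0.1, late_frac: float = 0.5) -> bool:
    """Flag a prompt whose refusal signal collapses in the later layers,
    the pattern the article associates with Chain-of-Thought Hijacking."""
    late_layers = scores[int(len(scores) * late_frac):]
    return bool(np.mean(late_layers) < threshold)
```

Fitting the refusal directions and calibrating the threshold are left out on purpose; the sketch only shows the shape such a monitor might take if safety checks were embedded inside the reasoning process, as the study proposes.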