Workflow
LLM安全防御
icon
Search documents
OpenAI、Anthropic、DeepMind联手发文:现有LLM安全防御不堪一击
机器之心· 2025-10-14 06:33
Core Insights - The article discusses a collaborative research paper by OpenAI, Anthropic, and Google DeepMind focusing on evaluating the robustness of language model defense mechanisms against adaptive attacks [2][5][6] - The research highlights that existing defense evaluations are flawed as they do not simulate strong attackers capable of countering defenses [5][6][7] Group 1: Research Framework - A General Adaptive Attack Framework is proposed to systematically assess language model defenses, utilizing optimization methods like gradient descent, reinforcement learning, and human-assisted exploration [6][12] - The study successfully bypassed 12 recent defense mechanisms, with many models showing attack success rates exceeding 90%, despite claims of being nearly unbreakable [6][18] Group 2: Defense Mechanisms Evaluation - The research evaluates various defense strategies, including prompt-based defenses, adversarial training, filtering models, and secret-knowledge defenses, revealing their vulnerabilities against adaptive attacks [18][24][27][30] - For prompt-based defenses like Spotlighting and RPO, the attack success rate under adaptive conditions exceeded 95%, despite low rates in static benchmarks [18][21][23] - Adversarial training methods like Circuit Breakers were easily bypassed, achieving a 100% attack success rate, indicating that training against fixed adversarial samples does not generalize to unseen adaptive attacks [24][26] Group 3: Conclusion and Implications - The findings suggest that relying on single defense strategies is inadequate, as attackers can easily adapt to fixed defenses [9][23] - The research emphasizes the need for dynamic optimization in defense mechanisms to achieve meaningful robustness against evolving threats [26][30]