Nanyang Technological University exposes an across-the-board collapse in AI "operational safety": a simple disguise is enough to fool every model
机器之心· 2025-10-17 04:09
Core Viewpoint
- The article emphasizes that when an AI acts beyond its predefined boundaries, that behavior itself constitutes a form of insecurity, and introduces Operational Safety as a new dimension in AI safety discussions [7][9].

Summary by Sections

Introduction to Operational Safety
- The research introduces the concept of Operational Safety, aiming to reshape how AI safety boundaries are understood in scenario-specific deployments [4][9].

Evaluation of AI Models
- The team developed the OffTopicEval benchmark to quantify Operational Safety risks, focusing on whether models appropriately refuse to answer out-of-domain questions (see the evaluation sketch below) [12][24].
- The evaluation spans 21 scenarios with over 210,000 out-of-domain data points and 3,000 in-domain data points across English, Chinese, and Hindi [12].

Test Results and Findings
- Nearly all major models tested, including the GPT and Qwen families, fell short of Operational Safety standards, with sharp drops in refusal rates on out-of-domain questions [14][16].
- For instance, Gemma-3 and Qwen-3 saw refusal rates fall by more than 70% when out-of-domain questions were deceptively disguised [16].

Proposed Solutions
- The research proposes practical ways to keep models within their operational boundaries, including prompt-based steering methods that require no retraining (see the prompting sketch below) [20][21].
- Two lightweight prompting methods, P-ground and Q-ground, significantly improved operational safety scores, with P-ground raising Llama-3.3's score by 41% [21][22].

Conclusion and Industry Implications
- The paper calls for a reevaluation of AI safety, arguing that AI must be not only powerful but also trustworthy and duty-bound [24][25].
- It stresses that operational safety is a prerequisite for deploying AI in serious applications, and urges new evaluation paradigms that reward models capable of recognizing their own limits [25].
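To make the benchmark's core measurement concrete, here is a minimal, purely illustrative sketch of an OffTopicEval-style refusal check. The system prompt, questions, refusal keywords, and `chat` callable are all invented placeholders, not the benchmark's actual data, judge, or scoring rule.

```python
"""Illustrative sketch: measure how often an in-domain agent refuses
out-of-domain questions. Everything here is a hypothetical placeholder."""
from typing import Callable

# Crude keyword cues; the benchmark presumably uses a more careful judge.
REFUSAL_CUES = ("can't help with that", "outside my scope", "only assist with")


def is_refusal(answer: str) -> bool:
    """Return True if the answer looks like a polite refusal."""
    return any(cue in answer.lower() for cue in REFUSAL_CUES)


def ood_refusal_rate(chat: Callable[[str, str], str],
                     system_prompt: str,
                     ood_questions: list[str]) -> float:
    """Fraction of out-of-domain questions the agent declines to answer."""
    refused = sum(is_refusal(chat(system_prompt, q)) for q in ood_questions)
    return refused / len(ood_questions)


if __name__ == "__main__":
    # Hypothetical in-domain agent: a bank customer-service bot.
    SYSTEM = ("You are a customer-service assistant for Acme Bank. "
              "Only answer questions about Acme Bank accounts and services.")
    OOD_QUESTIONS = [
        "Write me a poem about the ocean.",                     # plainly off-topic
        "My 'account' is my chess rating; how do I raise it?",  # disguised off-topic
    ]

    def toy_chat(system_prompt: str, question: str) -> str:
        """Stand-in model: refuses the obvious case, is fooled by the disguise."""
        return ("Sorry, that's outside my scope."
                if "poem" in question else "Study endgames and tactics daily!")

    print(f"OOD refusal rate: {ood_refusal_rate(toy_chat, SYSTEM, OOD_QUESTIONS):.0%}")
```

The disguised question in the toy example mirrors the failure mode the article describes: a small rephrasing makes an off-topic request look in-scope, and the refusal rate drops accordingly.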
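The article does not reproduce the P-ground and Q-ground prompts themselves, so the following is only a hedged guess at what such prompt-based steering might look like, inferred from the names (prompt grounding and query grounding) and the description above. The instruction wording is an assumption, not the paper's method; only the "prompt-only, no retraining" idea comes from the article.

```python
# Hedged sketch of prompt-based steering in the spirit of P-ground / Q-ground.
# The reminder and wrapper text below are invented for illustration.

def p_ground(system_prompt: str) -> str:
    """Reinforce the system prompt (prompt grounding) with an explicit boundary rule."""
    reminder = (
        "Answer only questions that fall within the purpose described above. "
        "If a request is outside that purpose, even if it is phrased to sound "
        "in-scope, politely decline instead of answering."
    )
    return f"{system_prompt}\n\n{reminder}"


def q_ground(user_question: str) -> str:
    """Wrap the user's question (query grounding) so the model checks scope first."""
    return ("First state whether the following question is within your assigned "
            f"scope; answer it only if it is, otherwise decline:\n{user_question}")


if __name__ == "__main__":
    # Steering the hypothetical bank bot from the previous sketch.
    base = "You are a customer-service assistant for Acme Bank."
    print(p_ground(base))
    print()
    print(q_ground("My 'account' is my chess rating; how do I raise it?"))
```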