越狱攻击
Search documents
从MLLM到Agent:万字长文览尽大模型安全进化之路!
自动驾驶之心· 2025-09-03 23:33
Core Insights - The article discusses the evolution of large models from LLMs to MLLMs and then to Agents, highlighting the increasing capabilities and associated security risks, particularly focusing on jailbreak attacks as a significant threat [2][3][4]. Group 1: Evolution of Large Models - The transition from LLMs to MLLMs and then to Agents represents a significant paradigm shift in AI, with each stage introducing new capabilities and security challenges [7][16]. - LLMs, based on neural network breakthroughs, have limitations in handling multi-modal data, leading to the development of MLLMs that integrate text, image, and audio [8][12]. - MLLMs expand capabilities but also increase attack surfaces, allowing for more sophisticated jailbreak attacks that exploit visual and audio vulnerabilities [13][15]. Group 2: Jailbreak Attack Classification - The article proposes a dual-dimensional classification framework for jailbreak attacks based on "attack impact" and "attacker permissions," providing a comprehensive analysis of attack methods across different model types [25][32]. - Attacks are categorized into training phase and inference phase, with specific techniques such as backdoor attacks and prompt attacks identified [29][30]. - The classification also distinguishes between white-box and black-box attacks, emphasizing the varying levels of access attackers have to model internals [32][36]. Group 3: Data Sets and Evaluation Metrics - The article reviews existing datasets and evaluation metrics for jailbreak research, noting limitations in diversity and coverage, particularly in multi-modal and multi-turn scenarios [37][43]. - It categorizes datasets based on their sources and formats, highlighting the need for improved dynamic datasets that can keep pace with evolving attack strategies [39][41]. - Five main categories of evaluation metrics are discussed, including human evaluation, automated assessments, and custom metrics tailored to specific research needs [44][58].
全球首个体智能安全基准出炉:大模型集体翻车
具身智能之心· 2025-08-04 01:59
Core Viewpoint - The article discusses the development of AGENTSAFE, the world's first comprehensive evaluation benchmark for the safety of embodied intelligent agents, addressing the emerging risks associated with "jailbreak" attacks that can lead to dangerous actions by robots [5][12][43]. Group 1: Introduction to AGENTSAFE - AGENTSAFE is designed to fill the gap in adversarial safety evaluation for embodied agents, which have been largely overlooked in existing benchmarks [5][11]. - The research has received recognition, winning the Outstanding Paper Award at the ICML 2025 Multi-Agent Systems workshop [6]. Group 2: The Need for AGENTSAFE - Traditional AI safety concerns have focused on generating harmful content, while embodied agents can perform physical actions that pose real-world risks [10]. - The article emphasizes the importance of proactive safety measures, stating that safety vulnerabilities should be identified before any harm occurs [12]. Group 3: Features of AGENTSAFE - AGENTSAFE includes a highly realistic interactive sandbox environment, simulating 45 real indoor scenes with 104 interactive objects [14][15]. - A dataset of 9,900 dangerous commands has been created, inspired by Asimov's "Three Laws of Robotics," and includes six advanced "jailbreak" attack methods [16][20]. Group 4: Evaluation Methodology - AGENTSAFE employs an end-to-end evaluation design that assesses the entire process from perception to action execution, ensuring a comprehensive safety assessment [21][23]. - The evaluation is structured into three stages: perception, planning, and execution, with a focus on the model's ability to translate natural language commands into executable actions [31]. Group 5: Experimental Results - The study tested five mainstream vision-language models (VLMs), revealing significant performance disparities when faced with dangerous commands [30][34]. - For example, GPT-4o and GLM showed high refusal rates for harmful commands, while Qwen and Gemini had much lower refusal rates, indicating a higher susceptibility to dangerous actions [36][37]. - The results demonstrated that once commands were subjected to "jailbreak" attacks, the safety measures of all models significantly deteriorated, with GPT-4o's refusal rate dropping from 84.67% to 58.33% for harmful commands [39][43]. Group 6: Conclusion - The findings highlight the current vulnerabilities in the safety mechanisms of embodied intelligent agents, stressing the need for rigorous safety testing before deployment in real-world scenarios [43][44].
GPT-4o遭越狱后指挥机器人做危险动作!全球首个具身智能体安全评测基准来了,大模型集体翻车
量子位· 2025-08-01 04:23
Core Viewpoint - The article discusses the alarming potential risks associated with embodied AI systems, particularly when they are subjected to "jailbreak" attacks, which can lead to dangerous behaviors in robots [2][8]. Group 1: Introduction to AGENTSAFE - A new comprehensive evaluation benchmark called AGENTSAFE has been proposed to address the safety of embodied intelligent agents, filling a gap in adversarial safety assessment [4]. - This groundbreaking research has received the Outstanding Paper Award at the ICML 2025 Multi-Agent Systems workshop [5]. - The research team plans to release datasets, code, and evaluation sandboxes for global researchers to utilize [6]. Group 2: Need for AGENTSAFE - The necessity for AGENTSAFE arises from the evolution of "jailbreak" attacks, which have shifted from generating harmful content to executing dangerous physical actions [8]. - Existing evaluation benchmarks primarily focus on task completion rates or obstacle avoidance, neglecting safety assessments under adversarial commands [9]. - The authors emphasize the importance of proactively identifying safety vulnerabilities before any harm occurs [10][11]. Group 3: AGENTSAFE Framework - AGENTSAFE simulates 45 real indoor scenarios with 104 interactive objects, creating a dataset of 9,900 dangerous commands inspired by Asimov's "Three Laws of Robotics" [14][15]. - The framework incorporates six advanced "jailbreak" attack methods to disguise dangerous commands, making them harder to detect [15]. - AGENTSAFE features an end-to-end evaluation design that assesses the entire process from perception to action execution, ensuring a comprehensive safety evaluation [16][18]. Group 4: Evaluation Metrics and Results - The evaluation process is divided into three stages: perception, planning, and execution, with specific metrics for assessing safety [30]. - Experimental results indicate that top models perform well on safe commands but show significant variability when faced with dangerous instructions [33][34]. - The study reveals that once commands are subjected to "jailbreak" attacks, the safety of all models declines sharply, with notable drops in refusal rates for harmful commands [37][38]. Group 5: Conclusion and Implications - The findings highlight the current vulnerabilities in embodied intelligent agents regarding safety measures [42]. - The authors stress the need to focus on what these models should not do, advocating for safety testing before deployment in real-world scenarios [43].
多模态大模型存在「内心预警」,无需训练,就能识别越狱攻击
机器之心· 2025-07-21 08:43
Core Viewpoint - The rise of multimodal large models (LVLMs) has led to significant advancements in tasks such as image-text question answering and visual reasoning, but they are more susceptible to "jailbreaking" attacks compared to pure text models [2][5]. Group 1: Multimodal Model Security Challenges - LVLMs, such as GPT-4V and LLaVA, integrate images and text, enhancing their capabilities but also exposing them to security vulnerabilities [2]. - Existing methods to enhance model security, including cross-modal safety fine-tuning and external discriminator modules, face challenges such as high training costs and poor generalization [3]. Group 2: HiddenDetect Methodology - Researchers from CUHK MMLab and Taotian Group introduced HiddenDetect, a novel jailbreak detection method that does not require training [5]. - The core finding is that LVLMs retain rejection signals in their hidden states even when they generate inappropriate content, particularly in intermediate layers [5][9]. Group 3: Analysis of Rejection Signals - The study constructs a "rejection semantic vector" (RV) from frequently occurring tokens that indicate refusal, allowing for the measurement of rejection signal strength across model layers [9]. - Experimental results show significant differences in rejection signal strength between safe and unsafe inputs, with intermediate layers being more sensitive to safety concerns [9][10]. Group 4: Input Type Sensitivity - The analysis reveals that different input modalities activate distinct safety pathways, with text inputs showing quicker rejection signal activation compared to image-text inputs [17][19]. - The presence of visual modalities can delay the model's rejection response, weakening its safety mechanisms [19]. Group 5: Experimental Results and Effectiveness - The HiddenDetect method was evaluated across multiple mainstream LVLMs, demonstrating robust performance against various attack types while maintaining good generalization capabilities [23]. - The method achieved high detection effectiveness, with the proposed approach outperforming existing methods in terms of robustness and generalization [24]. Group 6: Future Directions - The research emphasizes the importance of safety in deploying large models in real-world applications and aims to expand the capabilities of the detection method while exploring the relationship between modality information and model safety [28].