AAAI 2026 Oral | Safety Alignment of Large Vision-Language Models via Visual Safety Prompts and Deep Alignment
机器之心 · 2025-11-24 07:27
Core Viewpoint
- The article discusses the emerging safety risks of large vision-language models (LVLMs) and introduces DAVSP (Deep Aligned Visual Safety Prompt), a method developed by researchers at Tsinghua University to strengthen the safety alignment of these models against malicious inputs [2][5][7].

Research Background and Issues
- LVLMs perform impressively on multimodal tasks, but their safety vulnerabilities are becoming apparent: attackers can embed malicious intent within images and induce harmful outputs [5].
- Existing lightweight safety alignment methods, such as prepending textual safety prompts, fall short in multimodal scenarios, because attackers can bypass text-only prompts by hiding threats in images [5][6].
- Recent approaches such as ESIII and UniGuard attempt to improve resistance to malicious requests but still face significant challenges, including insufficient safety gains and noticeable degradation of general performance [5][6].

Method and Innovations: DAVSP
- DAVSP introduces two key innovations, the Visual Safety Prompt (VSP) and Deep Alignment (DA), to overcome the limitations of prior methods while preserving model performance [7][9].
- VSP replaces traditional pixel-level perturbations with a trainable border around the input image, strengthening the model's ability to recognize unsafe inputs without corrupting the original image features (see the first sketch at the end of this summary) [13][15].
- DA supervises the model's internal activations so that harmful and benign inputs become separable in activation space, deepening the model's understanding of what constitutes unsafe input (see the second sketch at the end of this summary) [14][16].

Experimental Results
- DAVSP was evaluated across multiple benchmarks and showed superior resistance to malicious attacks while keeping the model usable [17][18].
- In these tests, DAVSP achieved significantly higher resist success rates (RSR) than existing methods, reaching 98.72% and 99.12% on two different datasets [19][21].
- The method has minimal impact on the model's normal capabilities, with utility metrics comparable to using only a textual safety prompt [19][20].

Generalization and Component Importance
- Visual safety prompts trained with DAVSP exhibit generalization: they can be transferred across different models [20].
- Ablation studies confirm that both VSP and DA are essential to DAVSP's effectiveness; removing either component causes a significant drop in resistance to malicious attacks [22].
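To make the visual safety prompt idea concrete, below is a minimal PyTorch sketch of a trainable border as described above: the original image is kept intact in the center, and only the surrounding frame of pixels is a learnable parameter. This is an illustrative reconstruction, not the authors' released code; the class name, image size, and border width are assumptions.

```python
import torch
import torch.nn as nn

class VisualSafetyPromptBorder(nn.Module):
    """Hypothetical trainable border: the input image is preserved in the
    center and only a surrounding frame of pixels is optimized, so the
    original image features are left untouched."""

    def __init__(self, image_size: int = 336, border_width: int = 16):
        super().__init__()
        self.border_width = border_width
        padded = image_size + 2 * border_width
        # Learnable pixels for the full padded canvas; the center is masked out.
        self.border = nn.Parameter(torch.zeros(3, padded, padded))
        mask = torch.ones(1, padded, padded)
        mask[:, border_width:-border_width, border_width:-border_width] = 0.0
        self.register_buffer("mask", mask)  # 1 on the border, 0 in the center

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) with H = W = image_size; paste it into the
        # center of the trainable frame.
        b = self.border_width
        canvas = (self.border * self.mask).unsqueeze(0)
        canvas = canvas.expand(image.size(0), -1, -1, -1).clone()
        canvas[:, :, b:-b, b:-b] = image
        return canvas
```

During training, only `self.border` receives gradients, so the optimization budget is confined to the frame; in practice the padded result would still need to match the resolution the LVLM's vision encoder expects, a detail omitted here.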
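The summary states that Deep Alignment supervises internal activations so that harmful and benign inputs become separable. The sketch below shows one plausible form of such an objective, not necessarily the paper's exact loss: pooled hidden states from a chosen LVLM layer are projected onto a reference direction, and a hinge margin pushes the two classes apart. The layer choice, pooling, direction estimate, and margin value are all assumptions.

```python
import torch
import torch.nn.functional as F

def deep_alignment_loss(h_harmful: torch.Tensor,
                        h_benign: torch.Tensor,
                        direction: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """Hypothetical deep-alignment objective.

    h_harmful, h_benign: (B, D) pooled hidden states from a chosen layer,
                         computed on harmful and benign inputs respectively.
    direction:           (D,) reference vector separating the two classes
                         (e.g., the difference of class means, an assumption).
    """
    d = F.normalize(direction, dim=0)
    proj_harmful = h_harmful @ d   # larger = "more harmful" by convention
    proj_benign = h_benign @ d
    # Hinge-style margin: harmful projections should exceed benign ones
    # by at least `margin`, making the classes separable in activation space.
    return F.relu(margin - (proj_harmful.mean() - proj_benign.mean()))
```

Optimizing the border parameters of the previous sketch against a loss like this would steer the model's internal representation of unsafe inputs rather than just its surface outputs, which matches the intuition the summary attributes to DA.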