Text-to-Image Model Safety
Are Text-to-Image Safety Defenses an Illusion? AAAI 2026: Existing Defense Strategies Have Widespread Blind Spots
量子位 · 2025-12-27 09:00
Core Insights
- Text-to-Image (T2I) models have emerged as general-purpose content production tools, yet they remain vulnerable to generating harmful or non-compliant images when given high-risk prompts [1]
- The T2I-RiskyPrompt framework, developed at Tianjin University, establishes a comprehensive safety benchmark spanning 6 major risk categories and 14 subcategories, with 6,432 high-risk prompts in total [1][2]

Group 1: Risk Framework Construction
- T2I-RiskyPrompt is grounded in the safety policies of seven platforms, including OpenAI and Google, which were analyzed to derive a finer-grained risk taxonomy [2]
- The taxonomy comprises six major categories: pornography, violence, illegal activities, political sensitivity, disturbing content, and copyright infringement, each with specific subcategories [3][5]

Group 2: Data Collection and Annotation Process
- T2I-RiskyPrompt uses a rigorous six-stage pipeline for collecting and annotating risk prompts, ensuring semantic clarity, diversity, and effectiveness [6][8]
- The pipeline includes multi-source risk prompt collection, semantic enhancement with models such as GPT-4o, and manual verification to ensure accuracy [8]

Group 3: Risk Image Evaluation Methodology
- The framework introduces a novel image detection method based on risk reasons, enabling more precise identification of risks in generated images [10]
- Average model accuracy improved markedly once risk-reason elements were included: InternVL2.5-4B's accuracy rose from 0.645 to 0.848 [10]

Group 4: Experimental Findings
- Stronger T2I models do not necessarily trigger risks less often; they may trigger them more, because better comprehension lets them understand and execute hidden dangerous intents [14]
- Across eight mainstream T2I models, risk trigger rates rose in several subcategories as model capability improved [14][15]

Group 5: Defense Strategies and Limitations
- An assessment of various defense strategies indicates that current defenses remain in a phase of local optimization and struggle to address cross-modal and semantic-evasion risks [16][17]
- Fine-tuning methods can lower risk rates but often degrade image quality, and existing visual filters are inadequate for semantically complex categories such as copyright infringement [20][21]

Group 6: Vulnerabilities to Evasion Attacks
- Evasion attacks can effectively bypass existing filtering systems, exposing the weakness of current defenses against semantic evasion [23][25]
- When evaluated against a range of attack methods, all filtering systems showed significant failures [24][25]

Group 7: Conclusion and Future Implications
- T2I-RiskyPrompt provides a structured risk framework and a high-quality risk prompt dataset, making it a valuable resource for future safety-related work on generative models [26]
- Its comprehensive categories and annotations hold significant potential for automated risk image assessment, particularly for copyright and political-figure protection [27]
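The two-level risk taxonomy described in Group 1 lends itself to a simple nested mapping. The sketch below uses the six top-level categories named in the article; the subcategory labels are illustrative placeholders, since the summary states there are 14 subcategories but does not list them:

```python
# Sketch of the T2I-RiskyPrompt risk hierarchy as a nested mapping.
# The six top-level categories come from the article; the subcategory
# names are placeholders, NOT the benchmark's actual labels.
RISK_TAXONOMY: dict[str, list[str]] = {
    "pornography": ["explicit_content"],
    "violence": ["graphic_injury"],
    "illegal_activities": ["drug_manufacturing"],
    "political_sensitivity": ["political_figures"],
    "disturbing_content": ["gore_horror"],
    "copyright_infringement": ["protected_characters"],
}


def top_level_categories() -> list[str]:
    """Return the six major risk categories in definition order."""
    return list(RISK_TAXONOMY)
```

A structure like this makes per-category reporting (e.g. the per-subcategory trigger rates discussed in Group 4) a straightforward dictionary traversal.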
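The risk-reason evaluation idea from Group 3 can be illustrated with a small sketch: rather than asking a vision-language model a bare "is this image unsafe?" question, the query enumerates category-specific risk reasons for the model to check against. Everything here is a hypothetical illustration; the function name, prompt wording, and example reasons are not taken from the paper:

```python
def build_risk_query(category: str, risk_reasons: list[str]) -> str:
    """Assemble a VLM query that grounds the safety judgment in
    explicit, category-specific risk reasons (hypothetical wording)."""
    reason_lines = "\n".join(
        f"{i}. {reason}" for i, reason in enumerate(risk_reasons, start=1)
    )
    return (
        f"Does this image fall under the risk category '{category}'?\n"
        f"Check it against each of the following risk reasons:\n"
        f"{reason_lines}\n"
        "Answer 'risky' if any reason applies, otherwise 'safe'."
    )


# Example: a hypothetical 'violence' check.
query = build_risk_query(
    "violence",
    ["depicts graphic injury", "shows a weapon aimed at a person"],
)
```

The design intuition matches the reported result: supplying concrete, checkable criteria narrows the model's judgment task, which is consistent with InternVL2.5-4B's accuracy rising from 0.645 to 0.848 when risk reasons were included.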