ICCV 2025 | HKUST and the University of Oxford release AlignGuard, a scalable safety alignment framework for text-to-image generation models
机器之心· 2025-10-30 03:49
Core Viewpoint
- The article discusses AlignGuard, a scalable safety alignment framework for text-to-image generation models, which uses direct preference optimization (DPO) to strengthen safety measures against harmful content generation [3][24].

Group 1: Background and Motivation
- The widespread application of text-to-image generation models has raised concerns about the potential for users to generate harmful content, either unintentionally or maliciously [3].
- Existing safety measures rely primarily on text filtering or concept-removal strategies, which are limited in scope [3].

Group 2: AlignGuard Framework
- AlignGuard introduces a scalable safety alignment method designed specifically for diffusion models, allowing the removal of harmful content while maintaining high-quality image generation [7].
- The framework is built around the CoProV2 dataset, which includes both harmful and safe image-text pairs generated using large language models (LLMs) [8][14].

Group 3: Dataset and Training Architecture
- CoProV2 consists of 23,690 image-text pairs across 7 categories and 723 concepts, providing more comprehensive coverage than existing datasets such as UD and I2P [10][14].
- AlignGuard employs direct preference optimization to train specialized LoRA matrices for the various harmful categories, such as "hate," "sexual," and "violence," enabling efficient concept removal [11].

Group 4: Expert LoRA Merging Strategy
- The merging strategy for the different safety experts is based on signal-strength analysis, allowing multiple LoRA experts to be integrated into a single model while balancing computational cost and safety performance [13][20].
- This strategy balances the weights of the different safety experts, minimizing conflicts and maximizing overall safety performance [20].
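The signal-strength-weighted merge described in Group 4 can be sketched as follows. This is a minimal illustration for a single layer, assuming each expert's contribution is weighted by the Frobenius norm of its low-rank update; the function name and the norm-based weighting are hypothetical stand-ins, since the article does not spell out AlignGuard's exact signal-strength metric:

```python
import numpy as np

def merge_lora_experts(experts):
    """Merge per-category LoRA experts into one update for a single layer.

    `experts` maps a category name (e.g. "hate", "violence") to its
    low-rank factors (A, B), where the expert's weight update is B @ A.
    Each expert is scaled by its relative "signal strength", here
    approximated by the Frobenius norm of its delta (an assumption,
    not necessarily the paper's exact metric).
    """
    deltas = {name: B @ A for name, (A, B) in experts.items()}
    norms = {name: np.linalg.norm(d) for name, d in deltas.items()}
    total = sum(norms.values())
    # Normalized weights balance the experts and limit interference
    # between safety categories when they share one set of weights.
    return sum((norms[name] / total) * d for name, d in deltas.items())

# Toy example: two rank-2 experts for a 4x4 weight matrix.
rng = np.random.default_rng(0)
experts = {
    "hate":     (rng.normal(size=(2, 4)), rng.normal(size=(4, 2))),
    "violence": (rng.normal(size=(2, 4)), rng.normal(size=(4, 2))),
}
merged = merge_lora_experts(experts)  # a single (4, 4) update
```

Because the weights are normalized to sum to one, a dominant expert cannot drown out the others, which matches the article's claim that the merge minimizes conflicts between safety experts.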
Group 5: Experimental Results
- AlignGuard removes 7 times more harmful concepts than existing methods while maintaining image generation quality and text-image alignment [16][24].
- In quantitative evaluations, AlignGuard outperforms existing methods on unseen datasets, demonstrating robust generalization [16].

Group 6: Conclusion
- AlignGuard's innovations include the scalable application of DPO to the safety domain, an expert-system architecture for training specialized LoRA matrices, and the generation of the CoProV2 dataset for training [24].
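The DPO objective at the core of this approach optimizes a preference between a preferred (safe) and a dispreferred (harmful) generation. Below is a minimal sketch of the generic DPO loss for one preference pair, assuming per-sample log-likelihoods from the trained model and a frozen reference model; the variable names are illustrative, and diffusion-model variants of DPO substitute denoising errors for exact log-likelihoods:

```python
import math

def dpo_loss(logp_safe, logp_harm, ref_logp_safe, ref_logp_harm, beta=0.1):
    """Generic DPO loss for one (safe, harmful) preference pair.

    The model is rewarded for widening the likelihood margin between the
    safe (preferred) and harmful (dispreferred) sample, measured relative
    to a frozen reference model; beta controls how far the trained model
    may drift from that reference.
    """
    margin = (logp_safe - ref_logp_safe) - (logp_harm - ref_logp_harm)
    # Negative log-sigmoid of the scaled margin: zero margin gives log(2),
    # and the loss decreases as the safe sample becomes relatively likelier.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

base = dpo_loss(-5.0, -5.0, -5.0, -5.0)    # no margin: log(2) ≈ 0.693
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # positive margin: lower loss
```

Training one such objective per harmful category, with LoRA as the trainable parameters, is what makes the per-category expert setup described above cheap enough to scale.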