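Generating from reference images without rote-memorizing layouts: layout control for multi-instance generation is finally "controllable with no identity mix-ups" | Zhejiang University team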
量子位 (QbitAI) · 2025-12-19 07:20
Core Insights
- The article discusses the central challenge in Multi-Instance Image Generation (MIG): achieving precise layout control while keeping each instance's identity consistent with its reference image [1][3]
- A new framework called ContextGen, developed by Zhejiang University's ReLER team, addresses this challenge with a dual-context attention mechanism [4][52]
- ContextGen achieves state-of-the-art (SOTA) performance on several benchmarks, with clear gains in spatial accuracy and identity preservation [19][20][24]

Group 1: Challenges in MIG
- Existing methods struggle to balance layout control and identity consistency when generating multiple instances [1][3]
- Techniques that allow explicit layout control often cannot customize instances based on reference images [2]
- Conversely, methods that use reference images struggle with precise layout control and lose identity information as the number of instances grows [3]

Group 2: ContextGen Framework
- ContextGen uses a hierarchical decoupling of context to address both layout control and identity fidelity [5]
- The framework introduces a dual-context attention mechanism that assigns global control and local identity injection to different levels of the Diffusion Transformer (DiT) model [7][52]
- Contextual Layout Anchoring (CLA) provides robust global structure and position anchoring by integrating layout images with instance location information [8][9]
- Identity Consistency Attention (ICA) addresses detail loss, particularly in overlapping areas, to ensure high-fidelity identity injection [11][12]

Group 3: Data and Optimization
- The IMIG-100K dataset, a large-scale synthetic dataset designed for image-guided multi-instance generation tasks, has been released to address the scarcity of high-quality training data [13][14]
- ContextGen incorporates a reinforcement learning phase based on Direct Preference Optimization (DPO) to encourage diverse image generation while maintaining identity [16][19]

Group 4: Performance Metrics
- ContextGen shows a 5.9% improvement in spatial accuracy (mIoU) on the COCO-MIG benchmark compared to baseline models [20]
- On the LayoutSAM-Eval benchmark, ContextGen achieves SOTA across multiple metrics, particularly in maintaining instance attributes such as color, texture, and shape [20][24]
- The framework outperforms existing open-source and closed-source models in identity preservation [24][26]

Group 5: User Experience and Future Directions
- A user-friendly front-end has been developed that supports uploading multiple reference images, automatic image segmentation, and custom layout design [50]
- The article emphasizes the importance of dynamic identity adaptation as generative models evolve, highlighting the need to better understand and coordinate users' text intentions and visual references [53]
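
The spatial accuracy figure reported in Group 4 is a mean Intersection over Union (mIoU). As a rough illustration of what that metric measures, the sketch below averages the IoU between each requested layout box and the box where the instance was actually found. The function names, the `(x_min, y_min, x_max, y_max)` box format, and the one-to-one pairing of targets with detections are assumptions made for this example; the summary does not describe how COCO-MIG pairs or detects instances.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(targets: Sequence[Box], detections: Sequence[Box]) -> float:
    """Average IoU over target/detection pairs given in the same order.

    Hypothetical helper: assumes detections[i] is the detected box
    for the instance requested at targets[i].
    """
    if len(targets) != len(detections):
        raise ValueError("targets and detections must have the same length")
    if not targets:
        return 0.0
    total = sum(box_iou(t, d) for t, d in zip(targets, detections))
    return total / len(targets)
```

A perfectly placed instance contributes an IoU of 1 and a completely misplaced one contributes 0, so a higher mIoU means the generated instances sit closer to the requested layout.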
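
The preference-optimization phase mentioned in Group 3 can be illustrated with the generic DPO objective for a single preference pair. This is a minimal sketch of the standard DPO loss, not ContextGen's training code: the summary does not say how the loss is adapted to a diffusion model or how preferred and rejected samples are chosen, so the function name, the scalar log-probability inputs, and the `beta` default are all assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Generic DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much
    more the trained policy favors the chosen sample over the rejected
    one than the frozen reference model does.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference model agree, the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the preferred sample.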