Layout Control + Identity Consistency: Zhejiang University Proposes ContextGen, a New SOTA for Layout-Anchored Multi-Instance Generation

Core Insights
- The article covers recent progress in image generation, focusing on the two main challenges of Multi-Instance Image Generation (MIG): layout control and identity preservation [2][5].

Group 1: ContextGen Framework
- ContextGen is a new framework built on the Diffusion Transformer (DiT) that targets both layout control and identity preservation in MIG tasks [5][6].
- The framework uses two core mechanisms that operate on a unified context token sequence, improving both layout accuracy and identity fidelity [8][10].

Group 2: Mechanisms of ContextGen
- Contextual Layout Anchoring (CLA) provides global context guidance: it uses a user-designed or model-generated layout image to enforce precise global layout control and to supply initial identity information [10].
- Identity Consistency Attention (ICA) injects identity information from high-fidelity reference images into the corresponding target locations, ensuring identity consistency across multiple instances [12].

Group 3: Data Foundation
- IMIG-100K is introduced as the first large-scale, finely annotated dataset designed for image-guided multi-instance generation, with multiple difficulty levels and detailed layout and identity annotations [14].

Group 4: Performance Optimization
- ContextGen adds a reinforcement learning stage based on preference optimization (DPO, Direct Preference Optimization) that encourages creativity and diversity in the generated images, rather than rigid replication of the layout image's content [17].

Group 5: Experimental Validation
- In quantitative and qualitative evaluations, ContextGen outperforms all open-source models and matches closed-source commercial models on identity consistency [21][25].
- On the LAMICBench++ benchmark, ContextGen improves the average score by +1.3% over existing open-source models, demonstrating its capability in complex multi-instance scenarios [21].
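
The article does not describe how the DPO stage in Group 4 is formulated for a diffusion model, so the following is only a minimal sketch of the generic DPO objective for a single preference pair, not ContextGen's training code. The function name, argument names, and the `beta` value are illustrative assumptions; diffusion-model variants generally substitute a denoising-based surrogate for the log-probabilities used here.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    All names and the default beta are hypothetical, chosen for illustration.
    logp_*      log-probability of each sample under the model being trained
    ref_logp_*  log-probability of the same sample under a frozen reference model
    """
    # How much more the trained model favors each sample than the reference does.
    win_margin = logp_win - ref_logp_win
    lose_margin = logp_lose - ref_logp_lose
    logits = beta * (win_margin - lose_margin)
    # -log(sigmoid(logits)), computed as softplus(-logits) in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Model identical to the reference: no preference has been learned yet.
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Model has shifted probability toward the preferred sample: lower loss.
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

When the trained model and the reference agree, the loss equals log 2; it decreases as the trained model assigns relatively more probability to the preferred sample than the reference does.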

Group 6: User Interaction
- The project includes a user-friendly front-end interface where users can upload reference images, add new assets via text, and design layouts by drag and drop [32].

Group 7: Future Outlook
- The ReLER team plans to further optimize the model architecture and explore more diverse interaction methods to serve a broader range of applications, stressing the importance of understanding user intent and multimodal references [36].