Multi-Subject Controlled Image Generation

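ByteDance's New Image Generation Model: Focused on Multi-Subject Consistency, with a New Benchmark Dataset Unveiled Alongside
量子位 (QbitAI) · 2025-07-02 09:33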
Core Viewpoint
- ByteDance has introduced XVerse, a multi-subject controlled generation model that allows precise control over each subject without compromising image quality [2][6].

Group 1: XVerse Overview
- XVerse builds on the Diffusion Transformer (DiT) to control the identities and semantic attributes of multiple subjects consistently [6].
- The model comprises four key components: the T-Mod adapter, the text-stream modulation mechanism, a VAE-encoded image feature module, and regularization techniques [8][10][11].

Group 2: Key Components
- The T-Mod adapter uses a perceiver resampler to combine CLIP-encoded image features with text prompt features, generating cross offsets for precise control [8].
- The text-stream modulation mechanism converts reference images into modulation offsets, so that each subject is controlled accurately during generation [9].
- The VAE-encoded image feature module improves detail retention, producing more realistic images with fewer artifacts [10].

Group 3: Regularization Techniques
- XVerse introduces two regularization techniques to improve generation quality and consistency: a region preservation loss and a text-image attention loss [11][12].
- To evaluate the model, the team also presents the XVerseBench benchmark, which covers 20 human identities, 74 unique objects, and 45 different animal species, with 300 unique test prompts [11].

Group 4: Evaluation Metrics
- XVerseBench scores models along several dimensions: DPG score, Face ID similarity, DINOv2 similarity, and aesthetic score [12][13].
- Respectively, these metrics assess the model's editing capability, identity maintenance, object feature retention, and the overall aesthetic quality of generated images [13].

Group 5: Comparative Performance
- XVerse has been compared with leading multi-subject generation methods and performs better at preserving subject identity and keeping generated objects relevant to their references [14][15].
- Quantitative results show XVerse achieving an average score of 73.40 across the metrics, outperforming several other models [15].

Group 6: Research Background
- The ByteDance Intelligent Creation Team has long focused on consistency in AIGC, developing generation models and algorithms for multi-modal content creation [17].
- Earlier work includes DreamTuner for high-fidelity identity retention and DiffPortrait3D for 3D modeling, which laid the groundwork for XVerse [18][19][21].

Group 7: Future Directions
- The team aims to make AI creation more creative and engaging, in line with everyday needs and aesthetic experience [22].
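
As an illustration of the modulation offsets described under Key Components, here is a minimal sketch in plain Python. It assumes the adaptive layer-norm form commonly used in DiT blocks (normalize, then scale and shift) and adds an offset only to the text tokens that refer to a controlled subject. The function names, the `subject_mask` argument, and all values are hypothetical; this is not XVerse's actual implementation.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def modulate_text_tokens(tokens, shift, scale, subject_mask,
                         shift_offset, scale_offset):
    """Shared modulation plus per-subject offsets (hypothetical sketch).

    tokens:        list of T token vectors, each of length D
    shift, scale:  length-D modulation shared by every token
    subject_mask:  T booleans, True for tokens that refer to the subject
    *_offset:      length-D offsets, assumed to come from a reference-image adapter
    """
    out = []
    for token, is_subject in zip(tokens, subject_mask):
        normed = layer_norm(token)
        gate = 1.0 if is_subject else 0.0  # offsets apply to subject tokens only
        out.append([
            n * (1.0 + sc + gate * d_sc) + sh + gate * d_sh
            for n, sh, sc, d_sh, d_sc
            in zip(normed, shift, scale, shift_offset, scale_offset)
        ])
    return out

# Toy inputs: three text tokens of width 4; only the second names the subject.
tokens = [[0.5, -1.0, 2.0, 0.0], [1.0, 3.0, -2.0, 0.5], [0.2, 0.4, 0.6, 0.8]]
subject_mask = [False, True, False]
shift = [0.1, 0.0, -0.1, 0.2]
scale = [0.0, 0.1, 0.0, -0.1]
shift_offset = [0.05, -0.02, 0.03, 0.01]
scale_offset = [0.1, 0.0, -0.05, 0.02]
zeros = [0.0] * 4

baseline = modulate_text_tokens(tokens, shift, scale, subject_mask, zeros, zeros)
controlled = modulate_text_tokens(tokens, shift, scale, subject_mask,
                                  shift_offset, scale_offset)
```

In this sketch only the masked token changes between `baseline` and `controlled`, which is the intuition behind controlling one subject without disturbing the rest of the prompt.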
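
The similarity metrics and the averaged score mentioned under Evaluation Metrics and Comparative Performance can be sketched as follows. The sketch assumes that Face ID and DINOv2 similarity are cosine similarities between embedding vectors, and that the overall score is an unweighted mean of the metrics on a 0 to 100 scale; the source states neither detail. Every number below is a made-up placeholder, not a reported result.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def overall_score(metrics):
    """Unweighted mean of the metric values (an assumed aggregation rule)."""
    return sum(metrics.values()) / len(metrics)

# Placeholder embeddings standing in for a reference face and a generated face.
reference_embedding = [0.2, 0.9, -0.4]
generated_embedding = [0.25, 0.8, -0.5]
id_sim = cosine_similarity(reference_embedding, generated_embedding)

# Placeholder metric values; only face_id_sim is computed above.
placeholder_metrics = {
    "dpg": 90.0,
    "face_id_sim": 100.0 * id_sim,
    "dinov2_sim": 70.0,
    "aesthetic": 55.0,
}
score = overall_score(placeholder_metrics)
```

A real evaluation would obtain the embeddings from a face recognition model and from DINOv2, and would follow the benchmark's own scaling for each metric.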