ConsistEdit Arrives: A Training-Free New Paradigm for High-Precision, High-Consistency Visual Editing
机器之心 · 2025-11-19 02:09
Core Insights
The article discusses advances in training-free visual editing, focusing on ConsistEdit, an approach designed for the Multi-Modal Diffusion Transformer (MM-DiT) architecture that addresses key challenges in visual generation [5][7][34].

Research Background
The article identifies two main pain points in current visual editing methods: the difficulty of balancing editing strength against consistency with the source image, and the lack of fine-grained control over editing strength [5].

Key Findings
Three critical discoveries about the MM-DiT architecture are highlighted:
1. Editing only the visual tokens yields stable results, while modifying the text tokens can introduce distortions [9].
2. All layers of MM-DiT retain structural information, so edits can be applied to every attention layer rather than only the last few [11].
3. The Q/K tokens precisely govern structural consistency, while the V tokens primarily carry content texture, enabling decoupled control of structure and texture [15].

Method Design
ConsistEdit introduces three core operations (a hedged code sketch follows below):
1. Visual-only attention control, which maintains strong consistency while still following the text instruction [19].
2. Mask-guided attention fusion, which cleanly separates edited from non-edited regions [20].
3. Differentiated control of the Q/K/V tokens, which enables a smooth transition from complete structure preservation to free structural modification [21].

Experimental Validation
ConsistEdit is validated against five mainstream methods on the PIE-Bench dataset, demonstrating advantages in both image and video editing tasks [22].

Generalization
ConsistEdit adapts to various MM-DiT variants, including Stable Diffusion 3, showing its versatility across models [31].

Application Prospects
The high consistency and fine-grained control of ConsistEdit suit a wide range of visual creation scenarios, from static images to dynamic video, expanding the possibilities for interactive creation [34].
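To make the three operations concrete, here is a minimal sketch of how ConsistEdit-style attention control could look inside one MM-DiT joint-attention layer. Everything here is an illustrative assumption, not the paper's actual API: the function name consist_attention, the tensor layout (text tokens first, visual tokens after), and the sigma knob are all made up for this sketch, and a real implementation would hook this logic into the model's attention forward pass.

```python
# Hedged sketch of ConsistEdit-style attention control in an MM-DiT
# joint-attention layer. All names and shapes are assumptions.
import torch
import torch.nn.functional as F

def consist_attention(edit_q, edit_k, edit_v,   # edit-branch projections
                      src_q, src_k, src_v,      # source/reconstruction-branch projections
                      n_txt,                    # number of text tokens (prefix)
                      mask,                     # (n_vis,) 1 = region to edit, 0 = keep
                      sigma=1.0):               # 1 = full structure preservation, 0 = free edit
    """Each tensor has shape (batch, heads, n_txt + n_vis, head_dim):
    text tokens first, visual tokens after, as in MM-DiT joint attention."""
    q, k, v = edit_q.clone(), edit_k.clone(), edit_v.clone()
    vis = slice(n_txt, None)  # visual-token positions

    # Operation 1 (visual-only control): text tokens are never touched;
    # only the visual part of Q/K/V is steered toward the source branch.
    # Operation 3 (Q/K carry structure): interpolate visual Q/K toward the
    # source branch; sigma=1 copies the source structure, sigma=0 keeps the edit's.
    q[..., vis, :] = sigma * src_q[..., vis, :] + (1 - sigma) * edit_q[..., vis, :]
    k[..., vis, :] = sigma * src_k[..., vis, :] + (1 - sigma) * edit_k[..., vis, :]

    # Operation 2 (mask-guided fusion) on V, which mainly carries texture:
    # outside the edit mask, reuse the source V tokens; inside it, keep the
    # edit branch's V so the instructed change can appear.
    m = mask.view(1, 1, -1, 1).to(v.dtype)  # broadcast over batch/heads/dim
    v[..., vis, :] = m * edit_v[..., vis, :] + (1 - m) * src_v[..., vis, :]

    # Standard scaled-dot-product attention over the fused projections.
    return F.scaled_dot_product_attention(q, k, v)
```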
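Per finding 2 above, such a control would be applied at every MM-DiT attention layer rather than only the last few, and sweeping sigma from 1 toward 0 would trace the smooth transition from complete structure preservation to free modification described in the method design. A toy invocation of the sketch above, with random tensors and hypothetical shapes, just to show how the pieces fit:

```python
# Toy invocation: shapes and token counts are arbitrary assumptions.
B, H, D, n_txt, n_vis = 1, 8, 64, 77, 1024
shape = (B, H, n_txt + n_vis, D)
eq, ek, ev = (torch.randn(shape) for _ in range(3))   # edit-branch Q/K/V
sq, sk, sv = (torch.randn(shape) for _ in range(3))   # source-branch Q/K/V
mask = torch.zeros(n_vis)
mask[:256] = 1.0                                      # edit the first 256 visual patches
out = consist_attention(eq, ek, ev, sq, sk, sv, n_txt, mask, sigma=0.8)
print(out.shape)                                      # torch.Size([1, 8, 1101, 64])
```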