Image editing too slow and too coarse? A new open-source autoregressive model delivers precise edits in seconds | 智象未来 (HiDream.ai)
量子位·2025-09-02 10:45

Core Viewpoint
- AI image editing, and diffusion models in particular, is advancing rapidly but still faces two obstacles: modifying one detail tends to disturb the rest of the image, and generation is too slow for real-time interaction [1][2].

Group 1: Introduction of VAREdit
- HiDream.ai has introduced VAREdit, a new autoregressive image editing framework designed to address these challenges [2][3].
- VAREdit builds on a Visual Autoregressive (VAR) architecture that markedly improves both editing accuracy and generation speed, marking a new phase for image editing [3][5].

Group 2: Technical Details
- VAREdit frames image editing as a next-scale prediction problem: it autoregressively generates the target feature residuals of the next scale, enabling precise edits [5].
- The model encodes images as multi-scale residual visual token sequences; features are accumulated through codebook lookups followed by upsampling at each scale [6] (a minimal sketch of this accumulation appears at the end of this summary).

Group 3: Model Design Challenges
- A core design challenge is how to inject source-image information into the backbone network as the reference for generating each target scale [12].
- Two straightforward schemes were explored first: full-scale conditioning, which sharply increases computational cost, and finest-scale conditioning, which causes a scale mismatch with the coarser target scales [13][14].

Group 4: Scale Alignment Reference Module
- The Scale Alignment Reference (SAR) module is proposed as a hybrid solution: the first layer receives scale-aligned, multi-scale references, while subsequent layers attend only to the finest-scale source features [17] (see the second sketch below).
- This arrangement improves performance by distributing attention more appropriately across scales [15].

Group 5: Benchmark Performance
- In benchmark tests VAREdit outperforms competing methods on both CLIP and GPT metrics, indicating higher editing accuracy [18][19].
- The 8.4B VAREdit model improves the GPT-Balance metric by 41.5% over ICEdit and by 30.8% over UltraEdit; the lightweight 2.2B model also posts significant gains [19].

Group 6: Speed and Efficiency
- VAREdit holds a clear speed advantage: the 8.4B model completes an edit of a 512×512 image in 1.2 seconds, about 2.2 times faster than diffusion models of comparable size [20].
- The 2.2B model needs only 0.7 seconds, delivering a near-instant editing experience while maintaining high quality [20].

Group 7: Versatility and Future Directions
- VAREdit is versatile, achieving the best results across most editing types; the larger model compensates for the smaller model's weaknesses in global style and text editing [23].
- The HiDream.ai team plans to keep exploring next-generation multimodal image editing architectures to further improve the quality, speed, and controllability of instruction-guided image generation [27].
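
As a rough illustration of the multi-scale residual tokenization described in Group 2, the PyTorch sketch below accumulates a feature map from per-scale codebook lookups plus upsampling. It is a minimal sketch under assumed shapes and names (accumulate_multiscale_residuals, token_maps, codebook are hypothetical), not HiDream.ai's released code.

```python
import torch
import torch.nn.functional as F

def accumulate_multiscale_residuals(token_maps, codebook, target_size):
    """Accumulate a feature map from multi-scale residual token maps.

    token_maps : list of LongTensor index maps, coarsest to finest,
                 each of shape (h_k, w_k) -- hypothetical layout.
    codebook   : (vocab_size, dim) embedding table.
    target_size: (H, W) spatial size of the finest scale.
    """
    H, W = target_size
    dim = codebook.shape[1]
    feat = torch.zeros(1, dim, H, W)
    for idx_map in token_maps:
        # Codebook lookup turns token indices into residual features.
        residual = codebook[idx_map]                        # (h_k, w_k, dim)
        residual = residual.permute(2, 0, 1).unsqueeze(0)   # (1, dim, h_k, w_k)
        # Upsample the residual to the finest resolution and accumulate it.
        feat = feat + F.interpolate(residual, size=(H, W),
                                    mode="bilinear", align_corners=False)
    return feat

# Toy usage: three scales of a 16x16 map with a 4096-entry, 32-dim codebook.
codebook = torch.randn(4096, 32)
token_maps = [torch.randint(0, 4096, (s, s)) for s in (4, 8, 16)]
features = accumulate_multiscale_residuals(token_maps, codebook, (16, 16))
print(features.shape)  # torch.Size([1, 32, 16, 16])
```

At inference, the transformer would predict each scale's token map in turn, with the accumulated map then decoded back to pixels by the visual tokenizer's decoder.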
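
The SAR idea from Group 4 — scale-aligned source references in the first layer, finest-scale references in all later layers — can be caricatured as follows. This is a toy sketch under stated assumptions: a plain TransformerEncoderLayer stands in for the actual backbone block, causal and per-scale attention masks are omitted, and the class and argument names (ScaleAlignedConditioning, scale_sizes) are hypothetical rather than taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAlignedConditioning(nn.Module):
    """Toy illustration of the SAR idea: the first layer sees the source
    features resampled to every target scale, deeper layers see only the
    finest-scale source features."""

    def __init__(self, dim, n_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        ])

    @staticmethod
    def resample(src_feat, size):
        # src_feat: (1, dim, H, W) finest-scale source features.
        return F.interpolate(src_feat, size=size, mode="area")

    def forward(self, target_tokens, src_feat, scale_sizes):
        # target_tokens: (1, L, dim) flattened multi-scale target tokens.
        # scale_sizes:   list of (h_k, w_k), coarsest to finest target scale.
        for i, layer in enumerate(self.layers):
            if i == 0:
                # First layer: scale-aligned references, one copy per scale.
                ref = torch.cat([
                    self.resample(src_feat, s).flatten(2).transpose(1, 2)
                    for s in scale_sizes
                ], dim=1)
            else:
                # Later layers: finest-scale reference only.
                ref = src_feat.flatten(2).transpose(1, 2)
            # Prepend the reference tokens as conditioning context.
            x = layer(torch.cat([ref, target_tokens], dim=1))
            target_tokens = x[:, ref.shape[1]:]
        return target_tokens
```

The design intent it mimics: coarse target scales get references matched to their resolution (avoiding the scale mismatch of finest-scale-only conditioning), while the cost of carrying full multi-scale references through every layer (the full-scale-conditioning scheme) is avoided by restricting them to the first layer.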