Better at Chinese and Detail Control than Nano Banana! Rabbitpre (兔展) & Peking University's UniWorld-V2 Sets a New SOTA
量子位 (QbitAI) · 2025-11-05 05:39

Core Viewpoint
- The article introduces UniWorld-V2, a new image editing model that excels at fine detail and at understanding Chinese-language instructions, outperforming previous models such as Nano Banana [1][4][6].

Group 1: Model Features
- UniWorld-V2 demonstrates superior fine-grained control in image editing, achieving results that surpass supervised fine-tuning (SFT) baselines [11].
- The model accurately interprets complex Chinese characters and phrases, and can render them in artistic fonts [11].
- Users can specify editing regions with bounding boxes, enabling precise operations such as moving an object out of a designated area [14].
- The model understands commands such as "re-light the scene," integrating objects naturally into the environment with highly coherent light and shadow [15].

Group 2: Technical Innovations
- The core innovation behind UniWorld-V2 is the UniWorld-R1 framework, which applies reinforcement learning (RL) policy optimization to image editing [18].
- UniWorld-R1 is the first unified RL-based training architecture for this task, using Diffusion Negative-aware Finetuning (DiffusionNFT) to train efficiently without likelihood estimation [19].
- The framework employs a multimodal large language model (MLLM) as the reward model, using its implicit feedback to align the editor with human intent (a toy sketch of this training loop appears at the end of this article) [19].

Group 3: Performance Metrics
- On GEdit-Bench, UniWorld-V2 scored 7.83, surpassing GPT-Image-1 (7.53) and Gemini 2.0 (6.32) [24].
- It also led ImgEdit with a score of 4.49, outperforming all known models [24].
- The method also lifts the base models it is applied to: on ImgEdit, FLUX.1-Kontext's score rose from 3.71 to 4.02, and Qwen-Image-Edit's from 4.35 to 4.48 [25].

Group 4: Generalization and User Preference
- UniWorld-R1 generalizes well, improving FLUX.1-Kontext's GEdit-Bench score from 6.00 to 6.74 [26].
- In user preference studies, participants favored UniWorld-FLUX.1-Kontext for its instruction alignment and editing capability, although the official model held a slight edge in image quality [27].

Group 5: Historical Context
- UniWorld-V2 builds on the earlier UniWorld-V1, the first unified understanding-and-generation model, released three months before notable models such as Google's Nano Banana [29].
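The article names the training ingredients of UniWorld-R1 (RL-style policy optimization, an MLLM reward model, DiffusionNFT) but gives no implementation details. The following is a minimal, self-contained PyTorch sketch of what such a loop could look like under those assumptions: every name in it (ToyEditor, mllm_reward, nft_step) is a hypothetical stand-in, the reward is a toy cosine similarity rather than a real MLLM judge, and the best-versus-worst contrast is a simplification of the negative-aware idea, not the paper's actual method.

```python
# Hypothetical sketch of an RL-style editing loop with an external judge.
# Nothing here is UniWorld-R1's real API; it only illustrates the pattern:
# sample candidate edits, score them, and update the editor from a
# positive/negative contrast without any likelihood estimation.

import torch
import torch.nn as nn

class ToyEditor(nn.Module):
    """Stand-in for a diffusion-based image editor (e.g. a FLUX/Qwen editor)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(dim * 2, dim)  # conditions on (image, instruction)

    def forward(self, image, instruction_emb):
        return self.net(torch.cat([image, instruction_emb], dim=-1))

def mllm_reward(edited, instruction_emb):
    """Placeholder for the MLLM judge. In UniWorld-R1 a multimodal LLM scores
    instruction compliance; here a toy similarity stands in for that score."""
    return torch.cosine_similarity(edited, instruction_emb, dim=-1)

def nft_step(editor, optimizer, image, instruction_emb, k=4):
    """One negative-aware finetuning step: rank k sampled edits by reward,
    pull the model toward the best sample and away from the worst."""
    with torch.no_grad():
        # Sample k candidate edits by perturbing the input (toy exploration).
        candidates = [editor(image + 0.1 * torch.randn_like(image),
                             instruction_emb) for _ in range(k)]
        rewards = torch.stack([mllm_reward(c, instruction_emb).mean()
                               for c in candidates])
        pos = candidates[rewards.argmax()]  # highest-reward edit
        neg = candidates[rewards.argmin()]  # lowest-reward edit

    out = editor(image, instruction_emb)
    # Regress toward the high-reward edit, contrast against the low-reward one.
    loss = ((out - pos) ** 2).mean() - 0.1 * ((out - neg) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

editor = ToyEditor()
opt = torch.optim.AdamW(editor.parameters(), lr=1e-4)
image = torch.randn(2, 16)        # toy "image" latents
instruction = torch.randn(2, 16)  # toy instruction embeddings
print(nft_step(editor, opt, image, instruction))
```

The point of the sketch is the shape of the loop, not its details: several candidate edits are sampled, an external judge ranks them, and the editor is updated from the contrast between well-rated and poorly-rated samples rather than from explicit likelihoods, which is what "training without likelihood estimation" refers to in the article.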