Better than Nano Banana at Chinese text and detail control: 兔展 & Peking University's UniWorld-V2 sets a new SOTA
36Kr · 2025-11-05 09:44
Core Insights
- The article introduces UniWorld-V2, a new image editing model that excels in detail control and in understanding Chinese-language prompts, outperforming existing models like Nano Banana [1][6][24]
- The model is developed by the UniWorld team from Rabbitpre (兔展) and Peking University, using a novel training framework called UniWorld-R1 that applies reinforcement learning to image editing [5][14]

Model Performance
- UniWorld-V2 achieved state-of-the-art (SOTA) results on industry benchmarks such as GEdit-Bench and ImgEdit, surpassing top closed-source models like OpenAI's GPT-Image-1 [5][19]
- On GEdit-Bench, UniWorld-V2 scored 7.83, well above GPT-Image-1's 7.53 and Gemini 2.0's 6.32 [19][21]
- The model also led ImgEdit with a score of 4.49, outperforming all known open-source and closed-source models [19]

Technical Innovations
- The core innovation of UniWorld-R1 is its use of reinforcement learning for image editing, addressing the overfitting and poor generalization seen in traditional supervised fine-tuning [14]
- UniWorld-R1 employs Diffusion Negative-aware Finetuning (DiffusionNFT) for efficient training and uses a multimodal large language model (MLLM) as a reward model, improving alignment with human intent [14][15]

User Experience and Functionality
- UniWorld-V2 demonstrates superior control in editing tasks, such as accurately rendering complex Chinese characters and executing precise spatial edits within user-defined areas [9][10][11]
- The model's ability to understand and apply lighting adjustments to scenes yields more cohesive and harmonious image output [12]

Generalization and Versatility
- The UniWorld-R1 framework shows strong generalization, improving the performance of other models such as FLUX.1-Kontext and Qwen-Image-Edit when applied to them [20][22]
- User preference studies indicate that UniWorld-FLUX.1-Kontext is favored for its instruction alignment and editing capability, despite slightly lower image quality than the stronger official versions [23]

Availability and Future Research
- The UniWorld-R1 framework, along with its paper, code, and models, has been released on GitHub and Hugging Face to support further research [24]
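The training recipe summarized above — a policy that proposes edits, scored by an MLLM acting as a reward model — can be sketched at a very high level. The following is a minimal illustrative loop with stub components: `EditPolicy`, `mllm_reward`, and the update rule are hypothetical placeholders, not UniWorld-R1's actual DiffusionNFT implementation, which differs substantially in detail.

```python
import random

random.seed(0)

def mllm_reward(instruction: str, edited: float) -> float:
    """Stub standing in for an MLLM reward model: scores how well an
    edit matches the instruction. Here: closeness to a fixed target."""
    target = 0.7  # pretend the instruction implies this value
    return 1.0 - abs(edited - target)

class EditPolicy:
    """Toy one-parameter 'editor': outputs its parameter plus noise."""
    def __init__(self):
        self.theta = 0.0

    def sample(self) -> float:
        return self.theta + random.gauss(0.0, 0.1)

    def update(self, sample: float, advantage: float, lr: float = 0.5):
        # Move theta toward samples the reward model preferred.
        self.theta += lr * advantage * (sample - self.theta)

policy = EditPolicy()
for step in range(200):
    # Sample a group of candidate edits for one instruction,
    # score them, and reinforce above-average candidates.
    edits = [policy.sample() for _ in range(8)]
    rewards = [mllm_reward("make it warmer", e) for e in edits]
    baseline = sum(rewards) / len(rewards)
    for e, r in zip(edits, rewards):
        policy.update(e, r - baseline)

print(round(policy.theta, 2))  # drifts toward the reward model's target
```

The point of the sketch is the data flow: no ground-truth edited image is needed, only a scorer, which is what lets RL-style post-training sidestep the overfitting of supervised fine-tuning on paired data.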
Google owns up to the strongest AI Photoshop! Now available to everyone, and the results are genuinely impressive
量子位· 2025-08-27 05:49
Core Viewpoint
- The article discusses the unveiling of Google's image editing model, originally known as "nano-banana," now officially recognized as Gemini 2.5 Flash Image, highlighting its advanced capabilities and potential impact on the image editing industry [1]

Group 1: Model Capabilities
- The Gemini 2.5 Flash Image model can be accessed for free on Gemini and Google AI Studio, with an API available at $0.039 per image [8]
- It showcases exceptional image editing abilities, allowing users to merge up to three images into new visuals and generate surreal art by seamlessly combining elements from different photos [12][13]
- The model can transform 2D images into 3D perspectives, providing a coherent view from multiple angles [22]
- It demonstrates image reasoning capabilities, such as recognizing structures and calculating angles, extending its functionality beyond traditional editing [25][26]

Group 2: User Engagement and Reactions
- Prior to the official announcement, users were already captivated by the model's performance, leading to extensive discussion and creative use cases on social media [16]
- Users have expressed concerns about the implications for traditional editing software like Photoshop, given the model's advanced features [21]
- The model's ability to understand and replicate light-and-shadow detail has been noted, with examples showing realistic shadow casting in generated images [35][37]

Group 3: Development and Launch Strategy
- The model gained popularity through its initial anonymous release on LMArena, a platform for AI model competitions, which contributed to its viral status [41][42]
- Speculation about its connection to Google was fueled by performance similarities to other Gemini-family models and a stealthy launch strategy reminiscent of DeepMind's earlier practices [45][46]
- Google explained the stealth launch as preparation for a global-scale release, indicating a strategic approach to market entry [48]
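As a back-of-envelope illustration of the API pricing quoted above, a trivial cost estimator at the article's stated rate of $0.039 per output image (the rate is the only figure taken from the source; the function itself is just arithmetic):

```python
# Rough cost estimate for image generation via the Gemini 2.5 Flash
# Image API, at the article's quoted rate of $0.039 per output image.
PRICE_PER_IMAGE_USD = 0.039

def batch_cost(num_images: int) -> float:
    """Total API cost in USD for generating `num_images` images."""
    return round(num_images * PRICE_PER_IMAGE_USD, 2)

print(batch_cost(100))   # 3.9
print(batch_cost(1000))  # 39.0
```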
Performance rivaling GPT-4o and Gemini 2 Flash: StepFun (阶跃星辰) open-sources the general-purpose image editing model Step1X-Edit
AI科技大本营· 2025-04-27 07:12
StepFun releases Step1X-Edit, an open-source image editing model and the first to deeply fuse an MLLM with a DiT.

Compiled by | Mengyidan
Produced by | AI科技大本营 (ID: rgznai100)

In the image editing field, open-source models are rapidly closing the gap with top closed-source models. Recently, StepFun officially released and open-sourced the large image editing model Step1X-Edit, which reaches SOTA among open-source systems while performing on par with closed-source models such as GPT-4o and Gemini 2 Flash.

(Figure: VIEScore for each sub-task in GEdit-Bench; all results evaluated by GPT-4o)

Step1X-Edit has 19B parameters (a 7B multimodal language model, MLLM, plus a 12B diffusion image Transformer, DiT) and offers three core capabilities: precise semantic parsing, identity-consistency preservation, and high-precision region-level control. The model supports 11 categories of high-frequency image editing tasks, including text replacement, style transfer, material transformation, and portrait retouching, and can flexibly handle complex editing instructions.

On the technical side, Step1X-Edit is the first in the open-source ecosystem to achieve deep fusion of multimodal language understanding with diffusion-based image generation. The model parses the reference image together with the user's editing instruction, extracts latent embeddings, and works with a diffusion image decoder to produce high-quality edits that match the intent ...
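The fusion path described above — the MLLM parses the reference image and instruction, emits latent embeddings, and the diffusion decoder conditions on them — can be sketched as a data-flow stub. Every class and method name below is a hypothetical placeholder for illustration, not Step1X-Edit's real API, and the "models" are toy arithmetic standing in for the 7B/12B components.

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    reference_image: list[float]  # stand-in for pixel data
    instruction: str

class StubMLLM:
    """Placeholder for the 7B multimodal LM: maps (image, instruction)
    to a latent embedding encoding the intended edit."""
    def encode(self, req: EditRequest) -> list[float]:
        # Toy 'embedding': image features mixed with an instruction hash.
        h = (hash(req.instruction) % 1000) / 1000.0
        return [p * 0.5 + h for p in req.reference_image]

class StubDiT:
    """Placeholder for the 12B diffusion Transformer decoder: denoises
    an output image conditioned on the MLLM's latent embedding."""
    def decode(self, embedding: list[float]) -> list[float]:
        # Toy 'denoising': a few damping passes over the embedding.
        x = embedding
        for _ in range(3):
            x = [0.9 * v for v in x]
        return x

def edit(req: EditRequest) -> list[float]:
    # The MLLM's understanding feeds the diffusion decoder directly,
    # mirroring the deep-fusion pipeline described above.
    return StubDiT().decode(StubMLLM().encode(req))

out = edit(EditRequest([0.2, 0.4, 0.6], "replace the sign text"))
print(len(out))  # same size as the input stand-in
```

The design point the stub captures is that the language model is in the generation loop as a conditioning encoder, rather than merely rewriting the prompt for a separate text-to-image model.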