ByteDance open-sources an image-editing breakthrough: 1/30 the training data, 1/13 the parameters, and a 9.19% performance gain
量子位 (QbitAI) · 2025-05-07 09:33

Core Viewpoint
- ByteDance has developed SuperEdit, a new image editing method that outperforms the current state-of-the-art (SOTA) by 9.19% while using only 1/30 of the training data and 1/13 of the model parameters [1].

Group 1: Methodology and Innovation
- The method requires no additional pre-training tasks or architectural modifications; instead, it relies on powerful multimodal models such as GPT-4o to rectify editing instructions [2].
- This approach tackles the noisy supervisory signals in existing image editing models by constructing more accurate editing instructions, which in turn improves editing outcomes [3][9].
- The data and model have been open-sourced on GitHub [4].

Group 2: Challenges in AI Image Editing
- AI models often misinterpret instructions; asked to change the color of a boy's tie, for example, a model may also unintentionally alter his skin tone or clothing [6].
- The team found that existing image editing datasets contain substantial noise because they are built with automated pipelines, producing mismatches between instructions and image pairs [10][11][12].

Group 3: Training and Supervision
- SuperEdit focuses on improving the quality of supervisory signals rather than scaling up parameters or pre-training compute [13].
- The team used GPT-4o to generate more accurate editing instructions by observing the actual differences between original and edited images (see the rectification sketch below) [17].
- A contrastive supervision mechanism ensures the model learns the subtle differences between correct and incorrect editing instructions, strengthening its ability to understand and execute commands [22][23].

Group 4: Performance Metrics
- SuperEdit performed strongly across multiple benchmarks: on the Real-Edit benchmark it achieved an overall accuracy of 69.7% and a score of 3.91, surpassing the previous SOTA method SmartEdit (58.3% accuracy, 3.59 score) [25][28].
- The model was trained with a triplet loss to distinguish correct from incorrect editing instructions (see the loss sketch below) [27].

Group 5: Future Directions
- The team plans to extend this data-first approach to more visual generation tasks and to explore combining it with larger models [31].
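
To make the rectification step in Group 3 concrete, here is a minimal sketch of how a multimodal model can rewrite a noisy instruction by comparing the original and edited images. The prompt wording, helper names, and the use of the OpenAI Python client are illustrative assumptions; only the general idea, "observe the image pair and correct the instruction," comes from the article.

```python
# Hypothetical sketch of instruction rectification with a multimodal model.
# The prompt text and helper names are assumptions, not the team's pipeline.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Read a local image and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{data}"

def rectify_instruction(original_path: str, edited_path: str,
                        noisy_instruction: str) -> str:
    """Ask GPT-4o to rewrite a noisy editing instruction so that it
    matches the actual visual difference between the two images."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "The second image was edited from the first. The original "
                    f"instruction was: '{noisy_instruction}'. Describe the actual "
                    "visual change, then rewrite the instruction so it precisely "
                    "matches that change. Return only the rewritten instruction."
                )},
                {"type": "image_url",
                 "image_url": {"url": encode_image(original_path)}},
                {"type": "image_url",
                 "image_url": {"url": encode_image(edited_path)}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Applied over an automatically constructed dataset, a pass like this replaces instructions that do not match their image pairs, which is the noise source identified in Group 2.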
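The contrastive supervision from Groups 3 and 4 can be illustrated with a standard triplet objective: the edit embedding should sit closer to the correct (rectified) instruction than to an incorrect one. The encoders, embedding dimension, and margin value below are placeholders; the article only states that a triplet loss separates correct from incorrect editing instructions.

```python
# Minimal sketch of triplet supervision over instruction/edit embeddings.
# Encoders, dimensions, and the margin are illustrative assumptions.
import torch
import torch.nn.functional as F

def triplet_instruction_loss(edit_emb: torch.Tensor,
                             pos_instr_emb: torch.Tensor,
                             neg_instr_emb: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """Pull the edit embedding toward the correct instruction and push it
    away from the incorrect one, using cosine distance."""
    d_pos = 1.0 - F.cosine_similarity(edit_emb, pos_instr_emb, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(edit_emb, neg_instr_emb, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random tensors standing in for model features.
edit = torch.randn(8, 512, requires_grad=True)  # edited-image features
good = torch.randn(8, 512)   # rectified (correct) instruction embeddings
bad = torch.randn(8, 512)    # mismatched (incorrect) instruction embeddings
loss = triplet_instruction_loss(edit, good, bad)
loss.backward()
```

The margin forces a minimum separation between the two distances, so the model cannot satisfy the loss by treating correct and incorrect instructions as interchangeable.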