Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent Settles Image-Editing Evaluation
机器之心·2025-10-24 06:26

Core Insights
- The article discusses the emergence of EdiVal-Agent, an automated, fine-grained evaluation framework for multi-turn image editing, which is becoming crucial for assessing multimodal models' understanding, generation, and reasoning capabilities [2][7].

Evaluation Methods
- Current mainstream evaluation methods fall into two categories:
1. Reference-based evaluation relies on paired reference images, which offer limited coverage and may inherit biases from the older models that produced them [6].
2. VLM-based evaluation uses vision-language models to score outputs against prompts, but these models struggle with spatial understanding, detail sensitivity, and aesthetic judgment, making their quality assessments unreliable [6].

EdiVal-Agent Overview
- EdiVal-Agent is an object-centric automated evaluation agent that recognizes each object in an image, understands editing semantics, and dynamically tracks changes across multi-turn editing [8][17].

Workflow of EdiVal-Agent
1. Object recognition: EdiVal-Agent first identifies all visible objects in an image and generates structured descriptions, creating an object pool for subsequent instruction generation and evaluation [17].
2. Instruction generation: It automatically generates multi-turn editing instructions covering nine editing types and six semantic categories, dynamically maintaining the object pool as edits accumulate [18][19].
3. Automated evaluation: EdiVal-Agent scores model performance along three dimensions: instruction following, content consistency, and visual quality, with a final composite score (EdiVal-O) computed as the geometric mean of the first two metrics [20][22].

Performance Metrics
- EdiVal-IF measures how accurately models follow instructions, while EdiVal-CC assesses the consistency of unedited content. EdiVal-VQ, which evaluates visual quality, is excluded from the final score due to its subjective nature [25][28].
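The composite score described above can be sketched as follows. This is a minimal illustration, assuming both metrics lie on a 0–1 scale; the function name `edival_o` is ours, and the article specifies only that the composite is the geometric mean of EdiVal-IF and EdiVal-CC.

```python
import math

def edival_o(edival_if: float, edival_cc: float) -> float:
    """Composite score: geometric mean of instruction following (EdiVal-IF)
    and content consistency (EdiVal-CC). EdiVal-VQ (visual quality) is
    deliberately excluded due to its subjectivity."""
    return math.sqrt(edival_if * edival_cc)

# The geometric mean penalizes imbalance: a model that follows instructions
# well but damages unedited content cannot compensate with one high score.
print(edival_o(0.9, 0.4))  # 0.6, below the arithmetic mean of 0.65
print(edival_o(0.65, 0.65))  # 0.65, balanced scores are not penalized
```

This choice of mean rewards models that are strong on both axes at once, which matches the framework's emphasis on editing faithfully without collateral changes.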
Human Agreement Study
- EdiVal-Agent's evaluations agree with human judgments at an average rate of 81.3%, significantly outperforming traditional methods [31][32].

Model Comparison
- EdiVal-Agent compared 13 representative models, revealing that Seedream 4.0 excels at instruction following, while Nano Banana balances speed and quality most effectively. GPT-Image-1 ranks third, prioritizing aesthetics at the expense of consistency [36][37].