A knowledge-type perspective on evaluating the reasoning abilities of image editing models: all models perform poorly at "procedural reasoning"
量子位·2025-06-13 05:07

Core Viewpoint

The article discusses the development of KRIS-Bench, a benchmark for evaluating the reasoning capabilities of image editing models, modeled on the structured way humans acquire knowledge [2][3][16].

Group 1: KRIS-Bench Overview

- KRIS-Bench is a collaborative effort involving multiple prestigious institutions, aimed at assessing AI's reasoning abilities in image editing [2].
- The benchmark categorizes knowledge into three types: Factual Knowledge, Conceptual Knowledge, and Procedural Knowledge, so that models face progressively more complex editing challenges [4][8].
- It features 7 reasoning dimensions and 22 typical editing tasks, ranging from basic to advanced difficulty levels [6].

Group 2: Evaluation Metrics

- KRIS-Bench introduces a four-dimensional automated evaluation system that scores editing outputs on Visual Consistency, Visual Quality, Instruction Following, and Knowledge Plausibility [10][11][13].
- The evaluation set comprises 1,267 image-instruction pairs, meticulously curated by experts to ensure diverse data sources [12].

Group 3: Model Performance Insights

- The benchmark tests 10 models, 3 closed-source and 7 open-source, revealing performance gaps particularly in procedural reasoning and natural science tasks [14][16].
- Closed-source models such as GPT-Image-1 lead in performance, while open-source models such as BAGEL-Think show improved knowledge plausibility through enhanced reasoning processes [17].
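To make the four-dimensional evaluation concrete, here is a minimal sketch of how per-dimension scores might be combined into an overall score. The article only names the four dimensions; the 0-5 scale, the equal weighting, and the function name `overall_score` are illustrative assumptions, not the benchmark's actual formula.

```python
from statistics import mean

# The four automated evaluation dimensions named in the article.
DIMENSIONS = (
    "visual_consistency",
    "visual_quality",
    "instruction_following",
    "knowledge_plausibility",
)

def overall_score(scores: dict) -> float:
    """Average the four per-dimension scores (assumed 0-5 scale,
    equal weights -- an illustrative choice, not KRIS-Bench's)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)

# Example: a model that edits cleanly but gets the underlying
# knowledge wrong is penalized on knowledge_plausibility.
print(overall_score({
    "visual_consistency": 4.0,
    "visual_quality": 4.5,
    "instruction_following": 3.5,
    "knowledge_plausibility": 2.0,
}))  # 3.5
```

Separating knowledge plausibility from visual quality, as the benchmark does, lets an evaluation distinguish edits that merely look good from edits that are also factually or procedurally correct.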