GenExam

Nano Banana fails, and open-source models can barely score a single point! Shanghai AI Lab's new benchmark targets the pain points of text-to-image models
量子位· 2025-09-24 03:32
Core Viewpoint
- The article introduces GenExam, a new benchmark for evaluating how well text-to-image models generate accurate, contextually relevant diagrams across multiple disciplines, and highlights how far even the top models currently fall short in this area [2][7][23].

Group 1: GenExam Overview
- GenExam is the first multidisciplinary text-to-image examination benchmark, developed through a collaboration of several prestigious institutions, and aims to redefine how the capabilities of text-to-image models are measured [2][4][8].
- The benchmark comprises 1,000 carefully selected questions across 10 disciplines, focusing specifically on diagram-drawing tasks, and is designed to assess models' understanding, reasoning, and drawing abilities [4][8][10].

Group 2: Evaluation Results
- Results on GenExam reveal that even top models such as GPT-4o achieved only 12.1% accuracy under strict grading, while open-source models scored close to zero [5][19].
- The evaluation criteria cover semantic correctness and visual reasonableness, with a dual scoring system that supports both strict and lenient grading [14][19].

Group 3: Model Performance Analysis
- A total of 18 mainstream models were tested, revealing significant performance gaps between closed-source and open-source models, particularly in semantic correctness and visual accuracy [16][17].
- Even the best-performing closed-source model, GPT-Image-1, reached a strict score of only 12.1%, indicating that while models can generate basic structures, they often miss critical details [19][22].

Group 4: Implications for Future Development
- The findings from GenExam suggest that current models must improve in knowledge integration, logical reasoning, and precise generation to evolve from general image-generation tools into specialized domain assistants [23][24].
- The benchmark sets a new goal for models to focus on generating correct rather than merely aesthetically pleasing images, marking a significant shift in the evaluation of AI capabilities [23][24].
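The gap between strict and lenient grading described above can be illustrated with a minimal sketch. The article does not publish GenExam's exact rubric or aggregation rules, so the `Criterion` structure and scoring functions below are purely illustrative assumptions: strict grading awards credit only when every semantic criterion is satisfied, while lenient grading gives partial credit per criterion.

```python
from dataclasses import dataclass

# Hypothetical rubric item; GenExam's real rubric fields are not
# described in this summary, so these names are assumptions.
@dataclass
class Criterion:
    description: str
    passed: bool

def strict_score(criteria: list[Criterion]) -> float:
    # Strict grading: full credit only if every criterion is satisfied.
    return 1.0 if criteria and all(c.passed for c in criteria) else 0.0

def lenient_score(criteria: list[Criterion]) -> float:
    # Lenient grading: partial credit proportional to criteria satisfied.
    if not criteria:
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)

# Example: a diagram that captures the basic structure but misses one detail,
# mirroring the article's finding that models "often miss critical details".
rubric = [
    Criterion("axes labeled correctly", True),
    Criterion("curve shape matches the concept", True),
    Criterion("critical point annotated", False),
]
print(strict_score(rubric))   # one missed detail zeroes the strict score
print(lenient_score(rubric))  # partial credit under lenient grading
```

Under a scheme like this, a single missing detail collapses the strict score to zero even when most of the diagram is right, which is consistent with top models scoring only 12.1% strictly while producing plausible-looking basic structures.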