To Prevent AI "Question Grinding," the Latest Covers of Nature and Other Top Journals Have Been Turned into a Dataset to Test Models' Scientific Reasoning
36Kr · 2025-08-26 01:25
Core Insights
- The emergence of advanced multimodal models such as GPT-4o and Gemini 2.5 Pro has raised concerns about AI evaluation, as existing "question banks" become outdated [1][17]
- A new dynamic benchmark, MAC (Multimodal Academic Cover), has been proposed to continuously assess AI on the latest scientific content [1][20]

Group 1: Benchmark Development
- The MAC benchmark draws on the latest covers of 188 top journals, including Nature, Science, and Cell, to build a testing dataset from over 25,000 image-text pairs [3][20]
- The benchmark aims to evaluate whether multimodal models can understand the deep connections between artistic visual elements and scientific concepts [3][20]

Group 2: Testing Methodology
- MAC includes two testing tasks designed to prevent AI from relying on superficial visual features: selecting the corresponding text for a journal cover, and matching images to cover stories [6][14]
- The design incorporates "semantic traps" to ensure that only models with a true understanding of the scientific concepts can select the correct answers [6][14]

Group 3: Model Performance
- The top-performing model, Step-3, achieved an accuracy of only 79.1% on the MAC benchmark, a significant gap from its near-perfect performance on other benchmarks [4][16]
- The open-source model Qwen2.5-VL-7B scored just 56.8%, exposing the limitations of current AI models when faced with the latest scientific content [4][16]

Group 4: Continuous Challenge Mechanism
- MAC employs a dual dynamic mechanism to stay challenging: dynamic data that evolves with scientific knowledge, and dynamic problem construction that uses advanced embedding models to create more sophisticated semantic traps [20][22][23]
- This approach keeps the benchmark relevant and challenging as both scientific knowledge and AI capabilities advance [20][22][23]

Group 5: Future Directions
- The research team plans to expand the MAC benchmark to more scientific journals and other forms of dynamic scientific content, such as conference papers and science news [23]
- The benchmark will undergo annual updates to keep pace with rapid advances in AI, ensuring it remains a relevant tool for evaluating AI capabilities [23]
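The article does not detail how the embedding-based "semantic traps" are built, but the idea of mining hard negatives in embedding space can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names are hypothetical, and toy 3-D vectors stand in for embeddings from a real text-embedding model.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mine_semantic_traps(correct_vec, candidate_vecs, k=3):
    """Hypothetical trap miner: return indices of the k candidate texts
    closest to the correct answer in embedding space. These near-misses
    are confusable distractors that superficial matching cannot reject."""
    sims = [cosine_sim(correct_vec, v) for v in candidate_vecs]
    order = sorted(range(len(candidate_vecs)), key=lambda i: -sims[i])
    return order[:k]


# Toy embeddings (not from a real model) for the correct cover story
# and a pool of candidate distractor texts.
correct = np.array([1.0, 0.2, 0.0])
pool = [
    np.array([0.9, 0.3, 0.1]),   # very similar topic: a strong trap
    np.array([0.0, 1.0, 0.0]),   # unrelated topic: an easy reject
    np.array([0.8, 0.1, 0.3]),   # similar topic: another trap
    np.array([-1.0, 0.0, 0.2]),  # opposite meaning: an easy reject
]
traps = mine_semantic_traps(correct, pool, k=2)
print(traps)  # -> [0, 2]: the two most confusable distractors
```

In a real pipeline the pool would be embedded cover stories from other journals, and the selected traps would join the correct text as multiple-choice options; a stronger embedding model yields subtler traps, which is what makes the benchmark's difficulty scale with AI progress.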