Scientific Concept Understanding

To keep AI from gaming the test, the latest covers of Nature and other top journals have been turned into a dataset that probes models' scientific reasoning
36Kr · 2025-08-26 01:25
Core Insights
- The emergence of advanced multimodal models such as GPT-4o and Gemini 2.5 Pro has raised concerns about how AI capabilities are evaluated, as existing "question banks" become outdated [1][17]
- A new dynamic benchmark, MAC (Multimodal Academic Cover), has been proposed to continuously assess AI with the latest scientific content [1][20]

Group 1: Benchmark Development
- The MAC benchmark draws on the latest covers of 188 top journals, including Nature, Science, and Cell, to build a test set of over 25,000 image-text pairs [3][20]
- The benchmark evaluates whether multimodal models can grasp the deep connections between artistic visual elements and scientific concepts [3][20]

Group 2: Testing Methodology
- The MAC benchmark includes two tasks designed to keep AI from relying on superficial visual features: selecting the corresponding text for a journal cover, and matching an image to a cover story [6][14]
- The design incorporates "semantic traps" so that only models with a genuine understanding of the scientific concepts can select the correct answers [6][14] (a sketch of how such traps can be built appears after this summary)

Group 3: Model Performance
- The top-performing model, Step-3, reached only 79.1% accuracy on the MAC benchmark, a marked gap from its near-perfect scores on other benchmarks [4][16]
- The open-source model Qwen2.5-VL-7B scored just 56.8%, underscoring the limits of current AI models when confronted with the latest scientific content [4][16]

Group 4: Continuous Challenge Mechanism
- The MAC benchmark employs a dual dynamic mechanism to stay challenging: dynamic data that evolves with scientific knowledge, and dynamic problem construction that uses advanced embedding models to build more sophisticated semantic traps [20][22][23]
- This approach keeps the benchmark relevant and challenging as both scientific knowledge and AI capabilities advance [20][22][23]

Group 5: Future Directions
- The research team plans to expand the MAC benchmark to more scientific journals and other forms of dynamic scientific content, such as conference papers and science news [23]
- The benchmark will be updated annually to keep pace with rapid advances in AI, ensuring it remains a relevant tool for evaluating AI capabilities [23]
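The article does not give the exact trap-construction pipeline, but the idea of embedding-based "semantic traps" can be illustrated with a minimal sketch: for each cover's true story, the hardest distractors are the candidate stories whose embeddings are most similar to it, so surface keyword matching is not enough. The embedding model, function names, and example stories below are assumptions for illustration, not the MAC implementation.

```python
# Minimal sketch of embedding-based "semantic trap" construction for a
# cover -> story multiple-choice item. Assumes a generic sentence-embedding
# model (sentence-transformers); the MAC pipeline may use a different
# embedding model and selection rule.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_choices(correct_story: str, candidate_stories: list[str], k: int = 3):
    """Return the correct story plus the k most semantically similar
    distractors, so the item cannot be solved by superficial matching."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Embed the correct story and all candidate distractors.
    target = model.encode([correct_story], normalize_embeddings=True)[0]
    pool = model.encode(candidate_stories, normalize_embeddings=True)
    # Cosine similarity; embeddings are normalized, so a dot product suffices.
    sims = pool @ target
    # Hardest distractors = most similar candidate stories.
    hard_idx = np.argsort(-sims)[:k]
    distractors = [candidate_stories[i] for i in hard_idx]
    return [correct_story] + distractors

# Usage with placeholder cover stories (hypothetical examples).
choices = build_choices(
    "CRISPR base editing corrects a disease mutation in vivo",
    [
        "A gene-editing therapy restores hearing in mice",
        "Machine learning predicts protein folding at scale",
        "A new telescope maps the early universe",
        "Prime editing repairs mutations without double-strand breaks",
    ],
)
print(choices)
```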
To keep AI from gaming the test, the latest covers of Nature and other top journals have been turned into a dataset that probes models' scientific reasoning | Shanghai Jiao Tong University
量子位 (QbitAI) · 2025-08-25 15:47
Core Viewpoint
- The article discusses the development of the MAC (Multimodal Academic Cover) benchmark, which aims to evaluate the true capabilities of advanced AI models such as GPT-4o and Gemini 2.5 Pro by testing them on the latest scientific content, addressing the problem of outdated "question banks" in AI evaluation [1][5].

Group 1: Benchmark Development
- The MAC benchmark uses the latest covers of 188 top journals, including Nature, Science, and Cell, to build a test set of over 25,000 image-text pairs, ensuring that models are evaluated on the most current and complex scientific concepts [3][4].
- The research team designed two tasks, "selecting text from images" and "selecting images from text," to assess whether models understand the deep connections between visual elements and scientific concepts [17][18].

Group 2: Testing Results
- Even top models such as Step-3 reached only 79.1% accuracy on the latest scientific content, a marked drop from their near-perfect results on other benchmarks [4][19].
- Models such as GPT-5-thinking and Gemini 2.5 Pro, while proficient at visual recognition, still struggle with deep reasoning tasks that require cross-modal scientific understanding [19].

Group 3: Dynamic Benchmarking Mechanism
- The MAC benchmark takes a dynamic approach, continuously updating the dataset and questions so that the challenge level keeps pace with evolving scientific knowledge [24][26].
- A comparison experiment showed that all models performed worse on the latest data (MAC-2025) than on older data (MAC-Old), demonstrating that the natural evolution of scientific knowledge provides an ongoing challenge for AI models [26].

Group 4: DAD Methodology
- The DAD (Divide and Analyze) method was proposed to improve AI performance by splitting the reasoning process into two phases, a detailed visual description followed by high-level analysis, emulating how human experts think [21][22] (a sketch of this two-phase prompting idea follows this summary).
- This two-step approach significantly improved the accuracy of multiple models, demonstrating the value of extending reasoning time in multimodal scientific understanding tasks [22][23].

Group 5: Future Prospects
- The MAC benchmark is expected to grow into a more comprehensive evaluation platform, with plans to include more scientific journals and dynamic scientific content such as conference papers and science news [28].
- As AI capabilities approach human levels, the MAC benchmark is intended to serve as a "touchstone" for mapping the boundaries of AI capability and the path toward true intelligence [28].
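To make the "describe first, then analyze" structure of DAD concrete, here is a minimal sketch. The `vlm_call` helper, the prompts, and the option format are assumptions standing in for whatever multimodal-model interface and prompt wording the paper actually uses.

```python
# Minimal sketch of the two-phase "describe, then analyze" idea behind DAD
# (Divide and Analyze). `vlm_call` is a hypothetical callable standing in for
# a multimodal-model API; the paper's actual prompts and pipeline may differ.
from typing import Callable

def dad_answer(
    vlm_call: Callable[[str, str], str],  # (prompt, image_path) -> model reply
    cover_image: str,
    choices: list[str],
) -> str:
    # Phase 1: ask only for a detailed, literal description of the cover art.
    description = vlm_call(
        "Describe this journal cover in detail: objects, composition, "
        "colors, and any text or symbols. Do not interpret it yet.",
        cover_image,
    )
    # Phase 2: reason over the description plus the candidate cover stories,
    # mimicking how a human expert first observes and then interprets.
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(choices))
    analysis_prompt = (
        "A journal cover was described as follows:\n"
        f"{description}\n\n"
        "Which of these cover stories does the artwork most likely "
        f"illustrate? Answer with the option number only.\n{options}"
    )
    return vlm_call(analysis_prompt, cover_image)
```

Splitting perception from interpretation in this way is one plausible reading of why the reported two-step prompting improves accuracy: the second call reasons over an explicit textual description rather than having to extract and interpret the visual evidence in a single pass.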