Core Viewpoint
- The article discusses the MAC (Multimodal Academic Cover) benchmark, which evaluates the true capabilities of advanced AI models such as GPT-4o and Gemini 2.5 Pro against the newest scientific content, addressing the problem of outdated "question banks" in AI assessment [1][5].

Group 1: Benchmark Development
- The MAC benchmark draws on the latest covers of 188 top journals, including Nature, Science, and Cell, building its test set from over 25,000 image-text pairs so that models are evaluated on the most current and complex scientific concepts [3][4].
- The research team designed two tasks, "select the text for an image" and "select the image for a text," to probe whether models grasp the deep connections between visual elements and scientific concepts [17][18] (see the first sketch following this summary).

Group 2: Testing Results
- Even top models such as Step-3 reached only 79.1% accuracy on the latest scientific content, a marked drop from their near-perfect scores on other benchmarks [4][19].
- Models such as GPT-5-thinking and Gemini 2.5 Pro, though proficient at visual recognition, still struggle with deep reasoning tasks that require cross-modal scientific understanding [19].

Group 3: Dynamic Benchmarking Mechanism
- The MAC benchmark adopts a dynamic testing approach, continuously refreshing the dataset and questions so the challenge keeps pace with evolving scientific knowledge [24][26].
- In a comparison experiment, every model performed worse on the latest data (MAC-2025) than on older data (MAC-Old), showing that the natural evolution of scientific knowledge supplies an ongoing challenge for AI models [26] (see the date-split sketch following this summary).

Group 4: DAD Methodology
- The DAD (Divide and Analyze) method improves AI performance by structuring reasoning into two phases: a detailed visual description followed by high-level analysis, mimicking how human experts think [21][22] (see the two-phase sketch following this summary).
- This two-step approach significantly improved the accuracy of multiple models, demonstrating that extending reasoning time helps in multimodal scientific understanding tasks [22][23].

Group 5: Future Prospects
- The MAC benchmark is expected to grow into a more comprehensive evaluation platform, with plans to add more scientific journals and dynamic scientific content such as conference papers and news [28].
- As AI capabilities approach human levels, MAC is positioned as a "touchstone" for mapping the boundaries of AI capability and the path toward true intelligence [28].
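The article names the two tasks but not their exact format; the sketch below shows one plausible way an "image-to-text" multiple-choice item could be assembled and scored. `CoverItem`, `build_image_to_text_item`, and `query_model` are hypothetical names introduced here for illustration, not the MAC authors' code.

```python
import random
from dataclasses import dataclass

@dataclass
class CoverItem:
    image_path: str  # a journal cover image
    caption: str     # the cover story text that matches it

def build_image_to_text_item(target: CoverItem, pool: list[CoverItem],
                             n_options: int = 4):
    """Pair one cover with its true caption plus distractor captions drawn
    from other covers; shuffle so the answer position carries no signal."""
    distractors = random.sample(
        [c.caption for c in pool if c is not target], n_options - 1)
    options = distractors + [target.caption]
    random.shuffle(options)
    return target.image_path, options, options.index(target.caption)

def accuracy(items: list[CoverItem], query_model) -> float:
    """query_model(image_path, options) -> chosen index; a placeholder for
    whatever multimodal model is under test."""
    correct = 0
    for item in items:
        image, options, answer = build_image_to_text_item(item, items)
        correct += int(query_model(image, options) == answer)
    return correct / len(items)
```

The "text-to-image" task would be the mirror image of this: one caption paired with several candidate covers.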
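The dynamic mechanism is described only as a periodic refresh. Assuming each item records a publication date, a comparison split like MAC-Old versus MAC-2025 could be produced as below; the `published` field and the cutoff date are illustrative assumptions, not details from the article.

```python
from datetime import date

def split_by_date(items, cutoff: date = date(2025, 1, 1)):
    """Covers published on or after the cutoff form the fresh split
    (MAC-2025 in the article's terms); earlier covers form MAC-Old.
    The cutoff here is illustrative only."""
    old = [i for i in items if i.published < cutoff]
    new = [i for i in items if i.published >= cutoff]
    return old, new
```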
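A minimal sketch of the two-phase DAD idea as the article describes it: first elicit a purely visual description, then reason over that description. `call_vlm` stands in for any vision-language model API, and the prompt wording is an assumption, not the authors' published prompts.

```python
DESCRIBE_PROMPT = (
    "Describe this journal cover in detail: every object, color, layout "
    "choice, and piece of visible text. Do not interpret its meaning yet."
)

ANALYZE_PROMPT = (
    "Using the detailed description below, reason about which scientific "
    "concept the cover illustrates, then choose the best-matching option.\n\n"
    "Description:\n{description}\n\nOptions:\n{options}"
)

def dad_answer(image, options, call_vlm):
    """call_vlm(prompt, image=None) -> str is a placeholder for any
    multimodal chat call."""
    # Phase 1 ("divide"): low-level visual description of the cover.
    description = call_vlm(DESCRIBE_PROMPT, image=image)
    # Phase 2 ("analyze"): high-level scientific reasoning over that
    # description; whether the image is re-supplied here is a design choice.
    numbered = "\n".join(f"{i}. {o}" for i, o in enumerate(options))
    return call_vlm(ANALYZE_PROMPT.format(description=description,
                                          options=numbered))
```

Separating description from analysis spends more inference-time reasoning on the hard step, which matches the article's claim that extending reasoning time helps multimodal scientific understanding.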
To keep AI from grinding old question banks, the latest covers of Nature and other top journals have been turned into a dataset that tests models' scientific reasoning | Shanghai Jiao Tong University
量子位 (QbitAI) · 2025-08-25 15:47