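ICML 2025 | Tsinghua Medical-Engineering Platform Proposes MultiCogEval, a "Full-Cycle" Framework for Evaluating the Medical Capabilities of Large Models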

Core Viewpoint
- The rapid development of large language models (LLMs) is significantly reshaping the healthcare industry, and these models have become a new battleground for advanced technology [2][3].

Group 1: Medical Language Models and Their Capabilities
- LLMs have strong text understanding and generation capabilities. They can read medical literature, interpret medical records, and generate preliminary diagnostic suggestions from patient statements, helping doctors improve diagnostic accuracy and efficiency [2][3].
- Although these models exceed 90% accuracy on medical question-answering benchmarks such as MedQA, their performance in real clinical settings remains suboptimal, a "high score but low capability" problem [4][5].

Group 2: MultiCogEval Framework
- MultiCogEval was introduced to evaluate LLMs across different cognitive levels, addressing the gap between mastery of medical knowledge and clinical problem-solving ability [5][6][10].
- The framework assesses LLMs' clinical abilities at three cognitive levels: basic knowledge mastery, comprehensive knowledge application, and scenario-based problem-solving [12][14].

Group 3: Evaluation Results
- LLMs perform well on low-level tasks (basic knowledge mastery), with accuracy exceeding 60%. Performance declines significantly on mid-level tasks (a drop of approximately 20%) and deteriorates further on high-level tasks, where the best model achieves only 19.4% accuracy in full-chain diagnosis [16][17].
- Fine-tuning in the medical domain effectively improves LLMs' low- and mid-level clinical capabilities, with gains of up to 15%, but has limited impact on high-level task performance [19][22].
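
To make the per-level comparison above concrete, here is a minimal sketch of how accuracy at each cognitive level, and the drop between levels, might be tallied. It is not the MultiCogEval implementation: the function names, the `low`/`mid`/`high` labels, and the toy records are all hypothetical, and the drop is computed as an absolute difference in accuracy, which is an assumption because the summary does not say how the "approximately 20%" figure is measured.

```python
from collections import defaultdict

# Hypothetical level labels, standing in for basic knowledge mastery,
# comprehensive knowledge application, and scenario-based problem-solving.
LEVELS = ("low", "mid", "high")


def accuracy_by_level(records):
    """Return {level: accuracy} from an iterable of (level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    # Skip levels that have no records to avoid dividing by zero.
    return {lvl: correct[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def drop_between(acc, upper, lower):
    """Absolute accuracy difference between two levels (assumed definition)."""
    return acc[upper] - acc[lower]


# Toy, made-up records for illustration only; these are NOT results from the study.
toy_records = [
    ("low", True), ("low", True), ("low", True), ("low", False),
    ("mid", True), ("mid", True), ("mid", False), ("mid", False),
    ("high", True), ("high", False), ("high", False), ("high", False),
]

acc = accuracy_by_level(toy_records)
for lvl in LEVELS:
    print(f"{lvl}: {acc[lvl]:.0%}")
print(f"low -> mid drop: {drop_between(acc, 'low', 'mid'):.0%}")
```

Reporting one number per level, rather than a single pooled score, is what lets a "high score but low capability" pattern show up: a pooled average could hide a sharp decline on the scenario-based tasks.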

Group 4: Future Implications
- MultiCogEval lays a solid foundation for future research and development of medical LLMs. It aims to promote more robust, reliable, and practical applications of AI in healthcare, ultimately contributing to the creation of "trustworthy AI doctors" [21][22].