Core Viewpoint
- The article discusses the launch of the GAPS (Grounding, Adequacy, Perturbation, Safety) evaluation framework for assessing the clinical capabilities of AI models in medicine, with lung cancer as the first disease covered [1][2][10].

Group 1: GAPS Framework Overview
- GAPS is presented as the world's first evaluation framework for AI clinical capabilities, developed by Ant Group together with a team of thoracic surgeons led by Professor Wang Jun of Peking University People's Hospital [1][4].
- The framework addresses the limitations of existing medical AI assessments, which often rely on exam-style questions and fail to evaluate clinical depth, completeness, robustness, and safety [2][7][10].
- GAPS includes a fully automated evaluation toolchain that generates questions, scoring criteria, and multi-dimensional scores; the initial release comprises 92 questions covering 1,691 clinical points in lung cancer [2][18].

Group 2: Evaluation Dimensions
- GAPS breaks clinical competence down into four orthogonal dimensions (a schematic scoring sketch follows this summary):
  1. Grounding (G): depth of understanding beyond mere facts, requiring evidence-based reasoning and decision-making [11].
  2. Adequacy (A): completeness of responses, assessed with a three-tier system of essential, conditional, and additional recommendations [12][31].
  3. Perturbation (P): robustness against real-world uncertainties, tested through a range of perturbation scenarios [13][34].
  4. Safety (S): a risk framework ensuring that medical AI does not produce harmful recommendations, with a strict penalty for catastrophic errors [16][36].

Group 3: Technological Innovations
- GAPS features an end-to-end automated evaluation pipeline that generates high-quality assessment sets from clinical guidelines, allowing rapid expansion into other medical specialties [17][19].
- The framework uses techniques such as evidence-based knowledge graphs and virtual patient generation to ensure that each question is grounded in reliable clinical evidence [20][23].

Group 4: Performance Insights
- Initial evaluations of leading AI models using GAPS revealed significant performance gaps, particularly in handling uncertainty and providing comprehensive clinical recommendations [29][31].
- While the models excelled at factual recall, they struggled with complex decision-making and reasoning under uncertainty, highlighting the need for further development of AI clinical capabilities [29][30].

Group 5: Future Implications
- The introduction of GAPS marks a paradigm shift in medical AI evaluation, from exam scores to clinical competence, and underscores the importance of evidence-grounded reasoning and uncertainty management in future AI development [39][40].
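To make the scoring structure in Group 2 concrete, here is a minimal Python sketch of how a GAPS-style scorecard could aggregate the four dimensions. Only the dimension names and the ideas of three-tier adequacy and a strict catastrophic-error penalty come from the article; the tier weights, the penalty rule, and the aggregation formula are illustrative assumptions, not the published rubric.

```python
from dataclasses import dataclass, field

# Hypothetical GAPS-style scorecard. Dimension names follow the article
# (Grounding, Adequacy, Perturbation, Safety); all numeric weights and the
# aggregation scheme below are assumptions for illustration only.

ADEQUACY_TIER_WEIGHTS = {
    "essential": 1.0,    # must-mention recommendations
    "conditional": 0.6,  # recommendations that apply under stated conditions
    "additional": 0.3,   # supplementary, nice-to-have advice
}

@dataclass
class GapsScorecard:
    grounding: float                  # 0-1: evidence-based reasoning depth
    adequacy_hits: dict = field(default_factory=dict)  # tier -> (covered, total)
    perturbation: float = 0.0         # 0-1: robustness under perturbed inputs
    safety_violations: int = 0        # count of harmful recommendations
    catastrophic_error: bool = False  # any error judged catastrophic

    def adequacy(self) -> float:
        """Weighted coverage across essential / conditional / additional tiers."""
        num = den = 0.0
        for tier, (covered, total) in self.adequacy_hits.items():
            w = ADEQUACY_TIER_WEIGHTS[tier]
            num += w * covered
            den += w * total
        return num / den if den else 0.0

    def overall(self) -> float:
        """Average the four dimensions; a catastrophic error zeroes the score."""
        if self.catastrophic_error:
            return 0.0  # strict penalty described in the article
        safety = max(0.0, 1.0 - 0.2 * self.safety_violations)
        dims = [self.grounding, self.adequacy(), self.perturbation, safety]
        return sum(dims) / len(dims)

# Example: an answer covering 8/10 essential and 3/5 conditional points,
# with one non-catastrophic safety violation.
card = GapsScorecard(
    grounding=0.82,
    adequacy_hits={"essential": (8, 10), "conditional": (3, 5), "additional": (1, 4)},
    perturbation=0.64,
    safety_violations=1,
)
print(round(card.overall(), 3))
```

The design point mirrored from the article is that safety acts as a hard gate: a single catastrophic error overrides strong grounding, adequacy, and perturbation scores rather than being averaged away.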
Source article: AI doctors finally have a hard yardstick! GAPS, the world's first disease-specific evidence-based evaluation framework, is released, jointly produced by Ant Group and the team of Academician Wang Jun at Peking University
QbitAI (量子位) · 2025-12-29 06:37