Core Insights
- The article covers a comprehensive evaluation of large language models (LLMs) on medical tasks, in which DeepSeek R1 achieved a 66% win rate, outperforming the other models in a clinical context [1][7][24].

Evaluation Framework
- A comprehensive assessment framework named MedHELM was developed, consisting of 35 benchmarks covering 22 subcategories of medical tasks [12][20].
- The task taxonomy was validated by 29 practicing clinicians from 14 medical specialties, ensuring its relevance to real-world clinical work [4][17].

Model Performance
- DeepSeek R1 led the evaluation with a 66% win rate and a macro-average score of 0.75, the strongest performance across the benchmarks [7][24].
- o3-mini and Claude 3.7 Sonnet followed, each with a 64% win rate, while Gemini 1.5 Pro ranked lowest at a 24% win rate [26][27].

Benchmark Testing
- The evaluation combined 17 existing benchmarks with 13 newly developed ones, 12 of which were built from real electronic health record data [20][21].
- Performance varied by task category: models scored higher on clinical note generation and patient communication tasks than on structured reasoning tasks [32].

Cost-Effectiveness Analysis
- A cost analysis based on token consumption during the evaluation showed that non-reasoning models such as GPT-4o mini cost less to run than reasoning models such as DeepSeek R1 [38][39].
- Claude 3.5 Sonnet and Claude 3.7 Sonnet offered good value, delivering strong performance at comparatively low cost [39]; a sketch of how such aggregate metrics and token-based costs can be computed follows below.
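The headline numbers above (win rate, macro-average score, per-run cost) are aggregate statistics. The sketch below shows one plausible way to compute them from per-benchmark scores and token counts; it is a minimal illustration with hypothetical models, benchmark names, and placeholder per-million-token prices, not the actual MedHELM aggregation code.

```python
from statistics import mean

# Placeholder per-benchmark scores (0-1), grouped by task category.
# Layout: model -> category -> {benchmark_name: score}. Not real MedHELM numbers.
scores = {
    "model_a": {"note_generation": {"bench1": 0.82, "bench2": 0.78},
                "structured_reasoning": {"bench3": 0.61}},
    "model_b": {"note_generation": {"bench1": 0.74, "bench2": 0.70},
                "structured_reasoning": {"bench3": 0.66}},
}

def macro_average(per_category):
    """Mean of per-category mean scores, so each task category is weighted equally."""
    return mean(mean(benchmarks.values()) for benchmarks in per_category.values())

def win_rate(model, all_scores):
    """Fraction of head-to-head (benchmark, opponent) comparisons that `model` wins."""
    mine = {b: s for cat in all_scores[model].values() for b, s in cat.items()}
    wins = total = 0
    for other, per_cat in all_scores.items():
        if other == model:
            continue
        for cat in per_cat.values():
            for bench, score in cat.items():
                if bench in mine:
                    total += 1
                    wins += mine[bench] > score  # ties count as losses in this sketch
    return wins / total if total else 0.0

def estimated_cost(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    """Token-based cost estimate: tokens consumed times the per-million-token price."""
    return input_tokens / 1e6 * usd_per_m_in + output_tokens / 1e6 * usd_per_m_out

print(f"model_a macro average: {macro_average(scores['model_a']):.2f}")
print(f"model_a win rate: {win_rate('model_a', scores):.0%}")
print(f"example run cost: ${estimated_cost(3_000_000, 500_000, 0.15, 0.60):.2f}")
```

Under this kind of scheme, reasoning models that emit long chains of thought consume more output tokens per task, which is why they can score higher yet cost more per evaluation run.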
Stanford's head-to-head evaluation of clinical medical AI: DeepSeek crushes both Google and OpenAI
量子位 (QbitAI) · 2025-06-03 06:21