No more grunt work for researchers: Stanford team of Chinese researchers builds the first general-purpose biomedical AI agent, automating everything from experiment design and data analysis to drug discovery
生物世界 · 2025-06-10 08:21
Editor: 王多鱼 | Layout: 水成文

Biomedical research is the foundation for advancing our understanding of health and disease, driving drug development, and improving clinical care. Yet in biomedical laboratories, researchers are routinely overwhelmed by complex experimental protocols, vast databases, a patchwork of analysis tools, and an ever-growing flood of literature. Biomedical research is increasingly constrained by these repetitive, fragmented workflows, which exhaust researchers, markedly slow the pace of discovery, and limit scientific innovation. This underscores the scientific community's urgent need for a fundamentally new approach: one that can effectively scale scientific expertise, streamline research workflows, and fully unlock the potential of biomedical research.

On June 2, 2025, a team led by Stanford University researchers 黄柯鑫, Serena Zhang, 王瀚宸, 屈元昊, 陆荧洲 and colleagues, together with Genentech, the Arc Institute, UC San Francisco, Princeton University, and other leading research institutions, released Biomni, a general-purpose biomedical AI agent that can autonomously complete complex research tasks spanning genetics, genomics, microbiology, pharmacology, clinical medicine, and other biomedical subfields. Biomni marks AI's leap in biomedical research from "tool user" to "autonomous decision-maker". By integrating fragmented research resources ...
Stanford's head-to-head evaluation of clinical medical AI: DeepSeek blows past Google and OpenAI
量子位 · 2025-06-03 06:21
Core Insights
- The article covers a comprehensive evaluation of large language models (LLMs) on medical tasks, in which DeepSeek R1 achieved a 66% win rate, outperforming every other model in a clinical context [1][7][24].

Evaluation Framework
- The evaluation introduces MedHELM, a comprehensive assessment framework consisting of 35 benchmark tests covering 22 subcategories of medical tasks [12][20].
- The task classification system was validated by 29 practicing clinicians from 14 medical specialties, ensuring its relevance to real-world clinical activities [4][17].

Model Performance
- DeepSeek R1 led the evaluation with a 66% win rate and a macro-average score of 0.75, the strongest overall performance across the benchmark suite (a sketch of how these two metrics can be computed appears after this digest) [7][24].
- o3-mini and Claude 3.7 Sonnet followed, each with a 64% win rate, while Gemini 1.5 Pro ranked lowest at 24% [26][27].

Benchmark Testing
- The suite combines 17 existing benchmarks with 13 newly developed tests, 12 of which are built on real electronic health record data [21][20].
- Performance varied by task category: models scored higher on clinical case generation and patient communication tasks than on structured reasoning tasks [32].

Cost-Effectiveness Analysis
- A cost analysis based on token consumption during the evaluation showed that non-reasoning models such as GPT-4o mini cost far less to run than reasoning models such as DeepSeek R1 (see the cost sketch below) [38][39].
- Claude 3.5 Sonnet and Claude 3.7 Sonnet offered strong performance relative to their cost [39].
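MedHELM's exact aggregation procedure is not described in this digest, so the following is a minimal Python sketch under two assumed definitions: the macro-average score as the unweighted mean of per-benchmark scores, and the win rate as the fraction of head-to-head, per-benchmark comparisons a model wins (ties counted as half a win). The model names, benchmark names, and scores are placeholders, not MedHELM data.

```python
# Placeholder per-benchmark scores in [0, 1]; NOT real MedHELM results.
scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70, "bench_3": 0.75},
    "model_b": {"bench_1": 0.78, "bench_2": 0.72, "bench_3": 0.60},
    "model_c": {"bench_1": 0.60, "bench_2": 0.65, "bench_3": 0.55},
}

def macro_average(per_bench: dict) -> float:
    """Unweighted mean over benchmarks, so every benchmark counts equally."""
    return sum(per_bench.values()) / len(per_bench)

def win_rate(model: str, all_scores: dict) -> float:
    """Fraction of (opponent, benchmark) comparisons won; a tie counts as half a win."""
    wins, total = 0.0, 0
    for other, other_scores in all_scores.items():
        if other == model:
            continue
        for bench, score in all_scores[model].items():
            total += 1
            if score > other_scores[bench]:
                wins += 1.0
            elif score == other_scores[bench]:
                wins += 0.5
    return wins / total

for name in scores:
    print(f"{name}: macro={macro_average(scores[name]):.2f} "
          f"win_rate={win_rate(name, scores):.2f}")
```

Under these definitions a model can lead on win rate without leading on macro average (or vice versa), since the win rate rewards many narrow wins while the macro average rewards a few large ones.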
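Similarly, the digest says costs were estimated from token consumption during the evaluation. A minimal sketch of that kind of estimate follows: multiply prompt and completion token counts by per-million-token prices. Both the price table and the token counts below are hypothetical placeholders (real API pricing varies by provider and date); the point is only that reasoning models tend to emit far more output tokens, which dominates cost.

```python
# Hypothetical USD prices per million tokens; real pricing varies by provider and date.
PRICING = {
    "reasoning_model":     {"input": 3.00, "output": 15.00},
    "non_reasoning_model": {"input": 0.15, "output": 0.60},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of an evaluation run: token counts times per-million-token prices."""
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Reasoning models produce long chains of thought, inflating output-token counts.
print(f"reasoning:     ${run_cost('reasoning_model', 2_000_000, 8_000_000):,.2f}")
print(f"non-reasoning: ${run_cost('non_reasoning_model', 2_000_000, 1_000_000):,.2f}")
```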