The PLA General Hospital, together with Nanjing University, Jilin University, and other institutions, proposes SpineGPT, the first large model for spinal diagnosis and treatment
机器之心 · 2025-11-22 09:00
Core Insights
- The research, led by the PLA General Hospital in collaboration with top hospitals and universities, has produced the first large model specifically for spinal diagnosis, addressing a significant gap in AI-assisted clinical decision-making [2][3][10].

Group 1: Clinical Challenges and Solutions
- Spinal diseases affect 619 million people globally and are a major cause of disability, yet existing AI models face a "cognitive gap" in clinical decision-making due to a lack of level-aware, multimodal data [2][6].
- The study introduces a comprehensive solution: SpineMed-450K, the first large-scale, traceable spinal instruction dataset, and the SpineBench clinical evaluation benchmark [3][18].

Group 2: Model Performance and Evaluation
- The SpineGPT model, trained on SpineMed-450K, significantly outperforms leading open-source models, achieving an average score of 87.44% and surpassing models such as Qwen2.5-VL-72B and GLM-4.5V [25][26].
- The SpineBench evaluation highlighted the performance gap of existing models: Qwen2.5-VL-72B averaged only 79.88%, while the proprietary Gemini-2.5-Pro scored 89.23% [13][25].

Group 3: Data and Methodology
- The SpineMed-450K dataset comprises over 450,000 instruction instances sourced from textbooks, surgical guidelines, expert consensus, and de-identified real cases from 11 hospitals, ensuring diverse patient representation [14][16].
- Data generation followed a rigorous "clinician-in-the-loop" process, with clinicians involved in the drafting and revision stages to ensure high-quality instruction data [14][24].

Group 4: Clinical Relevance and Future Directions
- SpineBench serves as a clinically significant evaluation framework, assessing AI performance on fine-grained, anatomy-centered reasoning, which is crucial for practical applications [18][20].
- The research team plans to expand the dataset, train models with more than 7 billion parameters, and incorporate reinforcement learning techniques to further enhance model performance and establish clearer benchmarks [30].
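To make the dataset description above concrete, here is a minimal sketch of what a single level-aware, traceable instruction instance in a SpineMed-450K-style dataset might look like. All field names and values are illustrative assumptions for this sketch; the paper's actual schema is not given in this summary.

```python
import json

# Hypothetical instruction instance for a SpineMed-450K-style dataset.
# Every field name below is an assumption made for illustration only.
instance = {
    "id": "example-000001",
    "source_type": "textbook",   # textbook | guideline | consensus | case (traceable provenance)
    "modality": "MRI",           # multimodal: imaging modality the question refers to
    "spinal_level": "L4-L5",     # anatomy-centered: the vertebral level under discussion
    "instruction": "Describe the disc findings at the indicated level.",
    "response": "Example clinician-reviewed answer text.",
    "clinician_reviewed": True,  # clinician-in-the-loop drafting/revision stage
}

# Serialize one instance as a JSON line, as instruction datasets commonly do.
print(json.dumps(instance, ensure_ascii=False))
```

A record like this would let an evaluation harness such as SpineBench group questions by `spinal_level` and `modality` to score the fine-grained, anatomy-centered reasoning the article emphasizes.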