OpenAI Open-Sources HealthBench: 5,000 Realistic Conversations Developed Across 60 Countries

Core Insights
- OpenAI has released HealthBench, a new evaluation dataset focused on healthcare. It consists of 5,000 core test dialogues created by 262 doctors from 60 countries, which makes the dataset more difficult, more authentic, and richer [1]
- The performance of large models in healthcare has improved significantly: GPT-3.5 Turbo scores 16%, GPT-4o scores 32%, and o3 reaches 60% [1]
- Smaller models have also advanced markedly: GPT-4.1 nano outperforms GPT-4o while being 25 times cheaper [1]