OpenAI Test Says GPT-5 Rivals Experts

Core Insights
- OpenAI's GPT-5 model and Anthropic's Claude Opus 4.1 are reported to be approaching the quality of work produced by industry experts, according to a new benchmark called GDPval [1][2]
- GDPval evaluates how AI systems perform on economically valuable work, which OpenAI treats as a key measure of progress toward Artificial General Intelligence (AGI) [1]
- The test covers 44 occupations across the nine industries that contribute most to US GDP, including healthcare, finance, manufacturing, and government [1]

Group 1
- In the initial version, GDPval-v0, senior professionals compared AI-generated reports with reports written by human experts, and the results were averaged into a "win rate" for each AI model [2]
- GPT-5-high was rated as better than or on par with industry experts in 40.6% of cases, while Claude Opus 4.1 reached 49%, the stronger result of the two [2]
- OpenAI acknowledges that the current GDPval test assesses only a limited slice of professional work and plans to develop more comprehensive tests in the future [2]

Group 2
- OpenAI's Chief Economist, Aaron Chatterji, said the results suggest that professionals can save time by using AI models, freeing them to focus on more meaningful tasks [3]
- Tejal Patwardhan, the evaluation lead, expressed optimism about the progress shown on GDPval, noting that GPT-4o scored only 13.7% about 15 months ago, while GPT-5's score is nearly triple that [3]
- The trend of improving AI capabilities is expected to continue, increasing the potential for AI to assist with a range of professional tasks [3]
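
The win-rate arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming each expert comparison yields one of three verdicts ("win", "tie", "loss") and that wins and ties both count toward the reported rate. The function name `win_or_tie_rate` and the sample verdicts are hypothetical and are not taken from OpenAI's methodology; only the 40.6% and 13.7% figures come from the summary.

```python
from collections import Counter


def win_or_tie_rate(verdicts):
    """Share of comparisons where the model's output was judged
    better than ("win") or on par with ("tie") the expert's output."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Hypothetical verdicts, invented for illustration only.
sample_verdicts = ["win", "loss", "tie", "loss", "loss"]
rate = win_or_tie_rate(sample_verdicts)  # 2 of 5 comparisons

# Ratio of the two reported scores: GPT-5-high (40.6%) vs. GPT-4o (13.7%).
improvement = 40.6 / 13.7

print(f"sample win-or-tie rate: {rate:.1%}")
print(f"GPT-5-high vs. GPT-4o: {improvement:.2f}x")
```

The ratio comes out just under 3, which is consistent with the description of the score as having "nearly tripled".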