Which AI Is the Best Worker? OpenAI Ran the Test Itself, and First Place Did Not Go to Its Own Model
量子位 (QbitAI) · 2025-09-26 04:56
Core Insights
- OpenAI has introduced a new benchmark, GDPval, to measure the economic value of AI models on real-world tasks. It covers 44 occupations that together contribute $3 trillion annually to U.S. GDP [2][15]
- Claude Opus 4.1 was the best-performing model, with 47.6% of its outputs rated on par with or better than human expert work; GPT-5 followed with 38.8% [4][6]
- OpenAI's own models improved roughly linearly from one generation to the next, with notable gains in task accuracy and aesthetic quality [32][33]

Benchmark Overview
- GDPval focuses on the nine industries that each contribute more than 5% of U.S. GDP, and within them selects occupations that consist primarily of digital tasks [14]
- This yields 44 occupations in total. The industry experts recruited to design the tasks have an average of 14 years of experience [15][18]
- The tasks are based on real work products and take an expert an average of 7 hours to complete; some complex tasks take weeks [19]

Evaluation Methodology
- Outputs were graded through blind pairwise comparisons by human experts. OpenAI's automated grader agreed with the human expert ratings 66% of the time [26][27]
- Each task went through multiple rounds of human expert review to ensure quality and relevance [23][24]

Model Performance
- GPT-5 excels in accuracy on text-based tasks, while Claude is stronger at handling varied file formats and shows strong visual perception and design capabilities [33]
- OpenAI noted that pairing AI models with human oversight could complete tasks faster and at lower cost [35][36]

Limitations and Future Plans
- GDPval has limitations: the dataset is small, covering only 44 occupations, and it focuses on knowledge work, excluding physical labor [40]
- OpenAI plans to broaden GDPval's scope and make future iterations more realistic and interactive [41]
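
The two headline numbers in this summary, the share of model outputs rated at least comparable to expert work and the consistency rate between two sets of ratings, both reduce to simple counts over pairwise verdicts. The following is a minimal sketch of that arithmetic, not OpenAI's grading code: the verdict labels, function names, and sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical labels for the outcome of one blind pairwise comparison
# between a model output and a human expert's work product.
MODEL_WINS, TIE, HUMAN_WINS = "model", "tie", "human"


def win_or_tie_rate(verdicts):
    """Share of comparisons where the model output was rated
    as good as or better than the expert's."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts[MODEL_WINS] + counts[TIE]) / len(verdicts)


def agreement_rate(ratings_a, ratings_b):
    """Fraction of comparisons on which two graders gave the same verdict."""
    if not ratings_a or len(ratings_a) != len(ratings_b):
        raise ValueError("rating lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


# Made-up verdicts for six tasks (illustration only, not GDPval data).
human_grades = ["model", "human", "tie", "human", "model", "human"]
second_grades = ["model", "human", "human", "human", "tie", "human"]

print(f"win-or-tie rate: {win_or_tie_rate(human_grades):.1%}")
print(f"grader agreement: {agreement_rate(human_grades, second_grades):.1%}")
```

Counting ties together with wins is an assumption of this sketch; a stricter metric would count wins only and report a lower figure on the same verdicts.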