Core Viewpoint
- SuperGPQA, a new evaluation benchmark for large language models (LLMs), is introduced to address the limitations of existing benchmarks and provide a more comprehensive assessment of model capabilities [2][10][20].

Group 1: Limitations of Existing Benchmarks
- Traditional evaluation benchmarks such as MMLU and GPQA have become increasingly homogeneous, making it difficult to assess the true capabilities of models [1][8].
- These benchmarks typically cover fewer than 50 subjects and lack diversity and long-tail knowledge, which limits their effectiveness [8][10].
- Top models such as GPT-4o now exceed 90% accuracy on these traditional benchmarks, so they no longer meaningfully differentiate model performance [8][9].

Group 2: Introduction of SuperGPQA
- SuperGPQA, developed by ByteDance's Doubao model team in collaboration with the M-A-P open-source community, covers 285 graduate-level subjects and includes 26,529 specialized questions [3][10].
- The evaluation framework was built over six months with contributions from nearly 100 scholars and engineers, ensuring a high-quality assessment process [2][6].
- The benchmark uses a more challenging format, with an average of 9.67 options per question compared to the traditional 4-option format [10].

Group 3: Addressing Key Pain Points
- SuperGPQA directly targets three major pain points in model evaluation: incomplete subject coverage, questionable question quality, and a lack of diverse evaluation dimensions [5][6].
- The benchmark employs a rigorous data construction process that combines expert annotation, crowdsourced input, and collaborative validation with LLMs to ensure high-quality questions [6][11].
- Question difficulty is balanced across subjects, with 42.33% of questions requiring mathematical calculation or rigorous reasoning [12].

Group 4: Performance Insights
- In evaluations, even the strongest model, DeepSeek-R1, achieved only 61.82% accuracy on SuperGPQA, well below human graduate-level performance, which averages above 85% [4][20].
- The results indicate that while reasoning models dominate the leaderboard, their performance still lags behind human capabilities [17][20].
- The benchmark has been released publicly on HuggingFace and GitHub and quickly gained traction in the community; a minimal loading sketch follows this summary [7][19].

Group 5: Future Implications
- The development of SuperGPQA reflects ByteDance's commitment to enhancing model capabilities and addressing criticisms of its foundational work [22][24].
- The benchmark may shape the future landscape of LLM evaluation, pushing toward higher standards and more rigorous assessments [22][24].
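To make the evaluation setup concrete, below is a minimal sketch (not the official evaluation harness) that loads the benchmark from HuggingFace with the `datasets` library and scores a random-guess baseline. The dataset ID `m-a-p/SuperGPQA`, the split name, and the field names `options` and `answer_letter` are assumptions made for illustration; the released dataset card should be consulted for the actual schema.

```python
# Minimal sketch (not the official harness): load SuperGPQA from HuggingFace
# and score a random-guess baseline. The dataset ID, split, and the
# "options" / "answer_letter" field names are assumptions for illustration.
import random
import string

from datasets import load_dataset


def random_guess_accuracy(dataset_id: str = "m-a-p/SuperGPQA", seed: int = 0) -> float:
    ds = load_dataset(dataset_id, split="train")  # assumed split name
    rng = random.Random(seed)
    correct = 0
    for row in ds:
        n_options = len(row["options"])  # roughly 9.67 options per question on average
        # Options are assumed to be labeled "A", "B", "C", ... in order.
        guess = string.ascii_uppercase[rng.randrange(n_options)]
        correct += int(guess == row["answer_letter"])
    return correct / len(ds)


if __name__ == "__main__":
    print(f"Random-guess accuracy: {random_guess_accuracy():.2%}")
```

With an average of about 9.67 options per question, such a baseline should land near 10% rather than the 25% expected on a four-option test, which is part of what makes the reported 61.82% top score harder to reach than the same number on a traditional four-option benchmark.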
DeepSeek-R1 and o1 both struggle at the passing line! ByteDance open-sources a new knowledge-reasoning benchmark covering 285 subjects
QbitAI (量子位) · 2025-03-04 04:51