All large models score zero! Saining Xie leads a Chinese team's new competitive-programming benchmark, with problems updated daily to rule out memorization
量子位·2025-06-18 09:17

Core Viewpoint
- The recently released LiveCodeBench Pro benchmark shows that leading large language models (LLMs) still fall well short of human experts in competitive programming: every model scored 0% on the hardest problems [1][2][8].

Group 1: Benchmark Overview
- LiveCodeBench Pro is a live benchmark built from competitive programming problems drawn from IOI, Codeforces, and ICPC [3].
- The question bank is updated daily so that LLMs cannot simply memorize problems, keeping the evaluation environment challenging [4][15].
- The benchmark comprises 584 top-tier competition problems, categorized by cognitive focus and difficulty level, with problems selected automatically so that the difficulty profile follows a normal distribution [15][17] (a sketch of such sampling appears at the end of this summary).

Group 2: Model Performance
- The best-performing model achieved a pass rate of only 53% on medium-difficulty problems and 0% on hard problems [9][10].
- Across models, performance was strongest on knowledge-intensive and logic-intensive problems and weakest on observation-intensive problems [26][29].
- LLMs showed strong skills in precise implementation but fell short in algorithm design and complex case analysis [28][29].

Group 3: Testing Methodology
- The team categorized problems by their underlying algorithmic concepts and recorded each problem's official Codeforces difficulty rating [19].
- Each model's submissions were compared against human expert solutions; the results indicate that LLMs often failed to make effective use of the sample inputs provided in problem statements [30][32] (see the sample-input check sketched at the end of this summary).
- The team plans to release a completely new evaluation set every quarter to keep the testing environment relevant and challenging [38].

Group 4: Team Composition
- The LiveCodeBench Pro team includes several Olympiad medalists, a significant portion of whom are of Chinese descent [40].
- Key team members come from prestigious institutions and have previously interned at major tech companies, lending credibility and expertise to the project [41][44].
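The article does not publish the team's selection code, so as a loose illustration only, here is a minimal Python sketch of how a daily problem set whose difficulty clusters around a normal distribution might be drawn. The function name, rating range, and distribution parameters (mean 2200, stddev 400) are assumptions for illustration, not values from LiveCodeBench Pro.

```python
import math
import random

def sample_daily_set(problems, k=50, mean=2200, stddev=400, seed=None):
    """Draw k unique problems, weighting each by a Gaussian density over
    its Codeforces-style rating so the set clusters around `mean`.
    Hypothetical sketch; not the LiveCodeBench Pro selection code."""
    rng = random.Random(seed)
    weights = [math.exp(-((p["rating"] - mean) ** 2) / (2 * stddev ** 2))
               for p in problems]
    chosen = {}
    while len(chosen) < min(k, len(problems)):
        # choices() samples with replacement, so dedupe by problem id
        p = rng.choices(problems, weights=weights, k=1)[0]
        chosen[p["id"]] = p
    return list(chosen.values())

# Usage: a toy bank of problems tagged with an id and a difficulty rating
bank = [{"id": i, "rating": r} for i, r in enumerate(range(800, 3600, 5))]
daily = sample_daily_set(bank, k=50, seed=42)
```

Weighting by a Gaussian density rather than sampling ratings directly keeps the selection restricted to problems that actually exist in the bank while still concentrating the daily set around the target difficulty.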
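The finding that models often ignore the provided sample inputs points to a check human contestants perform routinely: run the candidate program on the statement's sample I/O before submitting. A minimal sketch of such a pre-submission check follows, assuming a compiled solution binary; the function name and harness are hypothetical, not the team's actual evaluation code.

```python
import subprocess

def passes_samples(cmd, samples, timeout=2.0):
    """samples: list of (stdin_text, expected_stdout) pairs taken from the
    problem statement. Returns True only if every sample matches, mirroring
    the manual check a human contestant does before submitting."""
    for stdin_text, expected in samples:
        try:
            out = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                 text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # sample exceeded the time limit
        if out.stdout.strip() != expected.strip():
            return False  # output differs from the statement's expected answer
    return True

# Usage: assumes a compiled solution binary ./solution exists
samples = [("3\n1 2 3\n", "6\n")]
ok = passes_samples(["./solution"], samples)
```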