Chain-of-Thought Methods

Saining Xie's team releases a new benchmark that stumps LLMs across the board: DeepSeek R1 and Gemini 2.5 Pro both score zero
机器之心· 2025-06-18 09:34
Core Insights
- The article highlights the substantial gap between current LLMs (large language models) and human expert-level performance in competitive programming [2][18].
- A new benchmark, LiveCodeBench Pro, was introduced to evaluate LLMs on high-quality programming problems drawn from top competitions [4][6].

Evaluation of LLMs
- LLMs have posted impressive code-generation results, surpassing the human average on some competitive-programming benchmarks [2][12].
- However, when evaluated without external tools, the best-performing model achieved a pass rate of only 53% on medium-difficulty problems and 0% on hard problems [12][18].

Benchmark Details
- LiveCodeBench Pro includes 584 high-quality problems from competitions such as Codeforces, ICPC, and IOI, with continuous updates to mitigate data contamination [6][10].
- Problems are categorized by algorithm type, and model performance is analyzed through their failed submissions [7][12] (a hypothetical aggregation sketch follows this summary).

Model Performance Analysis
- The analysis shows that LLMs perform well on implementation-heavy problems but struggle with complex algorithmic reasoning and edge-case analysis [17][18].
- LLMs excel on knowledge-intensive and logic-intensive problems, while observation-intensive problems and case analysis remain significant challenges [20][22][24].

Comparison with Human Performance
- LLMs make algorithmic-logic errors at a higher rate than humans, while committing fewer implementation-logic errors [27][30].
- Their inability to handle edge cases and their reliance on external tools for high scores underscore the limits of their reasoning capabilities [17][30].

Impact of Multiple Attempts
- Increasing the number of attempts (pass@k) significantly improves model performance, although the hardest problems remain unsolved; see the pass@k sketch after this summary [33][36].
- The performance gap between models with terminal access and those without indicates that tool usage plays a crucial role in boosting scores [34][36].

Reasoning Capability Comparison
- Enabling reasoning capabilities yields substantial performance gains, particularly in combinatorial mathematics and knowledge-intensive categories [38][41].
- The gains are limited in observation-intensive categories, raising questions about how effective current reasoning methods are in these areas [42].
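To make the per-category analysis concrete, here is a minimal sketch of how pass rates could be aggregated by problem tag from a list of submission records. The field names ("tags", "verdict") and tag labels are assumptions for illustration, not LiveCodeBench Pro's actual schema or pipeline.

```python
from collections import defaultdict


def pass_rate_by_tag(submissions):
    """Aggregate per-tag pass rates from submission records.

    Each record is assumed to look like
    {"tags": ["knowledge-heavy", ...], "verdict": "accepted" | ...}.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for sub in submissions:
        for tag in sub["tags"]:
            totals[tag] += 1
            passed[tag] += int(sub["verdict"] == "accepted")
    return {tag: passed[tag] / totals[tag] for tag in totals}


# Toy usage with made-up records:
subs = [
    {"tags": ["knowledge-heavy"], "verdict": "accepted"},
    {"tags": ["observation-heavy"], "verdict": "wrong_answer"},
]
print(pass_rate_by_tag(subs))  # {'knowledge-heavy': 1.0, 'observation-heavy': 0.0}
```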
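The article reports pass@k results but does not spell out the estimator. A minimal sketch, assuming the standard unbiased pass@k estimator popularized by the HumanEval evaluation (the probability that at least one of k draws from n generated samples, of which c are correct, passes):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples
    passes, given c correct solutions among n generated ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 2 correct out of 10 generated samples.
print(round(pass_at_k(n=10, c=2, k=1), 3))  # 0.2
print(round(pass_at_k(n=10, c=2, k=5), 3))  # 0.778
```

Averaging this quantity over all problems gives the benchmark-level pass@k; raising k improves the score, which matches the article's observation that more attempts help while the hardest problems still go unsolved.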