Generative Evaluation
Which large models hold up best on competition-level programming problems auto-generated via "evolution + stress testing"?
机器之心· 2025-10-27 08:44
Core Insights
- The article discusses the limitations of traditional algorithm benchmarks and introduces the UniCode framework, developed by Peking University and the General Artificial Intelligence Research Institute, to address these issues [2][18].

Group 1: UniCode Framework Overview
- UniCode automatically generates high-quality algorithm problems and contamination-resistant test cases, built around an evolutionary evaluation pipeline [2][5].
- The framework uses three complementary problem-generation strategies: single-problem extension, same-type fusion, and cross-type fusion, which increase the diversity and difficulty of the generated problems [5][7] (a rough generation sketch appears at the end of this summary).

Group 2: Testing Methodology
- A stress-test-driven test case synthesis process reaches 94.5% test-case accuracy, outperforming multiple baseline methods [7][8].
- Expected outputs are verified with brute-force reference solutions for small inputs, majority voting across candidate solutions for larger inputs, and LLM adjudication for ambiguous cases, keeping the assessment highly reliable [8][12] (see the verification sketch at the end of this summary).

Group 3: Performance Evaluation
- The framework produced a benchmark set of 492 high-quality problems covering 15 core algorithm tags, which was used to evaluate 19 leading large language models (LLMs) [9][11].
- The best-performing model, o4-mini, reached a pass rate of only 70.3%, underscoring the difficulty of the UniCode benchmark [9][11].

Group 4: Model Robustness and Generalization
- Most models performed similarly on original and shadow problems but dropped sharply on UniCode-generated problems, indicating that the framework probes genuine algorithmic capability rather than memorization [11][12].
- The average performance drop on new problems exceeded 30%, separating superficial robustness from true algorithm transfer ability [12][14].

Group 5: Benchmark Credibility
- UniCode's credibility was validated by aligning it with existing benchmarks: model scores show a high positive correlation with LiveCodeBench and a strong negative correlation with LiveCodeBenchPro [14][18] (an illustrative correlation check is sketched at the end of this summary).
- Because the framework can generate problems at scale, even a small per-problem error rate still yields more reliable rankings than smaller, manually curated error-free benchmarks [16][20].

Group 6: Conclusion
- UniCode turns the idea of generative evaluation into a practical engineering system, providing a repeatable and traceable toolchain for assessing code generation and algorithm generalization [18][22].
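The three generation strategies listed under Group 1 can be pictured as operators in an evolutionary loop. The sketch below is a minimal illustration under that assumption; Problem, llm_generate, and the strategy helpers are hypothetical placeholders, not UniCode's actual API, and a real system would additionally filter each child problem for quality.

```python
# Minimal sketch of an evolutionary problem-generation loop using the three
# strategies named in the article. Problem, llm_generate, and the strategy
# helpers are hypothetical placeholders, not UniCode's actual API.
import random
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    tags: list[str]            # algorithm tags, e.g. ["dp", "graph"]

def llm_generate(prompt: str) -> str:
    # Placeholder for a call to a problem-writing LLM; echoes the prompt so
    # the sketch runs without a model behind it.
    return f"[generated] {prompt}"

def single_problem_extension(p: Problem) -> Problem:
    # Extend one seed problem with an extra constraint or follow-up query.
    return Problem(llm_generate(f"Add a harder twist to:\n{p.statement}"), p.tags)

def same_type_fusion(a: Problem, b: Problem) -> Problem:
    # Fuse two problems that share an algorithm tag into one combined task.
    stmt = llm_generate(f"Combine these two {a.tags[0]} problems:\n{a.statement}\n---\n{b.statement}")
    return Problem(stmt, a.tags)

def cross_type_fusion(a: Problem, b: Problem) -> Problem:
    # Fuse problems from different algorithm families to force technique transfer.
    stmt = llm_generate(f"Write one problem requiring both ideas:\n{a.statement}\n---\n{b.statement}")
    return Problem(stmt, sorted(set(a.tags) | set(b.tags)))

def evolve(seed_pool: list[Problem], rounds: int) -> list[Problem]:
    # Assumes at least two seed problems in the pool.
    pool = list(seed_pool)
    for _ in range(rounds):
        strategy = random.choice(["extend", "same", "cross"])
        if strategy == "extend":
            child = single_problem_extension(random.choice(pool))
        elif strategy == "same":
            a = random.choice(pool)
            peers = [p for p in pool if p is not a and set(p.tags) & set(a.tags)]
            child = same_type_fusion(a, random.choice(peers)) if peers else single_problem_extension(a)
        else:
            a, b = random.sample(pool, 2)
            child = cross_type_fusion(a, b)
        pool.append(child)     # a real pipeline would quality-filter here
    return pool
```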
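The verification cascade under Group 2 (brute force for small inputs, majority voting for larger inputs, LLM adjudication for the rest) can be summarized roughly as follows; the function names and voting threshold are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the test-case verification cascade: ground-truth outputs for small
# inputs come from a brute-force reference solution, larger inputs fall back to
# majority voting across candidate solutions, and unresolved cases are flagged
# for LLM adjudication. All names here are illustrative assumptions.
from collections import Counter
from typing import Callable, Optional

Solver = Callable[[str], str]

def label_test_case(
    case_input: str,
    brute_force: Optional[Solver],     # slow but trusted reference, small inputs only
    candidates: list[Solver],          # independently generated candidate solutions
    is_small: bool,
    vote_threshold: float = 0.6,       # assumed threshold, not from the paper
) -> tuple[Optional[str], str]:
    """Return (expected_output, status) for one test case."""
    if is_small and brute_force is not None:
        # Small inputs: the brute-force solver's answer is taken as ground truth.
        return brute_force(case_input), "brute_force"

    # Large inputs: run all candidate solutions and take the majority answer.
    outputs = []
    for solve in candidates:
        try:
            outputs.append(solve(case_input))
        except Exception:
            continue                   # crashing candidates simply don't vote
    if not outputs:
        return None, "needs_llm_adjudication"

    answer, count = Counter(outputs).most_common(1)[0]
    if count / len(outputs) >= vote_threshold:
        return answer, "majority_vote"

    # No clear majority: leave the case for an LLM judge to settle.
    return None, "needs_llm_adjudication"
```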
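The article does not say which correlation statistic was used for the benchmark-alignment check in Group 5; a Spearman rank correlation over per-model scores is one common choice, sketched here with made-up placeholder numbers.

```python
# Illustrative benchmark-alignment check via Spearman rank correlation over
# per-model scores. All numbers below are made-up placeholders; the article
# does not publish them or specify which correlation statistic was used.
from scipy.stats import spearmanr

unicode_pass_rates       = [68.0, 55.0, 47.0, 39.0, 21.0]   # hypothetical
livecodebench_pass_rates = [80.0, 71.0, 66.0, 52.0, 30.0]   # hypothetical

rho, p_value = spearmanr(unicode_pass_rates, livecodebench_pass_rates)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho near +1 means the two benchmarks rank the same set of models consistently.
```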