Running out of hard Codeforces problems? Saining Xie and colleagues built an AI problem setter that generates original programming problems
36Kr · 2025-10-20 08:15

Core Insights

- The article discusses the importance of training large language models (LLMs) to generate high-quality programming competition problems, emphasizing that creating problems requires deeper algorithmic understanding than merely solving them [2][3][30]
- The research introduces AutoCode, a framework that automates the entire lifecycle of problem creation and evaluation for competitive programming using a closed-loop, multi-role system [3][30]

Group 1: Problem Creation and Evaluation

- Creating programming competition problems is harder than solving them, as it requires a profound understanding of underlying algorithm design principles and data structures [2]
- Existing test suites for programming competitions have high false positive rates (FPR) and false negative rates (FNR), which distort the evaluation environment [2][14]
- AutoCode employs a robust Validator-Generator-Checker framework to ensure high-quality input generation and minimize errors in problem evaluation [5][8][30]

Group 2: Performance Metrics

- AutoCode achieved a consistency rate of 91.1% in problem evaluation, significantly higher than previous methods, which did not exceed 81.0% [17]
- The framework reduced the FPR to 3.7% and the FNR to 14.1%, roughly a 50% decrease relative to state-of-the-art techniques [17][19]
- On a more challenging benchmark of 720 recent Codeforces problems, AutoCode maintained 98.7% consistency, validating its effectiveness on modern, difficult problems [19]

Group 3: Novel Problem Generation

- The team developed a novel problem generation framework that uses a dual verification protocol to ensure correctness without human intervention [23]
- The process begins with a "seed problem," which is modified to create new, often more challenging problems, with a focus on generating high-quality reference solutions [23][24]
- The dual verification protocol filtered out 27% of error-prone problems, raising the accuracy of reference solutions from 86% to 94% [24][30]

Group 4: Findings on LLM Capabilities

- LLMs can generate solvable problems that they themselves cannot solve, revealing a gap between their generative and problem-solving capabilities [27][29]
- The findings suggest that LLMs excel at "knowledge recombination" rather than true originality, often creating new problems by combining existing frameworks [32]
- Newly generated problems are typically harder than their seed problems, with the best quality observed when the seeds are of moderate difficulty [32]
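The Validator-Generator-Checker framework from Group 1 can be illustrated with a minimal sketch. This is an assumed structure, not AutoCode's actual code: the function names, the toy "sum a list" problem, and the constraint values are all hypothetical, chosen only to show how the three roles interact in a closed loop.

```python
# Illustrative sketch (assumed roles, not AutoCode's implementation) of a
# Validator-Generator-Checker loop: the generator proposes test inputs, the
# validator rejects any input that violates the problem's stated constraints,
# and the checker compares a candidate's output against the reference answer.
import random


def generator(rng):
    """Propose a random test input: here, a short list of ints (toy problem)."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]


def validator(xs, max_len=10, lo=-100, hi=100):
    """Accept only inputs that satisfy the problem's constraints."""
    return len(xs) <= max_len and all(lo <= x <= hi for x in xs)


def checker(candidate_out, reference_out):
    """Exact-match verdict; real checkers may accept multiple valid answers."""
    return candidate_out == reference_out


rng = random.Random(0)
# Closed loop: generate candidate inputs, keep only those the validator accepts.
cases = [xs for xs in (generator(rng) for _ in range(50)) if validator(xs)]

# Judge a candidate solution against the reference on all validated cases.
reference = sum
candidate = lambda xs: sum(xs)
verdict = all(checker(candidate(xs), reference(xs)) for xs in cases)
print(verdict)  # True: this candidate matches the reference everywhere
```

The validator is the key quality gate: any test input that slips past it with out-of-constraint values could wrongly reject correct solutions, which is exactly the FNR problem the article attributes to existing benchmarks.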
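The FPR and FNR figures in Group 2 follow from a simple definition, sketched below. The function name and the tiny verdict dataset are illustrative assumptions; only the definitions themselves come from the article: a false positive is an incorrect solution the test suite accepts, a false negative is a correct solution it rejects.

```python
# Hedged sketch: computing a test suite's false positive rate (FPR) and false
# negative rate (FNR). Each verdict pairs ground truth ("is this solution
# truly correct?") with the suite's decision ("did the tests accept it?").
def suite_error_rates(verdicts):
    """verdicts: list of (is_truly_correct, accepted_by_suite) pairs."""
    fp = sum(1 for ok, accepted in verdicts if accepted and not ok)
    fn = sum(1 for ok, accepted in verdicts if ok and not accepted)
    n_wrong = sum(1 for ok, _ in verdicts if not ok)
    n_right = sum(1 for ok, _ in verdicts if ok)
    fpr = fp / n_wrong if n_wrong else 0.0  # wrong solutions wrongly accepted
    fnr = fn / n_right if n_right else 0.0  # right solutions wrongly rejected
    return fpr, fnr


# Toy example: two incorrect solutions (one wrongly accepted) and two correct
# solutions (one wrongly rejected) give FPR = FNR = 0.5.
verdicts = [(False, True), (False, False), (True, True), (True, False)]
print(suite_error_rates(verdicts))  # (0.5, 0.5)
```

Under these definitions, AutoCode's reported 3.7% FPR means its generated test suites accept only a small fraction of genuinely wrong solutions, while the 14.1% FNR bounds how often overly strict tests reject correct ones.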
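The dual verification protocol from Group 3 can also be sketched in miniature. The article only states that generated problems are cross-checked without human intervention; the concrete form below, where an efficient reference and an independent brute-force solution must agree on every test input, is one plausible instantiation, and all names and the toy problem are assumptions.

```python
# Hedged sketch of a dual-verification filter: keep a generated problem only
# if two independently produced solutions agree on all generated test inputs.
# The specific agreement criterion and toy task are illustrative assumptions.
def fast_solution(xs):
    """Intended reference solution for the toy task (sum a list)."""
    return sum(xs)


def brute_solution(xs):
    """Independent brute-force re-derivation of the same task."""
    total = 0
    for x in xs:
        total += x
    return total


def buggy_solution(xs):
    """A flawed reference: off by one on any non-empty input."""
    return sum(xs) + (1 if xs else 0)


def dual_verify(inputs, sol_a, sol_b):
    """Accept the problem only if both solutions agree on every input."""
    return all(sol_a(i) == sol_b(i) for i in inputs)


tests = [[1, 2, 3], [], [5, -5, 10]]
print(dual_verify(tests, fast_solution, brute_solution))  # True: kept
print(dual_verify(tests, fast_solution, buggy_solution))  # False: filtered out
```

A filter of this shape is consistent with the reported numbers: discarding the problems where independent solutions disagree (27% in the article) removes exactly the cases most likely to carry a wrong reference answer, lifting reference-solution accuracy from 86% to 94%.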