AutoCode
Running out of hard Codeforces problems? Saining Xie and colleagues built an AI problem setter that can generate original programming problems
36Kr · 2025-10-20 08:15
Core Insights
- The article discusses the importance of training large language models (LLMs) to generate high-quality programming competition problems, emphasizing that creating problems requires deeper algorithmic understanding than merely solving them [2][3][30]
- The research introduces AutoCode, a framework that automates the entire lifecycle of problem creation and evaluation for competitive programming, using a closed-loop, multi-role system [3][30]

Group 1: Problem Creation and Evaluation
- Creating programming competition problems is more challenging than solving them, as it requires a profound understanding of underlying algorithm design principles and data structures [2]
- Existing test datasets for programming competitions have high false positive rates (FPR) and false negative rates (FNR), which can distort the evaluation environment [2][14]
- AutoCode employs a robust Validator-Generator-Checker framework to ensure high-quality input generation and minimize errors in problem evaluation (a minimal sketch of such a loop follows this summary) [5][8][30]

Group 2: Performance Metrics
- AutoCode achieved a consistency rate of 91.1% in problem evaluation, significantly higher than previous methods, which did not exceed 81.0% [17]
- The framework reduced the FPR to 3.7% and the FNR to 14.1%, roughly a 50% decrease compared to state-of-the-art techniques (how these rates can be computed is sketched below) [17][19]
- On a more challenging benchmark of 720 recent Codeforces problems, AutoCode maintained a consistency of 98.7%, validating its effectiveness on modern, difficult problems [19]

Group 3: Novel Problem Generation
- The team developed a novel problem generation framework that uses a dual verification protocol to ensure correctness without human intervention [23]
- The process begins with a "seed problem," which is modified to create new, often more challenging problems, with a focus on generating high-quality reference solutions [23][24]
- The dual verification protocol filtered out 27% of error-prone problems, increasing the accuracy of reference solutions from 86% to 94% [24][30]

Group 4: Findings on LLM Capabilities
- LLMs can generate solvable problems that they themselves cannot solve, indicating a limitation in their creative capabilities [27][29]
- The findings suggest that LLMs excel at "knowledge recombination" rather than true originality, often creating new problems by combining existing frameworks [32]
- Newly generated problems are typically more difficult than their seed problems, with the best quality observed when the seed problems are of moderate difficulty [32]
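The summary names the Validator-Generator-Checker loop but does not spell out how the pieces fit together. The following is a minimal Python sketch of how such a closed loop could be wired, assuming solutions are standalone executables judged over generated test cases; every name here (`validate_input`, `generate_cases`, `checker`, `judge`) is a hypothetical illustration, not the actual AutoCode API.

```python
import subprocess
from typing import Callable, List

def validate_input(case: str, max_n: int = 100_000) -> bool:
    """Validator: reject any generated input that violates the problem constraints,
    so malformed cases cannot produce spurious 'wrong answer' verdicts (false negatives)."""
    lines = case.strip().splitlines()
    try:
        n = int(lines[0])
    except (IndexError, ValueError):
        return False
    return 1 <= n <= max_n and len(lines) == 2  # e.g. "n" plus one line of n numbers

def generate_cases(generators: List[Callable[[], str]]) -> List[str]:
    """Generator: combine several strategies (random, extreme, adversarial) for broad
    coverage, which is what pushes the false positive rate down."""
    return [c for c in (g() for g in generators) if validate_input(c)]

def run(cmd: List[str], stdin: str) -> str:
    """Run a contestant or reference program on one test case."""
    return subprocess.run(cmd, input=stdin, capture_output=True,
                          text=True, timeout=5).stdout

def checker(case: str, expected: str, produced: str) -> bool:
    """Checker: compare outputs; for problems with several valid answers this would
    re-verify the produced answer instead of doing an exact token match."""
    return expected.split() == produced.split()

def judge(candidate_cmd: List[str], reference_cmd: List[str],
          generators: List[Callable[[], str]]) -> str:
    """Closed loop: generate, validate, run both programs, and check every case."""
    for case in generate_cases(generators):
        expected = run(reference_cmd, case)
        produced = run(candidate_cmd, case)
        if not checker(case, expected, produced):
            return "WRONG_ANSWER"
    return "ACCEPTED"
```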
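The reported figures (91.1% consistency, 3.7% FPR, 14.1% FNR) are defined relative to the official Codeforces verdicts. Below is a small sketch of how such rates could be computed from paired verdicts; the data layout and field names are assumptions for illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Judged:
    official_accepted: bool   # verdict from the official Codeforces tests
    framework_accepted: bool  # verdict from the generated test suite

def evaluate(results: List[Judged]) -> dict:
    """FPR: share of truly-wrong submissions that the generated tests let pass.
    FNR: share of truly-correct submissions that the generated tests reject.
    Consistency: share of all submissions where the two verdicts agree."""
    wrong = [r for r in results if not r.official_accepted]
    correct = [r for r in results if r.official_accepted]
    fpr = sum(r.framework_accepted for r in wrong) / len(wrong) if wrong else 0.0
    fnr = sum(not r.framework_accepted for r in correct) / len(correct) if correct else 0.0
    consistency = (sum(r.official_accepted == r.framework_accepted for r in results)
                   / len(results)) if results else 0.0
    return {"FPR": fpr, "FNR": fnr, "consistency": consistency}
```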
Running out of hard Codeforces problems? Saining Xie and colleagues built an AI problem setter that can generate original programming problems
机器之心· 2025-10-20 04:50
Core Insights
- The article discusses the importance of training large language models (LLMs) to generate high-quality programming problems, which is crucial for advancing their capabilities towards artificial general intelligence (AGI) [1][3].

Group 1: Problem Creation and Evaluation
- Creating programming competition problems requires a deeper understanding of algorithms than merely solving them, as competition problems must meet strict standards in order to evaluate underlying algorithm design principles [2].
- The ability to generate better problems will lead to more rigorous benchmarks for competitive programming, as existing datasets often suffer from high false positive and false negative rates [2][21].
- The AutoCode framework, developed by the LiveCodeBench Pro team, automates the entire lifecycle of creating and evaluating competitive programming problems using LLMs [3][7].

Group 2: Framework Components
- The AutoCode framework consists of a Validator, Generator, and Checker, ensuring that inputs adhere to problem constraints and minimizing false negatives [8][10].
- The Generator employs diverse strategies to create a wide range of inputs, aiming to reduce the false positive rate, while the Checker compares outputs against reference solutions (one way to mix such strategies is sketched after this summary) [12][14].
- A dual verification protocol is introduced to ensure correctness without human intervention, significantly improving the quality of generated problems [29].

Group 3: Performance Metrics
- The AutoCode framework achieved a consistency rate of 91.1%, with a false positive rate of 3.7% and a false negative rate of 14.1%, a significant improvement over previous methods [21][22].
- On a more challenging benchmark of 720 recent Codeforces problems, AutoCode maintained a consistency of 98.7%, validating its effectiveness on modern, difficult problems [24].
- The framework's performance was further validated through ablation studies, confirming the effectiveness of its components [26].

Group 4: Novel Problem Generation
- The team established a new problem generation framework that builds on robust test case generation, introducing a dual verification protocol to ensure correctness (a sketch of one plausible form of this protocol follows this summary) [29].
- LLMs can generate solvable problems that they themselves cannot solve, indicating a strength in knowledge recombination rather than original innovation [34].
- The quality of a generated problem is assessed by its difficulty and by the increase in difficulty over its seed problem, both of which prove to be reliable indicators [34][38].

Group 5: Conclusion
- The AutoCode framework represents a significant advancement in using LLMs as problem setters for competitive programming, achieving state-of-the-art reliability in test case generation and producing new, competition-quality problems [36].
- Despite the model's strength in recombining algorithmic knowledge, it struggles to introduce truly novel reasoning paradigms or flawless example designs [37].
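The Generator's "diverse strategies" are only named in the summary. A minimal sketch of what mixing random, boundary, and adversarial input makers might look like is given below; the specific strategies and limits are illustrative assumptions, not the strategies AutoCode actually uses.

```python
import random
from typing import Callable, List

def random_case(max_n: int = 1000, max_val: int = 10**9) -> str:
    """Uniformly random input: broad but shallow coverage."""
    n = random.randint(1, max_n)
    return f"{n}\n" + " ".join(str(random.randint(1, max_val)) for _ in range(n))

def boundary_case(max_n: int = 1000, max_val: int = 10**9) -> str:
    """Extreme input: maximum size with all-minimal or all-maximal values."""
    value = random.choice([1, max_val])
    return f"{max_n}\n" + " ".join(str(value) for _ in range(max_n))

def adversarial_case(max_n: int = 1000) -> str:
    """Structured pattern aimed at common wrong approaches, e.g. an already
    sorted sequence that defeats naive greedy solutions."""
    return f"{max_n}\n" + " ".join(str(i) for i in range(1, max_n + 1))

def build_suite(per_strategy: int = 20) -> List[str]:
    """Mix the strategies so weak-but-plausible solutions are more likely to be
    caught, which is what lowers the false positive rate of the test suite."""
    strategies: List[Callable[[], str]] = [random_case, boundary_case, adversarial_case]
    return [make() for make in strategies for _ in range(per_strategy)]
```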
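The summary does not detail the dual verification protocol. One plausible reading is that an efficient reference solution is cross-checked against an independently written brute-force solution on small, validated inputs, and any disagreement causes the generated problem to be discarded. The sketch below illustrates that reading only; the solution commands, the problem dictionary layout, and `small_case_maker` are all hypothetical.

```python
import subprocess
from typing import Callable, List

def agree_on_small_cases(reference_cmd: List[str], brute_cmd: List[str],
                         small_cases: List[str]) -> bool:
    """One plausible form of dual verification: run the efficient reference solution
    and an independent brute-force solution on small inputs and require identical
    answers on every case."""
    for case in small_cases:
        ref_out = subprocess.run(reference_cmd, input=case, capture_output=True,
                                 text=True, timeout=10).stdout.split()
        brute_out = subprocess.run(brute_cmd, input=case, capture_output=True,
                                   text=True, timeout=10).stdout.split()
        if ref_out != brute_out:
            return False
    return True

def filter_generated_problems(problems: List[dict],
                              small_case_maker: Callable[[dict], List[str]]) -> List[dict]:
    """Keep only generated problems whose two independent solutions agree; the
    article reports that this kind of filtering removed about 27% of error-prone
    problems and raised reference-solution accuracy from 86% to 94%."""
    kept = []
    for p in problems:
        cases = small_case_maker(p)  # hypothetical: problem-specific small inputs
        if agree_on_small_cases(p["reference_cmd"], p["brute_cmd"], cases):
            kept.append(p)
    return kept
```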