Test Case Generation

机器之心 (Synced) · 2025-10-20 04:50

Core Insights
- The article discusses why training large language models (LLMs) to generate high-quality programming problems matters for advancing their capabilities towards artificial general intelligence (AGI) [1][3].

Group 1: Problem Creation and Evaluation
- Creating programming competition problems requires a deeper understanding of algorithms than merely solving them, because competition problems are held to strict standards that test the underlying algorithm design principles [2].
- Better problem generation will lead to more rigorous benchmarks for competitive programming, since existing datasets often suffer from high false positive and false negative rates [2][21].
- The AutoCode framework, developed by the LiveCodeBench Pro team, uses LLMs to automate the entire lifecycle of creating and evaluating competitive programming problems [3][7].

Group 2: Framework Components
- The AutoCode framework consists of a Validator, a Generator, and a Checker. The Validator ensures that inputs adhere to the problem constraints, which minimizes false negatives [8][10].
- The Generator employs diverse strategies to create a wide range of inputs, aiming to reduce the false positive rate, while the Checker compares outputs against reference solutions [12][14].
- A dual verification protocol ensures correctness without human intervention, significantly improving the quality of generated problems [29].

Group 3: Performance Metrics
- The AutoCode framework achieved a consistency rate of 91.1%, with a false positive rate of 3.7% and a false negative rate of 14.1%, a significant improvement over previous methods [21][22].
- On a more challenging benchmark of 720 recent Codeforces problems, AutoCode maintained a consistency of 98.7%, validating its effectiveness on modern, difficult problems [24].
- Ablation studies further confirmed the contribution of each component [26].
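
The Validator, Generator, and Checker roles and the three reported rates can be illustrated with a toy sketch. This is not AutoCode's implementation: the problem (add two integers), every function name, and the `LIMIT` constant are hypothetical, and the metric definitions are an assumption based on the usual reading, where consistency is the share of submissions whose verdict on the generated tests matches the official verdict, the false positive rate is the share of officially wrong solutions that the generated tests accept, and the false negative rate is the share of officially correct solutions that they reject.

```python
import random
from typing import Callable, Dict, List, Tuple

Case = Tuple[int, int]
Solution = Callable[[int, int], int]

# Hypothetical toy problem: given 1 <= a, b <= LIMIT, output a + b.
LIMIT = 2 * 10**9


def validator(case: Case) -> bool:
    """Accept only inputs that satisfy the stated constraints."""
    a, b = case
    return 1 <= a <= LIMIT and 1 <= b <= LIMIT


def generator(rng: random.Random, n_random: int = 20) -> List[Case]:
    """Combine boundary cases with random cases; drop anything invalid."""
    cases = [(1, 1), (1, LIMIT), (LIMIT, LIMIT)]
    cases += [(rng.randint(1, LIMIT), rng.randint(1, LIMIT)) for _ in range(n_random)]
    return [c for c in cases if validator(c)]


def reference(a: int, b: int) -> int:
    """Trusted reference solution."""
    return a + b


def checker(expected: int, actual: int) -> bool:
    """Exact-match checker; real problems may need tolerances or multi-answer logic."""
    return expected == actual


def accepted(solution: Solution, tests: List[Case]) -> bool:
    """A solution is accepted only if it passes every test."""
    return all(checker(reference(a, b), solution(a, b)) for a, b in tests)


def saturating(a: int, b: int) -> int:
    """Buggy candidate: mimics a 32-bit result that cannot exceed 2**31 - 1."""
    return min(a + b, 2**31 - 1)


def rates(verdicts: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """verdicts: (officially_correct, accepted_by_generated_tests) pairs."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    right = [acc for ok, acc in verdicts if ok]
    wrong = [acc for ok, acc in verdicts if not ok]
    return {
        "consistency": sum(ok == acc for ok, acc in verdicts) / len(verdicts),
        "fpr": sum(wrong) / len(wrong) if wrong else 0.0,
        "fnr": sum(not acc for acc in right) / len(right) if right else 0.0,
    }


if __name__ == "__main__":
    strong = generator(random.Random(0))
    weak = [(1, 1)]
    for name, tests in (("strong", strong), ("weak", weak)):
        verdicts = [(True, accepted(reference, tests)),
                    (False, accepted(saturating, tests))]
        print(name, rates(verdicts))
```

With the boundary case `(LIMIT, LIMIT)` in the suite, the buggy `saturating` solution is rejected. With the one-case suite `[(1, 1)]` it is accepted, which is a false positive. The sketch shows why a varied Generator lowers the false positive rate, and why a Validator matters: an input outside the constraints could fail a correct solution and raise the false negative rate.
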
Group 4: Novel Problem Generation
- The team established a new problem generation framework that builds on robust test case generation, introducing a dual verification protocol to ensure correctness [29].
- LLMs can generate solvable problems that they themselves cannot solve, indicating a strength in knowledge recombination rather than original innovation [34].
- The quality of generated problems is assessed by their difficulty and by the increase in difficulty relative to the seed problems, which provide reliable indicators of problem quality [34][38].

Group 5: Conclusion
- The AutoCode framework represents a significant advancement in using LLMs as problem setters for competitive programming, achieving state-of-the-art reliability in test case generation and producing new, competition-quality problems [36].
- Despite the model's strengths in algorithmic knowledge recombination, it struggles to introduce truly novel reasoning paradigms or flawless example designs [37].