数学定理证明模型

Search documents
普林斯顿团队领衔发布最强开源数学定理证明模型:32B性能大幅超越前代SOTA DeepSeek 671B
机器之心· 2025-07-17 05:03
Core Insights - The article discusses the launch of Goedel-Prover-V2, a new open-source mathematical theorem proving model led by Princeton University in collaboration with several top institutions, including Tsinghua University and Stanford University. The model significantly outperforms previous state-of-the-art models in various benchmarks [1][10]. Performance Highlights - The 32B flagship model achieved an 8.0% improvement in Pass@32 accuracy on the MiniF2F test compared to the previous SOTA model, DeepSeek-Prover-V2-671B [6]. - The 8B model demonstrated performance on par with the 671B SOTA model, showcasing efficiency and capability breakthroughs [7][22]. - Goedel-Prover-V2 ranked first on the challenging PutnamBench, solving 64 problems with a Pass@64 metric, outperforming DeepSeek-Prover-V2-671B, which solved 47 problems at Pass@1024 [9][14][20]. Technical Innovations - The development process of Goedel-Prover-V2 incorporates expert iteration and reinforcement learning, along with three key innovations: - Model averaging enhances robustness and overall performance by integrating model weights from different training nodes [12][32]. - Scaffolded data synthesis allows for the automatic generation of progressively challenging proof tasks, facilitating smoother training [13][26]. - Verifier-guided self-correction enables the model to iteratively refine its proofs using feedback from the Lean compiler, simulating human-like self-correction [13][32]. Benchmark Results - In the MiniF2F test, the 8B model achieved a Pass@32 rate of 83.3%, surpassing the performance of the 671B SOTA model [12]. - The flagship model reached Pass@32 rates of 88.1% in standard mode and 90.4% in self-correction mode, significantly exceeding previous models [12]. - The performance of Goedel-Prover-V2-32B remained consistently superior across various reasoning sampling budgets compared to earlier models [21][22]. Model and Dataset Availability - The Goedel-Prover-V2 model and the new MathOlympiadBench benchmark dataset have been publicly released to support research in the field [28][30]. - MathOlympiadBench includes 360 formalized problems from international mathematics competitions, aimed at enhancing preparation for events like the International Math Olympiad [30][31].