Formal Reasoning
DeepMind officially launches its "math problem-solver AI"
Ke Ji Ri Bao (Science and Technology Daily) · 2025-11-13 01:00
DeepMind revealed in 2024 that its hybrid AI system had performed strongly at that year's IMO, falling just one point short of a gold medal. It has now formally published a paper introducing and detailing the system.

[Editor-in-Chief's Note] This breakthrough is seen as another milestone in AI research, because testing AI systems on high-level competition problems has become an important standard for evaluating their logical reasoning, abstract thinking, and problem-solving ability. Such problems demand not only rigorous deductive reasoning but also creative strategies and cross-domain knowledge integration, going far beyond ordinary question answering or pattern-recognition tasks. Strong performance in authoritative competitions such as the IMO is therefore regarded as a key touchstone for whether an AI possesses "human-like" deep reasoning.

Mathematicians have long relied on computational tools to tackle complex problems and construct rigorous proofs, and AI promises to accelerate this process. AI has now taken a key step in formal reasoning: unlike general-purpose AI built on fuzzy language models, this latest system operates within a strict logical framework in which every inference step can be verified, greatly improving the reliability of its results. This not only pushes past existing limits of AI reasoning but also provides a new tool for exploring complex mathematical conjectures, and it opens a realistic path toward human-machine collaboration on frontier scientific problems. Its influence is expected to extend to theoretical computer science, automated theorem proving, and even foundational mathematics research.

Science and Technology Daily, Beijing, Nov. 12 (reporter Zhang Mengran): Nature on the 12th published a major result: UK-based DeepMind formally released its "math ...
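The article's central point is that each inference step is machine-checkable. In a proof assistant such as Lean (shown here purely as an illustrative example of formal reasoning, not as DeepMind's actual code), a theorem only compiles if the kernel can verify every step of the proof:

```lean
-- A tiny Lean 4 theorem: commutativity of addition on naturals.
-- The kernel checks the whole derivation; `Nat.add_comm` is a
-- standard library lemma supplying the proof term.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the proof term were wrong or incomplete, compilation would fail; there is no way for an unverified claim to slip through, which is what distinguishes formal provers from free-form language-model reasoning.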
Princeton-led team releases the strongest open-source math theorem-proving model: 32B far surpasses previous SOTA DeepSeek 671B
机器之心· 2025-07-17 05:03
Core Insights
- The article discusses the launch of Goedel-Prover-V2, a new open-source mathematical theorem-proving model led by Princeton University in collaboration with several top institutions, including Tsinghua University and Stanford University. The model significantly outperforms previous state-of-the-art models across various benchmarks [1][10].

Performance Highlights
- The 32B flagship model achieved an 8.0% improvement in Pass@32 accuracy on the MiniF2F test compared to the previous SOTA model, DeepSeek-Prover-V2-671B [6].
- The 8B model matched the performance of the 671B SOTA model, a notable efficiency breakthrough [7][22].
- Goedel-Prover-V2 ranked first on the challenging PutnamBench, solving 64 problems at Pass@64 and outperforming DeepSeek-Prover-V2-671B, which solved 47 problems at Pass@1024 [9][14][20].

Technical Innovations
- The development of Goedel-Prover-V2 combines expert iteration and reinforcement learning with three key innovations:
  - Model averaging enhances robustness and overall performance by merging model weights from different training checkpoints [12][32].
  - Scaffolded data synthesis automatically generates proof tasks of progressively increasing difficulty, smoothing the training curriculum [13][26].
  - Verifier-guided self-correction lets the model iteratively refine its proofs using feedback from the Lean compiler, mimicking human self-correction [13][32].

Benchmark Results
- On the MiniF2F test, the 8B model achieved a Pass@32 rate of 83.3%, surpassing the 671B SOTA model [12].
- The flagship model reached Pass@32 rates of 88.1% in standard mode and 90.4% in self-correction mode, significantly exceeding previous models [12].
- Goedel-Prover-V2-32B remained consistently superior to earlier models across a range of inference sampling budgets [21][22].
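The model-averaging idea above merges parameter weights from checkpoints taken at different points in training. The article does not give the exact recipe, so the following is a minimal uniform-averaging sketch over plain Python dicts; a real implementation would average framework tensors (e.g. PyTorch `state_dict`s) in the same way.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints that share the
    same parameter names. Each checkpoint is a dict name -> value;
    values here are plain floats standing in for weight tensors."""
    n = len(checkpoints)
    keys = checkpoints[0].keys()
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n for k in keys}
```

The design intuition is that checkpoints from nearby training stages sit in the same loss basin, so their average tends to be at least as good as any single one and is more robust to noise in any individual checkpoint.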
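The Pass@k figures quoted above report whether any of k sampled proof attempts verifies. The article does not say which estimator was used; a standard unbiased estimator (popularized by the HumanEval evaluation methodology) computed from n total samples of which c are correct is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c correct) succeeds.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 attempts of which c=2 verify, pass@2 is 1 - C(2,2)/C(4,2) = 1 - 1/6 ≈ 0.833.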
Model and Dataset Availability
- The Goedel-Prover-V2 model and the new MathOlympiadBench benchmark dataset have been publicly released to support research in the field [28][30].
- MathOlympiadBench includes 360 formalized problems from international mathematics competitions, aimed at strengthening preparation for events like the International Mathematical Olympiad [30][31].
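The verifier-guided self-correction described in the article can be sketched as a generate-verify-regenerate loop. The sketch below uses a hypothetical stub in place of a real Lean compiler call and a caller-supplied generation function; it illustrates the control flow only, not Goedel-Prover-V2's actual implementation.

```python
def mock_lean_verify(proof: str):
    """Stub standing in for invoking the Lean compiler (hypothetical).
    Returns (ok, error_message)."""
    if "sorry" in proof:
        return False, "proof contains a 'sorry' placeholder"
    return True, ""

def self_correct(generate, verify, max_rounds: int = 3):
    """Generate a proof, then repeatedly feed verifier errors back into
    the generator until the proof checks or the budget is exhausted."""
    proof = generate(None)  # initial attempt, no feedback yet
    for _ in range(max_rounds):
        ok, err = verify(proof)
        if ok:
            return proof
        proof = generate(err)  # regenerate conditioned on compiler feedback
    return None  # failed within budget
```

In the real system, `generate` would be the prover model prompted with the failed proof and the compiler's error output, and `verify` would compile the candidate proof with Lean.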