Formal Mathematical Reasoning
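Peking University and Huawei joint team takes the title: 33 teams compete in formal mathematics competition as a Chinese-developed large model cracks the hard nut of formal proof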
量子位· 2025-12-20 06:30
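Contributed by Lean说的都队 | 量子位 (WeChat official account QbitAI)
Large language models frequently "hallucinate" in mathematical reasoning, so how can an AI's mathematical proofs be made as rigorous and reliable as a human mathematician's? This problem, which has troubled the AI research community for years, found a breakthrough answer in the recently concluded CCF "Formal Mathematics Competition for Large Models". A joint team named "Lean说的都队" stood out among 33 participating teams and won the championship with the highest total score. The joint Peking University and Huawei team, relying on Huawei's openPangu-Ultra-MoE-718B and an innovative technical architecture, achieved an important breakthrough in formal mathematical reasoning, one of AI's hardest problems.
An authoritative competition aimed at the mathematical weak spot of large models: hosted by the China Computer Federation (CCF) and supported by 蚂蚁数字科技 (Ant Digital Technologies) and several other well-known organizations, the competition targets the core pain points of large models in mathematical reasoning, namely hallucination and unreliability. As an important part of the CCF Big Data & Computing Intelligence Contest (CCF BDCI), the event attracted 33 top teams from around the world. Unlike conventional mathematical question answering, the competition requires participating models to convert mathematical problems stated in natural language directly into formal proof code (Lean/Litex) that a computer can verify, and no natural-language explanation is allowed anywhere in the process. In effect, the AI must be both a mathematician and a programmer: it must understand the essence of the mathematical problem and also use rigorous programming ...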
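To make "formal proof code that a computer can verify" concrete, here is a minimal Lean 4 sketch. The statement and the theorem name `add_comm_example` are hypothetical illustrations chosen here, not a competition problem or the team's output. It relies only on `Nat.add_comm` and `rfl` from Lean's core library.

```lean
-- Natural-language statement (illustrative):
--   "For all natural numbers a and b, a + b = b + a."
-- Formal counterpart: the statement is the theorem's type, and the proof
-- is a term that Lean's kernel checks mechanically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, verified by computation alone.
example : 2 + 3 = 5 := rfl
```

If either proof were wrong, Lean would reject the file outright. That all-or-nothing check is what rules out the hallucinated steps that natural-language answers can hide.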
The real IMO champion: perfect score, three golds in five appearances, just admitted to MIT
机器之心· 2025-07-23 10:36
Core Viewpoint - The article highlights the performance of AI models at the International Mathematical Olympiad (IMO), particularly ByteDance's Seed Prover, alongside the achievement of human contestant Warren Bei, who scored a perfect 42/42, showing AI and human intelligence meeting in mathematics [3][4][5].
Group 1: AI Performance
- ByteDance's Seed Prover solved 4 of the 6 IMO problems and scored 30 points, which is recognized as a silver-medal performance [4].
- The article emphasizes the growing interest in, and progress on, AI formal mathematical reasoning, particularly in competitive settings such as the IMO [3][4].
Group 2: Warren Bei's Achievements
- Warren Bei, an 11th-grade student from Canada, scored a perfect 42/42 at the IMO, a feat accomplished by only five contestants worldwide this year [5][6].
- He has competed in the IMO five times, earning three gold medals and two silver medals, a record of consistent improvement and dedication [9][15].
- He has also won the Canadian Mathematical Olympiad (CMO) multiple times, starting at a young age, which has made him a prominent figure in the mathematics community [16][17].
Group 3: Personal Insights and Future Aspirations
- Warren Bei describes a passion for mathematics, saying that the joy lies in the process of problem-solving rather than in the awards themselves [18].
- He keeps an open attitude toward his future, considering several academic paths while stressing the importance of understanding the practical applications of mathematics [12][13].
- His approach to hard problems is philosophical: he treats intuition and perseverance as the keys to overcoming difficulties [19].
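Pushing the limits of AI mathematical reasoning! Large-scale formal mathematics benchmark FormalMATH released; the strongest model succeeds only 16% of the time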
量子位· 2025-05-07 09:33
Core Insights - The FormalMATH benchmark, developed by institutions including The Chinese University of Hong Kong and Zhejiang University, consists of 5,560 rigorously validated mathematical problems spanning Olympiad-level to undergraduate-level topics, and is 22.8 times larger than existing benchmarks [1][5][4].
Group 1: Performance of LLMs
- Current LLM-driven theorem provers perform far below expectations: the best model, Kimina-Prover, achieves a success rate of only 16.46% under resource constraints [3][15].
- Most models perform close to random guessing in calculus and some other areas, indicating a substantial capability gap [3][7].
- There is a notable domain bias: models do better in algebra and worse in calculus [11][12].
Group 2: Error Analysis
- Common error patterns include:
  - Redundant assumptions (34%): introducing irrelevant premises [16].
  - Incomplete proofs (62%): missing critical steps in the proof [16].
  - Misuse of automation tactics (65%): applying automated tools incorrectly [16].
  - Inability to handle inequalities correctly (13%): over-reliance on automated inequality tactics [16].
- The analysis shows that LLM provers often resort to shortcut tactics, which leads to significant errors [14].
Group 3: Future Directions
- Three areas of focus are proposed to strengthen the formal reasoning capabilities of LLMs:
  - Strengthening multi-step planning to reduce reliance on single-step tactics [19].
  - Cross-domain generalization through curriculum learning, balancing training data across mathematical fields [19].
  - Developing interactive proof-assistance tools for collaboration between LLMs and human experts [19].
Group 4: Open Source Initiative - The research team has made the FormalMATH benchmark's code, training data, and evaluation models publicly available, encouraging collaboration between academia and industry to advance formal mathematical reasoning technologies [20][21].
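The error analysis contrasts one-shot automation with multi-step planning. The two proof styles can be sketched in Lean 4 on a toy goal chosen here for illustration (not a FormalMATH problem). The sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Single-step style: hand the whole goal to one automation tactic.
-- This works when the goal falls inside the tactic's fragment (here,
-- linear arithmetic over Nat) and fails outright when it does not.
example (a b c : Nat) : a + b + c = c + (b + a) := by
  omega

-- Multi-step style: plan explicit rewrites, each justified by a named lemma.
example (a b c : Nat) : a + b + c = c + (b + a) := by
  rw [Nat.add_comm a b]        -- goal becomes: b + a + c = c + (b + a)
  rw [Nat.add_comm (b + a) c]  -- both sides now coincide; `rw` closes the goal by rfl
```

On a goal this small both styles succeed. The reported failures arise when a prover reaches for the first style on goals the automation cannot handle, which is why the summary lists multi-step planning as a direction for improvement.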
AI's next big opportunity? Former DeepSeek member Xin Huajian (辛华剑) on mathematical reasoning | Deep Talk
锦秋集· 2025-05-03 08:51
Core Viewpoint - DeepSeek has released a new model, DeepSeek-Prover-V2-671B, which focuses on formal mathematical reasoning, addressing a significant challenge in AI and opening up high-value commercial opportunities [1][2].
Group 1: Model Development and Impact
- The DeepSeek-Prover series combines the generalization capabilities of large language models (LLMs) with formal tools such as Lean, achieving large-scale end-to-end conversion from natural-language descriptions to machine-verifiable proofs [2].
- This breakthrough could raise the efficiency of mathematical research several times over and open new possibilities for AI in fields that demand mathematical rigor, such as financial modeling, chip verification, and cryptography [2].
Group 2: Event Information
- A transoceanic dialogue event will take place on May 9, 2025, featuring former DeepSeek member Xin Huajian, who will discuss the formal mathematics revolution in the era of large language models [3][4].
- The event will also include a presentation by Zang Tianyu of Jinqiu Capital on AI investment trends for 2025 [3][4].
Group 3: Organizers and Participants
- Jinqiu Capital focuses on AI investments and manages a 12-year long-term fund, actively supporting early-stage entrepreneurs with a strategy of aggressive follow-on investment [6].
- The Cambridge China AI Association aims to connect the Chinese AI industry with global academia and industry, facilitating the flow of resources between China and the UK [7].