Mathematical Reasoning
World's First IMO Gold-Medal AI Is Born! Google Gemini Shatters the Math Olympiad Myth, Stunning Judges with a Score of 35
猿大侠· 2025-07-22 03:33
Core Viewpoint
- Google DeepMind has officially announced that its model, Gemini Deep Think, won a gold medal at the International Mathematical Olympiad (IMO), solving five problems in 4.5 hours for a score of 35 out of 42, a significant milestone for AI in mathematics [3][4][22].

Group 1: Achievement and Recognition
- Gemini Deep Think is the first AI system to receive official gold-medal recognition from the IMO committee [6][7].
- The IMO, held annually since 1959, is a prestigious competition that tests the mathematical abilities of students worldwide [11][12].
- Participants must solve six complex mathematical problems within a limited time, and only roughly the top 8% receive gold medals [13][16].

Group 2: Technical Aspects of Gemini Deep Think
- Unlike previous systems, Gemini Deep Think operates entirely in natural language, generating rigorous mathematical proofs directly from the problem statements [29][32].
- The model employs advanced reasoning techniques, including parallel thinking, which lets it explore multiple solution paths simultaneously (a conceptual sketch follows this article) [33][38].
- Training combined reinforcement learning with access to a curated database of high-quality mathematical solutions [37][126].

Group 3: Problem-Solving Process
- The model's approach was methodical, breaking complex proofs into clear, understandable steps [24][41].
- On the first problem, for example, it reduced the problem to a specific case and established a lemma to prove the core condition [44][50].
- Gemini's solutions were noted for their clarity and precision, earning praise from the IMO judges [24][87].

Group 4: Future Implications
- Google plans to make the advanced version of Gemini Deep Think available to select mathematicians and Google AI Ultra subscribers [39].
- The result highlights AI's potential to contribute meaningfully to mathematics by combining natural-language fluency with rigorous reasoning [102][105].
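DeepMind has not published implementation details for "parallel thinking," so the following is only a conceptual sketch of the general best-of-N idea it gestures at: sample several independent solution paths concurrently and keep the one a selector prefers. Every name here (`generate_candidate_proof`, `score_candidate`, `parallel_think`) is a hypothetical stub, not part of any Gemini API.

```python
import concurrent.futures
import random

# Hypothetical stubs: neither function is part of any published Gemini
# interface; they exist only to make the control flow concrete.
def generate_candidate_proof(problem: str, seed: int) -> str:
    """Sample one complete proof attempt (stubbed with a placeholder string)."""
    return f"proof attempt #{seed} for: {problem}"

def score_candidate(proof: str) -> float:
    """Estimate proof quality; a real system might use a learned verifier."""
    return random.random()  # placeholder score

def parallel_think(problem: str, n_paths: int = 8) -> str:
    """Explore several solution paths concurrently, keep the best-scoring one."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_paths) as pool:
        candidates = list(pool.map(
            lambda seed: generate_candidate_proof(problem, seed),
            range(n_paths)))
    return max(candidates, key=score_candidate)

if __name__ == "__main__":
    print(parallel_think("IMO 2025 Problem 1"))
```

The hard part in practice is the selection step; simple self-consistency voting or a learned verifier are common choices, but the article does not say which mechanism Gemini Deep Think uses.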
DeepSeek Open-Sources a New Model with a Major Boost in Mathematical Reasoning
Hu Xiu· 2025-05-01 00:48
Core Insights
- DeepSeek has officially released DeepSeek-Prover-V2 on Hugging Face, continuing its open-source momentum with two versions launched [1][4].
- The training core of DeepSeek-Prover-V2 combines recursion with reinforcement learning, enabling the model to decompose complex theorems into sub-goals and reasoning paths [3][8].

Model Specifications
- DeepSeek-Prover-V2-7B is based on the previous V1.5 model and supports a maximum context length of 32K [4].
- DeepSeek-Prover-V2-671B is built on DeepSeek-V3-Base and shows the strongest reasoning performance [4].

Training Process
- Training proceeds in two phases. The first uses an "expert iteration" method in rapid mode: proof attempts that pass verification are fed back to refine the model [5].
- The second phase trains more complex logical reasoning, incorporating mathematical knowledge from DeepSeek-V3 together with formal data [6].

Reinforcement Learning
- The GRPO (Group Relative Policy Optimization) reinforcement learning algorithm is introduced to strengthen reasoning, letting the model learn autonomously to select the best solution from multiple candidates [8].
- For each theorem the system generates 32 candidate proofs and retains only those the Lean verification system confirms as correct (a minimal sketch of this generate-and-verify loop follows this article) [9].

Model Distillation
- After building the powerful 671B model, the team distilled its capabilities into the smaller 7B model, giving users near-equivalent mathematical reasoning on resource-limited devices [10][11].

Reasoning Modes
- The rapid (non-CoT) mode optimizes for speed, emitting concise Lean code answers without showing the thought process, suitable for processing large numbers of problems (see the toy Lean example below) [12].
- The logical (CoT) mode spells out each step of the reasoning process, ensuring clarity and transparency [12].

Performance Evaluation
- In the final assessment, DeepSeek-Prover-V2-671B achieved an 88.9% pass rate on the MiniF2F test and solved 49 problems from the PutnamBench dataset [17].

New Dataset
- DeepSeek also introduced ProverBench, a new formal mathematics dataset of 325 problems spanning domains such as number theory, algebra, and calculus [18][19].

Comparison and Trends
- The comparison reveals a clear trend: the performance gap between large language models' "informal" and "formal" mathematical reasoning is narrowing [21].
- Advances in model architecture and training strategy now let models produce rigorous, machine-verifiable proofs [22].

Future Directions
- DeepSeek-Prover-V2 signals a shift in focus from merely generating content to generating structured logic, which may bear on the foundational structure of general artificial intelligence [33][34].
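The article describes a generate-and-verify data loop but gives no code, so the sketch below only shows its general shape under stated assumptions: `sample_proof` is a hypothetical stand-in for the prover model, and invoking a `lean` executable on a temporary file is one plausible way to run the verifier, not DeepSeek's published pipeline.

```python
import subprocess
import tempfile
from pathlib import Path

N_CANDIDATES = 32  # the article says 32 proof attempts are sampled per theorem

def sample_proof(theorem_statement: str, index: int) -> str:
    """Hypothetical stand-in for sampling one Lean proof from the prover model."""
    return f"-- candidate {index} for: {theorem_statement}\ntheorem t : True := trivial"

def lean_verifies(proof_source: str) -> bool:
    """Check a candidate by compiling it with Lean.

    Assumes a `lean` executable on PATH; the real verification setup is not
    described in the article beyond "verified by the Lean system".
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = Path(f.name)
    try:
        result = subprocess.run(["lean", str(path)],
                                capture_output=True, timeout=60)
        return result.returncode == 0  # type-checks iff exit code is zero
    finally:
        path.unlink()

def verified_proofs(theorem_statement: str) -> list[str]:
    """Keep only candidates that type-check; these become training data."""
    candidates = [sample_proof(theorem_statement, i) for i in range(N_CANDIDATES)]
    return [p for p in candidates if lean_verifies(p)]
```

The key property of this loop is that the reward signal is binary and exact: a proof either compiles or it does not, which is what makes rejection-sampled formal proofs safe to feed back into training.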
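For readers unfamiliar with Lean, a formal proof is a machine-checkable term rather than prose. A toy Lean 4 example (written for illustration, not taken from ProverBench) of the kind of concise output the rapid mode emits:

```lean
-- A statement plus a proof term; the Lean checker either accepts or rejects it.
-- Nat.add_comm is a lemma from the Lean 4 core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```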