DeepSeek's New Release: Open-Source Model Reaches IMO Gold-Medal Level for the First Time as AI Reasoning Bids Farewell to "Rote Memorization"
Guan Cha Zhe Wang · 2025-11-28 07:17
Core Insights
- DeepSeek has released its latest technology achievement, DeepSeek-Math-V2, which focuses on enhancing mathematical reasoning and theorem-proving capabilities in large language models and has 685 billion parameters [1][5]

Performance Highlights
- DeepSeek-Math-V2 achieved gold-medal level at the 2025 International Mathematical Olympiad (IMO) and the 2024 Chinese Mathematical Olympiad (CMO), and scored 118 out of 120 on Putnam 2024, surpassing the historical human record of approximately 90 points [1][3]
- On the IMO-ProofBench benchmark, Math-V2 scored nearly 99% on the basic set, significantly outperforming Google's Gemini DeepThink at 89%. On the advanced set, Math-V2 scored 61.9%, slightly below Gemini DeepThink's 65.7% [4]

Technological Innovations
- DeepSeek-Math-V2 addresses the "illusion of reasoning" problem highlighted by former OpenAI chief scientist Ilya Sutskever, moving beyond mere answer correctness to ensure rigorous logical reasoning [5][6]
- The model employs a strict "process-focused" strategy, requiring clear and logical step-by-step derivations, and does not reward correct final answers if intermediate steps are flawed (a minimal sketch of this reward rule follows this summary) [6]
- A multi-level "Meta-Verification" mechanism enhances the reliability of scoring, increasing the confidence level from 0.85 to 0.96 [9]

Industry Impact
- The release of DeepSeek-Math-V2 has generated significant buzz in the overseas developer community, marking a strong comeback for DeepSeek and breaking the long-standing dominance of closed-source models in top reasoning capabilities [11]
- The model's success in mathematical reasoning is expected to influence the coding-model space, potentially disrupting existing code-assistance tools [11]
- The global AI landscape is transitioning from "text generation" to "logical reasoning," with DeepSeek's approach providing a clear path for technological evolution through rigorous validation mechanisms rather than sheer computational power [11]
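The process-focused reward rule lends itself to a compact illustration. Below is a minimal sketch, not DeepSeek's actual code: the `Step` structure, its `verified` flag, and the binary reward are illustrative assumptions standing in for whatever step-level verifier the model actually uses. It captures the rule the article describes: a correct final answer earns no reward unless every intermediate step also checks out.

```python
from dataclasses import dataclass

@dataclass
class Step:
    claim: str       # one derivation step of the proof
    verified: bool   # verdict from a step-level verifier (assumed interface)

def process_reward(steps: list[Step], final_answer_correct: bool) -> float:
    """Process-focused reward: full credit only for fully rigorous derivations."""
    if not steps:
        return 0.0
    all_steps_sound = all(s.verified for s in steps)
    # A correct answer reached through a flawed step still earns nothing.
    return 1.0 if (all_steps_sound and final_answer_correct) else 0.0

# Right final answer, broken derivation -> reward 0.0
proof = [
    Step("expand (a+b)^2 = a^2 + 2ab + b^2", verified=True),
    Step("divide both sides by a - b (invalid when a = b)", verified=False),
]
print(process_reward(proof, final_answer_correct=True))  # prints 0.0
```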
DeepSeek Again Breaks the Google-OpenAI Monopoly: An Open-Source IMO Math Gold-Medal Model
QbitAI · 2025-11-28 01:53
Core Insights
- DeepSeek has released a new mathematical model, DeepSeekMath-V2, focusing on self-verifiable mathematical reasoning [1][7]
- The model achieved gold-medal-level scores at IMO 2025 and CMO 2024, and scored 118/120 on Putnam 2024, surpassing the highest human score of 90 [2][43]
- DeepSeekMath-V2 is the first open-source IMO gold-medal model, raising competitive pressure on companies like Google and OpenAI [4][5]

Model Performance
- DeepSeekMath-V2 outperforms GPT-5-Thinking-High and Gemini 2.5-Pro across all CNML problem categories, including algebra, geometry, number theory, combinatorics, and inequalities [2][34]
- The model has 685 billion parameters and emphasizes strong proof-verification capabilities [7]

Training Methodology
- The training process is an iterative reinforcement learning loop that alternates between optimizing the proof verifier and the proof generator (a skeleton of this loop follows this summary) [9]
- A large dataset of 17,500 proof-required math problems was collected from AoPS competitions to train the proof verifier [12]
- The verifier is trained to identify issues in proofs and to assign scores on three levels of correctness [10]

Meta-Verification Mechanism
- A meta-verification mechanism was introduced to enhance the verifier's accuracy by assessing whether the issues it identifies are themselves valid [14]
- The meta-verifier is trained on a dataset built from expert evaluations of the verifier's output [15]

Proof Generation
- The trained verifier serves as the reward model for the proof generator, which learns to self-review and correct its own outputs [23]
- The reward structure encourages accurate self-assessment and correction of errors in generated proofs [27]

Automation and Efficiency
- The collaboration between the verifier and generator yields a fully automated data-labeling process, replacing time-consuming manual annotation [29][35]
- The automated process maintains high consistency with expert evaluations, significantly improving efficiency [35]

Experimental Results
- The model's average quality score for proof analysis improved from 0.85 to 0.96, demonstrating the effectiveness of the meta-verification mechanism [21]
- The model's ability to generate correct proofs was validated through rigorous testing, showing superior performance across mathematical problem categories [34][39]
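To make the training loop concrete, here is a minimal skeleton of the alternating verifier-generator optimization the summary describes. Every method name (`generate_proof`, `verify`, `meta_verify`, `update`) is a hypothetical placeholder, not a DeepSeek API, and the score encoding is assumed, since the article does not expose the actual interfaces.

```python
def training_round(generator, verifier, meta_verifier, problems):
    """One round of the alternating loop (skeleton; all interfaces assumed)."""
    verifier_batch, generator_batch = [], []
    for problem in problems:
        proof = generator.generate_proof(problem)   # generator drafts a proof
        critique = verifier.verify(problem, proof)  # issues found + graded score
        # Meta-verification: accept a critique only if the issues it raises are
        # themselves judged valid, filtering unreliable reward signals.
        if meta_verifier.meta_verify(problem, proof, critique).is_valid:
            generator_batch.append((problem, proof, critique.score))  # RL reward
            verifier_batch.append((problem, proof, critique))         # verifier data
    # Alternate the updates: a stronger verifier yields more trustworthy rewards
    # for the generator, whose harder proofs in turn become fresh verifier data.
    verifier.update(verifier_batch)
    generator.update(generator_batch)
```

On this reading, filtering critiques through the meta-verifier before they become rewards is what lifts the reliability of the scoring signal from 0.85 to 0.96 in the reported results.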