Mathematical Reasoning

Qwen delivers again: the world's fastest open-source model arrives, topping 2,000 tokens/second
36Kr · 2025-09-10 12:19
Core Insights
- The K2 Think model, developed by MBZUAI and G42 AI, is touted as the fastest open-source AI model, achieving a speed of over 2000 tokens per second, specifically 2730.4 tokens per second in tests [1][3][9]
- K2 Think is claimed to be the most advanced open-source AI inference system to date, with a focus on mathematical reasoning [2][9]
- The model is based on Qwen 2.5-32B and has been designed to excel in complex problem-solving through innovative training techniques [1][12]

Performance Metrics
- K2 Think has demonstrated consistent performance, maintaining speeds above 2000 tokens per second across various tests, including mathematical problems [3][7]
- The model achieved notable scores in multiple mathematical benchmarks, such as 90.83 in AIME'24 and 81.24 in AIME'25 [9]

Technical Innovations
- The K2 Think team implemented six key innovations to enhance the model's capabilities:
  - Supervised Fine-Tuning (SFT) for structured reasoning [12]
  - Reinforcement Learning with Verifiable Rewards (RLVR) to improve performance in logic and mathematics [12]
  - Planning before reasoning to outline problem-solving strategies [12]
  - Best-of-N sampling to generate multiple answers and select the best (see the sketch after this list) [12]
  - Speculative Decoding for parallel answer generation and validation [12]
  - Hardware acceleration using Cerebras WSE for high-speed token generation [12]

Safety and Security
- The K2 Think team conducted comprehensive safety testing, ensuring robustness against harmful requests and information leaks [12]
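Of the six innovations, Best-of-N sampling is the simplest to illustrate: sample N candidate answers independently and keep the one a verifier scores highest. The sketch below is a minimal, hypothetical rendering of that idea; `generate` and `score` are placeholder stubs, not K2 Think's actual sampler or reward model, which the article does not detail.

```python
import random

def generate(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for one sampled model completion (not K2 Think's real API)."""
    return f"candidate answer (T={temperature}, seed={random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Placeholder verifier/reward model returning a quality score in [0, 1]."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates independently and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

print(best_of_n("Prove that the sum of two even numbers is even.", n=4))
```

The trade-off is straightforward: N-fold more generation compute at inference time in exchange for a higher chance that at least one sampled trajectory is correct.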
14B beats 671B: Microsoft's rStar2-Agent surpasses DeepSeek-R1 in mathematical reasoning
36Kr · 2025-09-02 07:36
Core Insights
- The article discusses the advancements in reasoning capabilities of large language models (LLMs) through test-time scaling and the introduction of agentic reinforcement learning, specifically highlighting the development of the rStar2-Agent model by a Microsoft research team [1][2]

Group 1: Model Development
- Microsoft has developed a powerful agentic reinforcement learning method called rStar2-Agent, which includes a 14-billion-parameter reasoning model that achieves state-of-the-art performance, surpassing even the 671-billion-parameter DeepSeek-R1 model [2][17]
- The rStar2-Agent model demonstrates significant improvements in mathematical reasoning tasks, achieving an accuracy of 80.6% on the AIME24 benchmark, outperforming several leading models [19]

Group 2: Innovations and Techniques
- The research team introduced three key innovations for rStar2-Agent (a hedged sketch of the second follows this summary):
  1. A high-throughput, isolated code environment capable of handling 45,000 concurrent tool calls with an average feedback execution time of 0.3 seconds [10]
  2. A group relative policy optimization method (GRPO-RoC) that combines GRPO with resampling on correct trajectories to address environment noise caused by sparse rewards [12][14]
  3. A training scheme that raises a pre-trained 14-billion-parameter model to advanced reasoning capability with minimal computational resources [15][16]

Group 3: Performance Metrics
- The rStar2-Agent-14B model achieved remarkable results on various reasoning benchmarks: 80.6% accuracy on AIME24, 69.8% on AIME25, and 52.7% on HMMT25, demonstrating consistently high performance across tasks [19]
- It also outperformed DeepSeek-V3 on scientific reasoning benchmarks and showed competitive results in general alignment tests [22]

Group 4: Broader Implications
- Despite being trained primarily on mathematical tasks, the rStar2-Agent model exhibits effective generalization, indicating its potential for broader applications in cognitive reasoning [21]
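The GRPO-RoC idea in point 2 can be sketched in a few lines: oversample rollouts, then downsample asymmetrically, filtering correct trajectories for quality while keeping incorrect ones diverse. The code below is an illustrative reading of that filtering step under stated assumptions; the `Trajectory` fields, the half-and-half split, and the quality signal (tool-error count) are my placeholders, not the paper's exact recipe.

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    reward: float     # sparse: 1.0 if the final answer verifies correct, else 0.0
    tool_errors: int  # failed tool calls along the way (illustrative quality signal)

def resample_on_correct(rollouts: list[Trajectory], group_size: int) -> list[Trajectory]:
    """Downsample an oversampled rollout pool to group_size: filter correct
    trajectories for quality (fewest tool errors), sample incorrect ones at
    random to preserve error diversity -- an illustrative reading of RoC."""
    correct = sorted((t for t in rollouts if t.reward > 0), key=lambda t: t.tool_errors)
    incorrect = [t for t in rollouts if t.reward == 0]
    k = min(len(correct), group_size // 2)   # assumed half/half split
    kept = correct[:k]
    kept += random.sample(incorrect, min(len(incorrect), group_size - len(kept)))
    return kept
```

The intent is that sparse rewards cannot distinguish a clean correct solution from one that stumbled through failed tool calls, so the positive side of the group is curated while the negative side stays representative.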
14B beats 671B! Microsoft's rStar2-Agent surpasses DeepSeek-R1 in mathematical reasoning
机器之心· 2025-09-02 01:27
Core Viewpoint
- The article discusses advances in large language models (LLMs) through the introduction of rStar2-Agent, a powerful agentic reinforcement learning method developed by Microsoft Research, which enhances reasoning capabilities and performance on mathematical reasoning tasks

Group 1: Model Development and Innovations
- The rStar2-Agent model uses test-time scaling to enhance reasoning, allowing longer and smarter thinking processes through the integration of advanced cognitive abilities and tool interactions [1][2]
- The model was trained with a 14-billion-parameter architecture, achieving performance comparable to or exceeding that of much larger models like DeepSeek-R1, which has 671 billion parameters [2][25]
- The training infrastructure developed for rStar2-Agent can handle 45,000 concurrent tool calls with an average feedback execution time of just 0.3 seconds, significantly improving training efficiency (a toy sketch of this pattern follows this summary) [14][13]

Group 2: Training Methodology
- The team introduced a novel training scheme that begins with a non-reasoning supervised fine-tuning (SFT) phase focused on general instruction following and tool usage, which helps avoid overfitting and keeps initial responses short [21][19]
- The GRPO-RoC method was implemented to improve the efficiency of agentic reinforcement learning in the coding environment, allowing better handling of noise and improving the quality of training trajectories [19][18]
- The model reached state-of-the-art mathematical reasoning performance in only 510 reinforcement learning steps, demonstrating exceptional training efficiency [23][25]

Group 3: Performance Metrics
- rStar2-Agent-14B achieved 80.6% accuracy on the AIME24 benchmark, outperforming o3-mini, DeepSeek-R1, and Claude Opus 4.0 by 1.0, 0.8, and 3.6 percentage points respectively [26]
- The model exhibited strong generalization beyond mathematics, despite being trained primarily on mathematical tasks [27]
- rStar2-Agent-14B produced shorter average responses than larger models, indicating a more efficient reasoning process [29]
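The infrastructure claim (45,000 concurrent tool calls with ~0.3 s feedback) implies a heavily asynchronous, sandboxed code-execution service. The article does not show Microsoft's implementation; the toy asyncio sketch below only illustrates the general pattern of many agents submitting code to a bounded worker pool, and every name and constant in it is hypothetical.

```python
import asyncio

MAX_WORKERS = 64  # illustrative bound; the real service scales far higher

async def execute_tool_call(code: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for running `code` in an isolated sandbox and returning feedback."""
    async with sem:                # bound how many executions run at once
        await asyncio.sleep(0.01)  # simulate sandboxed execution latency
        return f"ok: ran {len(code)} bytes"

async def main() -> None:
    sem = asyncio.Semaphore(MAX_WORKERS)
    # Many concurrent tool calls multiplexed onto the bounded worker pool.
    calls = [execute_tool_call(f"print({i})", sem) for i in range(1000)]
    results = await asyncio.gather(*calls)
    print(len(results), "tool calls completed")

if __name__ == "__main__":
    asyncio.run(main())
```

Keeping per-call feedback latency low matters because every RL rollout blocks on its tool results; slow or serialized execution would dominate training time.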
A 4B small model surpasses Claude 4 in mathematical reasoning for the first time; 700 RL training steps approach 235B-level performance | HKU & ByteDance Seed & Fudan
量子位· 2025-07-09 01:18
Core Viewpoint
- The Polaris model, developed in collaboration between the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates mathematical reasoning superior to leading commercial models, scoring 79.4 on AIME25 and 81.2 on AIME24 [1][53]

Group 1: Model Performance and Training
- Polaris uses scaled reinforcement learning (RL) to enhance the mathematical reasoning abilities of a 4B model, surpassing commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5]
- The lightweight Polaris-4B can be deployed on consumer-grade graphics cards [2]
- The research team confirmed that scaling RL can reproduce significant performance improvements on cutting-edge open-source models like Qwen3 [5]

Group 2: Training Data and Methodology
- The success of Polaris hinges on training data and hyperparameter settings tailored to the model being trained [7]
- The team observed a mirrored difficulty distribution in the training data, indicating that the same dataset presents different levels of challenge to models of different capability [8][10]
- A dynamic data-updating strategy was implemented so the training set adapts as the model improves, removing overly easy samples during training (see the sketch after this summary) [13]

Group 3: Sampling Diversity and Temperature Control
- Sampling diversity is crucial for performance, allowing the model to explore broader reasoning paths [14]
- The team found that common temperature settings (0.6 and 1.0) were too low, limiting the model's exploration [27]
- A three-zone temperature framework was established: a Robust Generation Zone, a Controlled Exploration Zone, and a Performance Collapse Zone, guiding the selection of optimal sampling temperatures [28]

Group 4: Long-Context Training and Performance
- The model's pre-training context length was limited to 32K tokens, but during RL training it was extended to 52K, addressing the challenge of long-context training [37]
- Length extrapolation techniques improved the accuracy of long text generation from 26% to over 50% [41]
- A multi-stage training approach was adopted, gradually increasing context window lengths to strengthen reasoning capabilities [48]

Group 5: Evaluation and Results
- Polaris achieved the highest performance in most evaluations, demonstrating its effectiveness on mathematical reasoning tasks [53]
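The dynamic data-updating strategy in Group 2 can be illustrated with a simple pass-rate filter: periodically re-estimate how often the current checkpoint solves each problem, and drop those it now solves almost always. The threshold and the pass-rate estimate below are assumptions for illustration; the article does not give Polaris's exact criterion.

```python
def update_training_pool(pool, model_pass_rate, easy_threshold=0.9):
    """Remove problems the current model already solves almost always.

    pool:            list of problem identifiers
    model_pass_rate: dict mapping problem id -> empirical solve rate of the
                     current checkpoint (e.g., measured over 8 sampled rollouts)
    easy_threshold:  illustrative cutoff; problems above it are dropped
    """
    return [p for p in pool if model_pass_rate.get(p, 0.0) <= easy_threshold]

# Toy usage: "a" has become too easy for the model, so only "b" and "c" remain.
pool = ["a", "b", "c"]
rates = {"a": 0.95, "b": 0.40, "c": 0.10}
print(update_training_pool(pool, rates))  # ['b', 'c']
```

The rationale matches the mirrored-difficulty observation: a fixed dataset grows effectively easier as the model improves, so pruning solved problems keeps the RL reward signal informative.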
139 points on the gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article discusses the performance of various AI models on the 2025 gaokao mathematics exam, highlighting the competitive landscape of AI model capabilities and focusing on Xiaomi's MiMo-VL model, which performed impressively despite its smaller parameter count [2][4][20]

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 with 144 points [2]
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and only one point below OpenAI's o3 [4]
- MiMo-VL outperformed Qwen2.5-VL-7B by 56 points, showcasing superior capability at the same parameter count [5]

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated on uploaded screenshots of the questions, while the other models received text input [6]
- The evaluation comprised 14 objective questions (totaling 73 points) and 5 free-response questions (totaling 77 points) [7]

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 of 40 points on the single-choice questions and achieved full marks on the multiple-choice and fill-in-the-blank questions (a quick arithmetic check follows this summary) [8][10][11]
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12]

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of its first inference-focused large model, MiMo, which has shown significant improvements in reasoning capabilities [14]
- MiMo-VL, the successor to MiMo-7B, has demonstrated substantial advances on multimodal reasoning tasks, outperforming larger models such as Qwen2.5-VL-72B [20]
- The model's performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29]

Group 5: Open Source and Accessibility
- MiMo-VL-7B's technical report, model weights, and evaluation framework have been open-sourced, promoting transparency and accessibility in AI development [32]
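The scoring breakdown is easy to sanity-check against the reported total: 35 of 40 on single-choice, full marks on the remaining 33 objective points (objective questions total 73, so multiple-choice plus fill-in-the-blank account for 73 − 40 = 33), and 71 of 77 on free-response.

```python
single_choice = 35           # out of 40
other_objective = 73 - 40    # multiple-choice + fill-in-the-blank, full marks = 33
free_response = 71           # out of 77

total = single_choice + other_objective + free_response
print(total)                       # 139, matching MiMo-VL's reported score
print(40 + other_objective + 77)   # 150, the exam's full-mark total
```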
32B, locally deployable! Alibaba open-sources its latest multimodal model: vision-language focused, with strong mathematical reasoning
量子位· 2025-03-25 00:59
Core Viewpoint
- The article discusses the release of the Qwen2.5-VL-32B-Instruct model by Alibaba's Tongyi Qwen team, highlighting its advances in performance and capability over previous models and competitors

Group 1: Model Specifications
- The Qwen2.5-VL family previously included three sizes: 3B, 7B, and 72B; the new 32B version balances size and performance for local operation [2][3]
- The 32B version has undergone reinforcement learning optimization, achieving state-of-the-art (SOTA) pure-text performance and even surpassing the 72B model on several benchmarks [4]

Group 2: Performance Improvements
- Qwen2.5-VL-32B demonstrates enhanced mathematical reasoning, image analysis, content recognition, and visual logic deduction, producing clearer and more human-like responses [5]
- For example, it can analyze a photo of a traffic sign and accurately calculate travel time from the posted distance and speed, showing its reasoning process along the way [5][6]

Group 3: Open Source and Community Engagement
- The model has been open-sourced and is available for testing on platforms like Hugging Face, allowing users to experience its capabilities directly (a hedged loading sketch follows this summary) [14][15]
- Community uptake has been rapid, with users already running the model and discussing it in various forums, indicating strong interest in its applications [16][17]
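For readers who want to try the open weights locally, the sketch below shows a minimal text-only query through Hugging Face transformers, mirroring the article's traffic-sign arithmetic example (image input omitted for brevity). The class and repo names follow the usage Qwen publishes for this model family at the time of writing, but treat this as a sketch and verify against the model card, since the interface may change.

```python
# Minimal local-inference sketch for Qwen2.5-VL-32B-Instruct (assumed interface).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # needs a large-VRAM GPU
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-only question in the spirit of the article's traffic-sign example.
messages = [{"role": "user", "content": [
    {"type": "text",
     "text": "A sign says the exit is 75 km away. At 100 km/h, how long until I arrive?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

For the actual use case described in the article, an image of the sign would be added to the `content` list and passed to the processor alongside the text.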