Mathematical Reasoning

Qwen delivers again: the world's fastest open-source model arrives, topping 2,000 tokens/second
36Kr · 2025-09-10 12:19
Core Insights
- The K2 Think model, developed by MBZUAI and G42 AI, is touted as the fastest open-source AI model, achieving a speed of over 2000 tokens per second, specifically 2730.4 tokens per second in tests [1][3][9]
- K2 Think is claimed to be the most advanced open-source AI inference system to date, with a focus on mathematical reasoning [2][9]
- The model is based on Qwen 2.5-32B and has been designed to excel in complex problem-solving through innovative training techniques [1][12]

Performance Metrics
- K2 Think has demonstrated consistent performance, maintaining speeds above 2000 tokens per second across various tests, including mathematical problems [3][7]
- The model achieved notable scores in multiple mathematical benchmarks, such as 90.83 in AIME'24 and 81.24 in AIME'25 [9]

Technical Innovations
- The K2 Think team implemented six key innovations to enhance the model's capabilities:
  - Supervised Fine-Tuning (SFT) for structured reasoning [12]
  - Reinforcement Learning with Verifiable Rewards (RLVR) to improve performance in logic and mathematics [12]
  - Planning before reasoning to outline problem-solving strategies [12]
  - Best-of-N sampling to generate multiple answers and select the best (see the sketch after this list) [12]
  - Speculative Decoding for parallel answer generation and validation [12]
  - Hardware acceleration using Cerebras WSE for high-speed token generation [12]

Safety and Security
- The K2 Think team conducted comprehensive safety testing, ensuring robustness against harmful requests and information leaks [12]
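Of the six innovations, Best-of-N sampling is the simplest to illustrate: sample N candidate answers independently and keep the one a verifier scores highest. The sketch below is a minimal, hypothetical rendering of that idea; `generate` and `score` are placeholder stubs, not K2 Think's actual sampler or reward model, which the article does not detail.

```python
import random

def generate(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for one sampled model completion (not K2 Think's real API)."""
    return f"candidate answer (T={temperature}, seed={random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Placeholder verifier/reward model returning a quality score in [0, 1]."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates independently and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

print(best_of_n("Prove that the sum of two even numbers is even.", n=4))
```

The trade-off is straightforward: N-fold more generation compute at inference time in exchange for a higher chance that at least one sampled trajectory is correct.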
14B beats 671B: Microsoft's rStar2-Agent surpasses DeepSeek-R1 in mathematical reasoning
36Kr · 2025-09-02 07:36
Core Insights
- The article discusses the advancements in reasoning capabilities of large language models (LLMs) through test-time scaling and the introduction of agentic reinforcement learning, specifically highlighting the development of the rStar2-Agent model by a Microsoft research team [1][2]

Group 1: Model Development
- Microsoft has developed a powerful agentic reinforcement learning method called rStar2-Agent, which includes a 14-billion-parameter reasoning model that achieves state-of-the-art performance, surpassing even the 671-billion-parameter DeepSeek-R1 model [2][17]
- The rStar2-Agent model demonstrates significant improvements in mathematical reasoning tasks, achieving an accuracy of 80.6% on the AIME24 benchmark, outperforming several leading models [19]

Group 2: Innovations and Techniques
- The research team introduced three key innovations for rStar2-Agent (a hedged sketch of the second follows this summary):
  1. A high-throughput, isolated code environment capable of handling 45,000 concurrent tool calls with an average feedback execution time of 0.3 seconds [10]
  2. A group relative policy optimization method (GRPO-RoC) that combines GRPO with resampling on correct trajectories to address environment noise caused by sparse rewards [12][14]
  3. A training scheme that raises a pre-trained 14-billion-parameter model to advanced reasoning capability with minimal computational resources [15][16]

Group 3: Performance Metrics
- The rStar2-Agent-14B model achieved remarkable results on various reasoning benchmarks: 80.6% accuracy on AIME24, 69.8% on AIME25, and 52.7% on HMMT25, demonstrating consistently high performance across tasks [19]
- It also outperformed DeepSeek-V3 on scientific reasoning benchmarks and showed competitive results in general alignment tests [22]

Group 4: Broader Implications
- Despite being trained primarily on mathematical tasks, the rStar2-Agent model exhibits effective generalization, indicating its potential for broader applications in cognitive reasoning [21]
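The GRPO-RoC idea in point 2 can be sketched in a few lines: oversample rollouts, then downsample asymmetrically, filtering correct trajectories for quality while keeping incorrect ones diverse. The code below is an illustrative reading of that filtering step under stated assumptions; the `Trajectory` fields, the half-and-half split, and the quality signal (tool-error count) are my placeholders, not the paper's exact recipe.

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    reward: float     # sparse: 1.0 if the final answer verifies correct, else 0.0
    tool_errors: int  # failed tool calls along the way (illustrative quality signal)

def resample_on_correct(rollouts: list[Trajectory], group_size: int) -> list[Trajectory]:
    """Downsample an oversampled rollout pool to group_size: filter correct
    trajectories for quality (fewest tool errors), sample incorrect ones at
    random to preserve error diversity -- an illustrative reading of RoC."""
    correct = sorted((t for t in rollouts if t.reward > 0), key=lambda t: t.tool_errors)
    incorrect = [t for t in rollouts if t.reward == 0]
    k = min(len(correct), group_size // 2)   # assumed half/half split
    kept = correct[:k]
    kept += random.sample(incorrect, min(len(incorrect), group_size - len(kept)))
    return kept
```

The intent is that sparse rewards cannot distinguish a clean correct solution from one that stumbled through failed tool calls, so the positive side of the group is curated while the negative side stays representative.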
14B beats 671B! Microsoft's rStar2-Agent surpasses DeepSeek-R1 in mathematical reasoning
机器之心· 2025-09-02 01:27
Core Viewpoint
- The article discusses advances in large language models (LLMs) through the introduction of rStar2-Agent, a powerful agentic reinforcement learning method developed by Microsoft Research, which enhances reasoning capabilities and performance on mathematical reasoning tasks

Group 1: Model Development and Innovations
- The rStar2-Agent model uses test-time scaling to enhance reasoning, allowing longer and smarter thinking processes through the integration of advanced cognitive abilities and tool interactions [1][2]
- The model was trained with a 14-billion-parameter architecture, achieving performance comparable to or exceeding that of much larger models like DeepSeek-R1, which has 671 billion parameters [2][25]
- The training infrastructure developed for rStar2-Agent can handle 45,000 concurrent tool calls with an average feedback execution time of just 0.3 seconds, significantly improving training efficiency (a toy sketch of this pattern follows this summary) [14][13]

Group 2: Training Methodology
- The team introduced a novel training scheme that begins with a non-reasoning supervised fine-tuning (SFT) phase focused on general instruction following and tool usage, which helps avoid overfitting and keeps initial responses short [21][19]
- The GRPO-RoC method was implemented to improve the efficiency of agentic reinforcement learning in the coding environment, allowing better handling of noise and improving the quality of training trajectories [19][18]
- The model reached state-of-the-art mathematical reasoning performance in only 510 reinforcement learning steps, demonstrating exceptional training efficiency [23][25]

Group 3: Performance Metrics
- rStar2-Agent-14B achieved 80.6% accuracy on the AIME24 benchmark, outperforming o3-mini, DeepSeek-R1, and Claude Opus 4.0 by 1.0, 0.8, and 3.6 percentage points respectively [26]
- The model exhibited strong generalization beyond mathematics, despite being trained primarily on mathematical tasks [27]
- rStar2-Agent-14B produced shorter average responses than larger models, indicating a more efficient reasoning process [29]
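The infrastructure claim (45,000 concurrent tool calls with ~0.3 s feedback) implies a heavily asynchronous, sandboxed code-execution service. The article does not show Microsoft's implementation; the toy asyncio sketch below only illustrates the general pattern of many agents submitting code to a bounded worker pool, and every name and constant in it is hypothetical.

```python
import asyncio

MAX_WORKERS = 64  # illustrative bound; the real service scales far higher

async def execute_tool_call(code: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for running `code` in an isolated sandbox and returning feedback."""
    async with sem:                # bound how many executions run at once
        await asyncio.sleep(0.01)  # simulate sandboxed execution latency
        return f"ok: ran {len(code)} bytes"

async def main() -> None:
    sem = asyncio.Semaphore(MAX_WORKERS)
    # Many concurrent tool calls multiplexed onto the bounded worker pool.
    calls = [execute_tool_call(f"print({i})", sem) for i in range(1000)]
    results = await asyncio.gather(*calls)
    print(len(results), "tool calls completed")

if __name__ == "__main__":
    asyncio.run(main())
```

Keeping per-call feedback latency low matters because every RL rollout blocks on its tool results; slow or serialized execution would dominate training time.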
A 4B small model surpasses Claude 4 in mathematical reasoning for the first time; 700 RL training steps approach 235B-level performance | HKU & ByteDance Seed & Fudan
量子位· 2025-07-09 01:18
Core Viewpoint
- The Polaris model, developed in collaboration between the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates mathematical reasoning superior to leading commercial models, scoring 79.4 on AIME25 and 81.2 on AIME24 [1][53]

Group 1: Model Performance and Training
- Polaris uses scaled reinforcement learning (RL) to enhance the mathematical reasoning abilities of a 4B model, surpassing commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5]
- The lightweight Polaris-4B can be deployed on consumer-grade graphics cards [2]
- The research team confirmed that scaling RL can reproduce significant performance improvements on cutting-edge open-source models like Qwen3 [5]

Group 2: Training Data and Methodology
- The success of Polaris hinges on training data and hyperparameter settings tailored to the model being trained [7]
- The team observed a mirrored difficulty distribution in the training data, indicating that the same dataset presents different levels of challenge to models of different capability [8][10]
- A dynamic data-updating strategy was implemented so the training set adapts as the model improves, removing overly easy samples during training (see the sketch after this summary) [13]

Group 3: Sampling Diversity and Temperature Control
- Sampling diversity is crucial for performance, allowing the model to explore broader reasoning paths [14]
- The team found that common temperature settings (0.6 and 1.0) were too low, limiting the model's exploration [27]
- A three-zone temperature framework was established: a Robust Generation Zone, a Controlled Exploration Zone, and a Performance Collapse Zone, guiding the selection of optimal sampling temperatures [28]

Group 4: Long-Context Training and Performance
- The model's pre-training context length was limited to 32K tokens, but during RL training it was extended to 52K, addressing the challenge of long-context training [37]
- Length extrapolation techniques improved the accuracy of long text generation from 26% to over 50% [41]
- A multi-stage training approach was adopted, gradually increasing context window lengths to strengthen reasoning capabilities [48]

Group 5: Evaluation and Results
- Polaris achieved the highest performance in most evaluations, demonstrating its effectiveness on mathematical reasoning tasks [53]
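The dynamic data-updating strategy in Group 2 can be illustrated with a simple pass-rate filter: periodically re-estimate how often the current checkpoint solves each problem, and drop those it now solves almost always. The threshold and the pass-rate estimate below are assumptions for illustration; the article does not give Polaris's exact criterion.

```python
def update_training_pool(pool, model_pass_rate, easy_threshold=0.9):
    """Remove problems the current model already solves almost always.

    pool:            list of problem identifiers
    model_pass_rate: dict mapping problem id -> empirical solve rate of the
                     current checkpoint (e.g., measured over 8 sampled rollouts)
    easy_threshold:  illustrative cutoff; problems above it are dropped
    """
    return [p for p in pool if model_pass_rate.get(p, 0.0) <= easy_threshold]

# Toy usage: "a" has become too easy for the model, so only "b" and "c" remain.
pool = ["a", "b", "c"]
rates = {"a": 0.95, "b": 0.40, "c": 0.10}
print(update_training_pool(pool, rates))  # ['b', 'c']
```

The rationale matches the mirrored-difficulty observation: a fixed dataset grows effectively easier as the model improves, so pruning solved problems keeps the RL reward signal informative.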
139 points on the gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article discusses the performance of various AI models on the 2025 gaokao mathematics exam, highlighting the competitive landscape of AI model capabilities and focusing on Xiaomi's MiMo-VL model, which performed impressively despite its smaller parameter count [2][4][20]

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 with 144 points [2]
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and only one point below OpenAI's o3 [4]
- MiMo-VL outperformed Qwen2.5-VL-7B by 56 points, showcasing superior capability at the same parameter count [5]

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated on uploaded screenshots of the questions, while the other models received text input [6]
- The evaluation comprised 14 objective questions (totaling 73 points) and 5 free-response questions (totaling 77 points) [7]

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 of 40 points on the single-choice questions and achieved full marks on the multiple-choice and fill-in-the-blank questions (a quick arithmetic check follows this summary) [8][10][11]
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12]

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of its first inference-focused large model, MiMo, which has shown significant improvements in reasoning capabilities [14]
- MiMo-VL, the successor to MiMo-7B, has demonstrated substantial advances on multimodal reasoning tasks, outperforming larger models such as Qwen2.5-VL-72B [20]
- The model's performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29]

Group 5: Open Source and Accessibility
- MiMo-VL-7B's technical report, model weights, and evaluation framework have been open-sourced, promoting transparency and accessibility in AI development [32]
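The scoring breakdown is easy to sanity-check against the reported total: 35 of 40 on single-choice, full marks on the remaining 33 objective points (objective questions total 73, so multiple-choice plus fill-in-the-blank account for 73 − 40 = 33), and 71 of 77 on free-response.

```python
single_choice = 35           # out of 40
other_objective = 73 - 40    # multiple-choice + fill-in-the-blank, full marks = 33
free_response = 71           # out of 77

total = single_choice + other_objective + free_response
print(total)                       # 139, matching MiMo-VL's reported score
print(40 + other_objective + 77)   # 150, the exam's full-mark total
```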
32B, locally deployable! Alibaba open-sources its latest multimodal model: vision-language focused, with strong mathematical reasoning
量子位· 2025-03-25 00:59
Core Viewpoint
- The article discusses the release of the Qwen2.5-VL-32B-Instruct model by Alibaba's Tongyi Qwen team, highlighting its advances in performance and capability over previous models and competitors

Group 1: Model Specifications
- The Qwen2.5-VL family previously included three sizes: 3B, 7B, and 72B; the new 32B version balances size and performance for local operation [2][3]
- The 32B version has undergone reinforcement learning optimization, achieving state-of-the-art (SOTA) pure-text performance and even surpassing the 72B model on several benchmarks [4]

Group 2: Performance Improvements
- Qwen2.5-VL-32B demonstrates enhanced mathematical reasoning, image analysis, content recognition, and visual logic deduction, producing clearer and more human-like responses [5]
- For example, it can analyze a photo of a traffic sign and accurately calculate travel time from the posted distance and speed, showing its reasoning process along the way [5][6]

Group 3: Open Source and Community Engagement
- The model has been open-sourced and is available for testing on platforms like Hugging Face, allowing users to experience its capabilities directly (a hedged loading sketch follows this summary) [14][15]
- Community uptake has been rapid, with users already running the model and discussing it in various forums, indicating strong interest in its applications [16][17]
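For readers who want to try the open weights locally, the sketch below shows a minimal text-only query through Hugging Face transformers, mirroring the article's traffic-sign arithmetic example (image input omitted for brevity). The class and repo names follow the usage Qwen publishes for this model family at the time of writing, but treat this as a sketch and verify against the model card, since the interface may change.

```python
# Minimal local-inference sketch for Qwen2.5-VL-32B-Instruct (assumed interface).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # needs a large-VRAM GPU
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-only question in the spirit of the article's traffic-sign example.
messages = [{"role": "user", "content": [
    {"type": "text",
     "text": "A sign says the exit is 75 km away. At 100 km/h, how long until I arrive?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

For the actual use case described in the article, an image of the sign would be added to the `content` list and passed to the processor alongside the text.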