Mathematical Reasoning

A 4B small model surpasses Claude 4 in mathematical reasoning for the first time; 700-step RL training approaches 235B-level performance | HKU & ByteDance Seed & Fudan
量子位· 2025-07-09 01:18
Core Viewpoint
- The Polaris model, developed in a collaboration between the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates superior mathematical reasoning compared to leading commercial models, scoring 79.4 on AIME25 and 81.2 on AIME24 [1][53].

Group 1: Model Performance and Training
- Polaris uses Scaling Reinforcement Learning (RL) to enhance the mathematical reasoning abilities of the 4B model, surpassing commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5].
- The lightweight Polaris-4B can be deployed on consumer-grade graphics cards [2].
- The research team confirmed that Scaling RL can replicate significant performance improvements in cutting-edge open-source models such as Qwen3 [5].

Group 2: Training Data and Methodology
- The success of Polaris hinges on training data and hyperparameter settings tailored to the model being trained [7].
- The team observed a mirrored difficulty distribution in the training data: the same dataset presents varying challenges to models of different capabilities [8][10].
- A dynamic updating strategy for training data lets the data pool adapt as the model improves, removing overly easy samples during training [13] (a minimal sketch of this filtering step follows after this summary).

Group 3: Sampling Diversity and Temperature Control
- Diversity in sampling is crucial for model performance, allowing exploration of broader reasoning paths [14].
- The team found that common temperature settings (0.6 and 1.0) were too low, limiting the model's exploration [27].
- A three-zone temperature framework was established (Robust Generation Zone, Controlled Exploration Zone, and Performance Collapse Zone) to guide the selection of optimal sampling temperatures [28].

Group 4: Long-Context Training and Performance
- The model's pre-training context length was limited to 32K, but during RL training it was extended to 52K, addressing the challenge of long-context training [37].
- Length extrapolation techniques improved the accuracy of long text generation from 26% to over 50% [41].
- A multi-stage training approach was adopted, gradually increasing the context window length to enhance reasoning capabilities [48].

Group 5: Evaluation and Results
- Polaris achieved the highest performance in most evaluations, demonstrating its effectiveness on mathematical reasoning tasks [53].
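The dynamic data-updating idea described in Group 2 can be illustrated with a minimal sketch. The snippet below is not the Polaris implementation; it only assumes that each training prompt's rollout pass rate is tracked during RL and that prompts the model already solves almost every time are retired from later stages. The function names and the threshold value are hypothetical.

```python
# Minimal sketch of dynamic difficulty filtering during RL training.
# Assumptions (not from the Polaris report): pass rates are estimated from
# sampled rollouts, and a fixed threshold decides when a prompt has become
# "too easy" and is dropped from the training pool.

from typing import Callable, Dict, List

def estimate_pass_rates(
    pool: List[str],
    rollout_fn: Callable[[str, int], List[bool]],  # returns correctness of k rollouts
    k: int = 8,
) -> Dict[str, float]:
    """Estimate each prompt's pass rate from k sampled rollouts."""
    rates = {}
    for prompt in pool:
        outcomes = rollout_fn(prompt, k)
        rates[prompt] = sum(outcomes) / max(len(outcomes), 1)
    return rates

def update_training_pool(
    pool: List[str],
    pass_rate: Dict[str, float],
    easy_threshold: float = 0.9,  # hypothetical cutoff for "too easy"
) -> List[str]:
    """Drop prompts the model now solves almost every time."""
    return [p for p in pool if pass_rate.get(p, 0.0) < easy_threshold]

# Inside an RL loop (pseudocode level): after each stage,
#   rates = estimate_pass_rates(pool, rollout_fn)
#   pool = update_training_pool(pool, rates)
# so the pool keeps a difficulty distribution matched to the improving model.
```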
139 points on the gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article reviews how various AI models performed on the 2025 gaokao mathematics exam, highlighting the competitive landscape of model capabilities and, in particular, Xiaomi's MiMo-VL model, which performed impressively despite its smaller parameter count [2][4][20].

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 with 144 points [2].
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and only one point below OpenAI's o3 [4].
- MiMo-VL outperformed Qwen2.5-VL-7B by 56 points despite having the same parameter count [5].

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated from uploaded screenshots of the questions, while the other models received text input [6].
- The exam comprised 14 objective questions (73 points in total) and 5 free-response questions (77 points in total) [7].

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 out of 40 on the single-choice questions and earned full marks on the multiple-choice and fill-in-the-blank questions [8][10][11].
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12] (the totals are checked in the sketch after this summary).

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of its first reasoning-focused large model, MiMo, which shows significant improvements in reasoning capability [14].
- MiMo-VL, the successor to MiMo-7B, demonstrates substantial advances on multimodal reasoning tasks, outperforming larger models such as Qwen2.5-VL-72B [20].
- The model's performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29].

Group 5: Open Source and Accessibility
- MiMo-VL-7B's technical report, model weights, and evaluation framework have been open-sourced, promoting transparency and accessibility in AI development [32].
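As a quick consistency check on the scoring breakdown in Groups 2 and 3, the figures above add up to the headline score. This is a minimal sketch using only numbers stated in the summary (73 objective points, 77 free-response points, 35/40 on single-choice with full marks on the other objective questions, and 71 on the free-response section).

```python
# Consistency check of MiMo-VL's reported gaokao math score,
# using only figures stated in the summary above.

objective_total = 73          # 14 objective questions
free_response_total = 77      # 5 free-response questions
assert objective_total + free_response_total == 150  # full exam

single_choice_lost = 40 - 35  # 35/40 on the single-choice questions
objective_score = objective_total - single_choice_lost  # full marks elsewhere -> 68
free_response_score = 71      # reported free-response score

print(objective_score + free_response_score)  # 139, matching the headline score
```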
32B, deployable locally! Alibaba open-sources its latest multimodal model: focused on vision-language, with strong mathematical reasoning
量子位· 2025-03-25 00:59
西风 from Aofei Temple
量子位 | WeChat official account QbitAI

On the same night DeepSeek-V3 was updated, Alibaba's Tongyi Qianwen Qwen pulled off yet another dream tie-in: the release of Qwen2.5-VL-32B-Instruct.

The previously open-sourced Qwen2.5-VL vision-language family came in three sizes: 3B, 7B, and 72B. The new 32B version further balances size and performance and can be run locally. It has also been optimized with reinforcement learning, with notable improvements in three areas:

- responses that better match human preferences;
- stronger mathematical reasoning;
- higher accuracy and finer-grained analysis in tasks such as image parsing, content recognition, and visual logical deduction.

Compared with recently open-sourced models such as Mistral-Small-3.1-24B and Gemma-3-27B-IT, Qwen2.5-VL-32B also achieves SOTA text-only performance at its scale, and on several benchmarks it even surpasses the 72B model.

For example, given a photo of a traffic sign, Qwen2.5-VL-32B can perform fine-grained image understanding and reasoning like this:

"I'm driving a large truck on this road, and it is now 12 o'clock. Can I reach a place 110 kilometers away before 13:00?"

Qwen2.5-VL-32B first analyzes the time, the distance, and the truck's speed limit, then works out the correct answer step by step in a clear, organized way. ...
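The traffic-sign example boils down to a time-distance-speed check. The sketch below is not Qwen's output, and the speed limit is a placeholder, since in the demo the actual limit is read from the sign in the photo; it only shows the arithmetic the model is described as performing.

```python
# Sketch of the time/distance/speed-limit check described above.
# The speed-limit values are hypothetical placeholders; in the demo the limit
# comes from the traffic sign in the photo, which is not reproduced here.

def can_arrive_in_time(distance_km: float, speed_limit_kmh: float, hours_available: float) -> bool:
    """Return True if the trip is feasible without exceeding the speed limit."""
    min_hours_needed = distance_km / speed_limit_kmh
    return min_hours_needed <= hours_available

# 110 km to cover between 12:00 and 13:00, i.e. 1 hour of driving time.
print(can_arrive_in_time(110, speed_limit_kmh=100, hours_available=1.0))  # False: needs 1.1 h
print(can_arrive_in_time(110, speed_limit_kmh=120, hours_available=1.0))  # True: needs ~0.92 h
```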