Core Viewpoint
- Polaris, developed jointly by the University of Hong Kong's NLP team, ByteDance Seed, and Fudan University, demonstrates mathematical reasoning superior to leading commercial models, scoring 79.4 on AIME25 and 81.2 on AIME24 [1][53].

Group 1: Model Performance and Training
- Polaris applies scaling reinforcement learning (RL) to strengthen the mathematical reasoning of a 4B model, surpassing commercial models such as Seed-1.5-thinking and Claude-4-Opus [1][5].
- Because Polaris-4B is lightweight, it can be deployed on consumer-grade graphics cards [2].
- The team confirmed that scaling RL can reproduce substantial performance gains on cutting-edge open-source models such as Qwen3 [5].

Group 2: Training Data and Methodology
- Polaris's success hinges on training data and hyperparameter settings tailored to the model being trained [7].
- The team observed a mirrored difficulty distribution in the training data: the same dataset poses very different challenges to models of different capability [8][10].
- The training data is updated dynamically as the model improves, so samples that have become too easy are removed during training (a minimal filtering sketch follows this summary) [13].

Group 3: Sampling Diversity and Temperature Control
- Sampling diversity is crucial for performance, letting the model explore broader reasoning paths [14].
- Common temperature settings (0.6 and 1.0) proved too low, limiting the model's exploration [27].
- A three-zone temperature framework guides the choice of sampling temperature: the Robust Generation Zone, the Controlled Exploration Zone, and the Performance Collapse Zone (see the temperature-sweep sketch below) [28].

Group 4: Long Context Training and Performance
- Although the pre-training context length was limited to 32K, the context was extended to 52K during RL training, addressing the challenge of long-context training [37].
- Length extrapolation techniques raised the accuracy of long text generation from 26% to over 50% (a RoPE-rescaling sketch appears below) [41].
- A multi-stage training approach gradually increases the context window length to strengthen reasoning (see the stage-schedule sketch below) [48].

Group 5: Evaluation and Results
- Polaris achieved the highest scores in most evaluations, demonstrating its effectiveness on mathematical reasoning tasks [53].
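The dynamic data refresh described in Group 2 can be pictured as periodic pass-rate filtering: re-score each problem under the current policy and drop the ones it almost always solves. The sketch below is a minimal Python illustration of that idea only; `Problem`, `estimate_pass_rate`, the mock model, and the 0.9 threshold are hypothetical placeholders, not the Polaris recipe.

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    answer: str
    pass_rate: float = 0.0  # fraction of sampled rollouts that solve it

def estimate_pass_rate(problem: Problem, sample_fn, n_rollouts: int = 8) -> float:
    """Estimate how often the current policy solves the problem.
    `sample_fn(prompt) -> str` stands in for rolling out the model."""
    solved = sum(sample_fn(problem.prompt) == problem.answer for _ in range(n_rollouts))
    return solved / n_rollouts

def refresh_pool(pool: list[Problem], sample_fn, easy_threshold: float = 0.9) -> list[Problem]:
    """Drop problems the policy already solves almost every time, so the
    remaining pool keeps matching the model's current ability."""
    for p in pool:
        p.pass_rate = estimate_pass_rate(p, sample_fn)
    return [p for p in pool if p.pass_rate < easy_threshold]

if __name__ == "__main__":
    # Toy usage with a mock "model" that answers randomly.
    pool = [Problem(prompt=f"q{i}", answer="42") for i in range(100)]
    mock_model = lambda prompt: random.choice(["42", "7"])
    pool = refresh_pool(pool, mock_model)
    print(f"{len(pool)} problems remain after removing easy ones")
```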
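Group 3's three-zone framework can be operationalized as a temperature sweep: measure accuracy and answer diversity at each candidate temperature, flag where accuracy collapses, and sample from the hottest temperature that still holds up. The zone thresholds and sweep numbers below are illustrative assumptions, not the report's definitions.

```python
def classify_temperature(accuracy: float, diversity: float, base_accuracy: float,
                         collapse_drop: float = 0.15, diversity_gain: float = 0.2) -> str:
    """Assign a sampling temperature to one of the three zones, using
    illustrative thresholds (assumptions, not the published criteria)."""
    if accuracy < base_accuracy - collapse_drop:
        return "Performance Collapse Zone"    # accuracy has fallen off sharply
    if diversity >= diversity_gain:
        return "Controlled Exploration Zone"  # diverse reasoning paths, accuracy still holds
    return "Robust Generation Zone"           # safe but low-exploration sampling

def pick_temperature(sweep: dict[float, tuple[float, float]]) -> float:
    """Pick the hottest temperature outside the collapse zone.
    `sweep` maps temperature -> (accuracy, diversity)."""
    base_accuracy = max(acc for acc, _ in sweep.values())
    usable = [t for t, (acc, div) in sweep.items()
              if classify_temperature(acc, div, base_accuracy) != "Performance Collapse Zone"]
    return max(usable)

# Example sweep with made-up numbers: temperature -> (accuracy, answer diversity).
sweep = {0.6: (0.62, 0.10), 1.0: (0.61, 0.18), 1.3: (0.60, 0.35), 1.6: (0.41, 0.55)}
print(pick_temperature(sweep))  # -> 1.3 under these toy numbers
```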
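Group 4 mentions length extrapolation for generating beyond the 32K pre-training window. One common family of such techniques rescales RoPE frequencies (NTK/YaRN-style) so longer positions fall within the range seen during pretraining; the sketch below shows a generic NTK-aware rescaling and is not necessarily the exact method Polaris used.

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def ntk_scaled_inv_freq(dim: int, base: float = 10000.0,
                        train_len: int = 32_768, target_len: int = 52_000) -> np.ndarray:
    """NTK-aware scaling: enlarge the RoPE base so positions up to
    `target_len` map into the frequency range seen during pretraining.
    The exponent follows the common NTK heuristic; the constants are an
    assumption, not Polaris's recipe."""
    scale = target_len / train_len
    new_base = base * scale ** (dim / (dim - 2))
    return 1.0 / (new_base ** (np.arange(0, dim, 2) / dim))

# The scaled frequencies rotate more slowly, so a 52K position "looks like"
# a position the model already saw inside its 32K training window.
print(rope_inv_freq(128)[:3])
print(ntk_scaled_inv_freq(128)[:3])
```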
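Group 4's multi-stage approach can be expressed as a simple schedule of stages with growing context windows. The stage count, context lengths, temperatures, and step budgets below are placeholders for illustration, not Polaris's published hyperparameters.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    max_context: int    # context window for this RL stage
    temperature: float  # sampling temperature, possibly raised as training stabilizes
    steps: int          # RL updates budgeted for this stage

# Illustrative schedule only (values are assumptions).
SCHEDULE = [
    Stage(max_context=16_384, temperature=1.2, steps=200),
    Stage(max_context=32_768, temperature=1.3, steps=300),
    Stage(max_context=52_000, temperature=1.4, steps=200),
]

def run_training(schedule: list[Stage]) -> None:
    for i, stage in enumerate(schedule, 1):
        # A real implementation would call the RL loop (e.g. GRPO/PPO) here.
        print(f"stage {i}: ctx={stage.max_context}, T={stage.temperature}, steps={stage.steps}")

run_training(SCHEDULE)
```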
4B small model surpasses Claude 4 in mathematical reasoning for the first time; 700 steps of RL training approach 235B-level performance | HKU & ByteDance Seed & Fudan
量子位·2025-07-09 01:18