Domestic Chinese AI takes International Physics Olympiad gold, winning 12 golds and 1 silver across 13 top competitions — and it's open source
量子位· 2025-11-22 03:07
Core Insights
- The article discusses the P1 model family developed by the Shanghai Artificial Intelligence Laboratory, particularly the P1-235B-A22B model, which has excelled in various physics competitions, including the International Physics Olympiad (IPhO) 2025, where it became the first open-source model to reach the gold-medal threshold [1][3][37].

Group 1: Model Performance
- P1-235B-A22B scored 21.2 out of 30 on the IPhO 2025 theoretical exam, ranking third overall, just behind Gemini-2.5-Pro and GPT-5 [3][37].
- On the HiPhO benchmark, which spans 13 top physics competitions, P1-235B-A22B's average score rose from 35.9 to 38.4 after integrating the PhysicsMinions framework, surpassing Gemini-2.5-Pro (37.7) and GPT-5 (37.4) [5][38].
- In the Chinese Physics Olympiad (CPhO) 2025, P1-235B-A22B scored 227 out of 320, well above the human gold medalist's score of 199 [6][41].

Group 2: Training Methodology
- The model was trained with a multi-stage reinforcement learning process that formalizes physics problem-solving as a sequential decision-making task [19][20].
- A high-quality dataset of 5,065 physics problems was constructed, comprising 4,126 from Olympiads and 939 from textbooks, covering five major fields and 25 subfields [11][13].
- Training used a Group Sequence Policy Optimization (GSPO) method to improve learning efficiency and to address the sparsity of rewards in physics problem-solving [20][23].

Group 3: Open Source and Collaboration
- The entire pipeline, from model architecture to evaluation datasets and the intelligent agent framework, has been fully open-sourced [9].
- The PhysicsMinions framework, consisting of three interacting modules (Visual Studio, Logic Studio, and Review Studio), was designed to improve the model's reasoning quality [30][33].
- The collaborative approach within PhysicsMinions allows continuous refinement of answers through a structured review process [30][33].

Group 4: Competitive Edge
- P1-235B-A22B took 12 gold medals and 1 silver across the 13 competitions, placing it among the top models in the field [34][38].
- The lightweight P1-30B-A3B also performed well, securing 8 gold, 4 silver, and 1 bronze medal, ranking third among open-source models [38].
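The article names GSPO but does not reproduce its formula. As a rough orientation, the core idea of sequence-level policy optimization in this family of methods can be sketched as follows: sample a group of candidate solutions per problem, normalize rewards within the group to get advantages, and clip a length-normalized sequence-level importance ratio. The function name, epsilon value, and input shapes below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gspo_objective(logp_new, logp_old, seq_lens, rewards, eps=0.2):
    """Sketch of a GSPO-style surrogate for one group of G sampled solutions.

    logp_new, logp_old: shape (G,), summed token log-probs of each full
        solution under the current policy and the sampling policy.
    seq_lens: shape (G,), token counts, used to length-normalize ratios.
    rewards: shape (G,), scalar reward per solution (e.g. 1.0 if the
        final physics answer is correct, else 0.0).
    """
    # Group-normalized advantage: compare each sampled solution
    # against the mean reward of its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level importance ratio, a geometric mean over tokens;
    # operating at the sequence level (not per token) is the point of GSPO.
    ratio = np.exp((logp_new - logp_old) / seq_lens)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Clipped surrogate objective, to be maximized by gradient ascent.
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

When the current policy equals the sampling policy, the ratio is 1 and the objective reduces to the mean group-normalized advantage, which is zero by construction — the expected behavior for an on-policy first step.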
RewardMap: Solving Sparse Rewards in Fine-Grained Visual Reasoning via Multi-Stage Reinforcement Learning
机器之心· 2025-10-21 03:43
Core Insights
- The article discusses RewardMap, a multi-stage reinforcement learning framework designed to strengthen the fine-grained visual reasoning of multimodal large language models (MLLMs) in complex scenarios such as high-resolution subway maps [3][9][17].

Group 1: Problem Identification
- Recent advances in large language models (LLMs) and multimodal large models (MLLMs) have raised questions about their ability to interpret complex visual information, particularly in high-resolution, densely structured environments [3].
- The team's earlier work, ReasonMap, showed that even state-of-the-art MLLMs frequently err in path planning: misreading lines, missing stations, and repeating routes [3][12].

Group 2: Proposed Solution
- The team introduced RewardMap, a multi-stage reinforcement learning framework that combines fine-grained rewards with curriculum-based training to improve MLLMs' visual understanding and spatial reasoning [3][10].
- RewardMap decomposes complex route-planning tasks into smaller, assessable sub-goals, yielding a more nuanced feedback signal than a binary correct/incorrect reward [10][11].

Group 3: Implementation Details
- RewardMap builds on ReasonMap with a dataset covering 30 cities and 4,018 problem samples, categorized into five types to provide detailed supervision during the reinforcement learning phase [6][12].
- The reward function has three components — format compliance, final correctness, and detail — with difficulty weights applied to reflect the true complexity of each task [11][12].

Group 4: Performance Results
- RewardMap delivered consistent gains across benchmarks, with a maximum improvement of 13.51% on the SpatialEval metric over traditional methods [13][14].
- Qualitative comparisons showed that models trained with RewardMap exhibited fewer visual confusions and hallucinations, producing more accurate route information [14][15].

Group 5: Future Outlook
- RewardMap's value extends beyond benchmark numbers: by systematically decomposing complex problems into measurable sub-goals, it offers a reusable reinforcement learning paradigm for high-resolution visual tasks [17].
- Its effectiveness in enhancing the general capabilities of multimodal large models has been validated, suggesting that real-world data such as maps will play a significant role in future developments [18].
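The three-part reward described above (format compliance, final correctness, and detail rewards, scaled by difficulty weights) can be sketched as a simple scoring function. All names, the weight `w_detail`, and the way the difficulty factor is applied are illustrative assumptions; the article does not give the exact formula.

```python
def rewardmap_score(format_ok, answer_ok, detail_hits, detail_total,
                    difficulty, w_detail=0.5):
    """Hypothetical sketch of a RewardMap-style three-part reward.

    format_ok: whether the model's output follows the required format.
    answer_ok: whether the final route/answer is correct.
    detail_hits / detail_total: how many fine-grained sub-goals
        (e.g. correct line names, station counts) were satisfied.
    difficulty: per-task weight reflecting true task complexity.
    """
    r_format = 1.0 if format_ok else 0.0
    r_answer = 1.0 if answer_ok else 0.0
    # Partial credit for sub-goals turns the sparse pass/fail signal
    # into a dense one, which is the framework's central idea.
    r_detail = detail_hits / detail_total if detail_total else 0.0
    return difficulty * (r_answer + w_detail * r_detail) + r_format
```

Under this sketch, a hard task with a correct answer and most details right scores well above an easy task solved with a lucky guess, giving the policy gradient a graded signal even when final answers are mostly wrong early in training.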