UCSD's New Method Tops the MMMU Multimodal Reasoning Leaderboard, Surpassing GPT-5 and Gemini
36Kr · 2025-09-19 06:58
Core Insights
- DreamPRM, developed by a research team from the University of California, San Diego, has achieved the top ranking on the MMMU (Massive Multi-discipline Multimodal Understanding) leaderboard, demonstrating significant advances in the reasoning capabilities of large language models (LLMs) [1][18]
- The Process Reward Model (PRM) supervises intermediate reasoning steps rather than only final answers, improving the model's ability to select promising problem-solving paths [1]
- DreamPRM-1.5 refines the weighting mechanism from the domain level to the instance level, allowing the model to exploit the potential value of each individual training sample [4][5]

Model Architecture and Training Framework
- DreamPRM-1.5 employs a dual-layer (bi-level) optimization framework that dynamically adjusts sample weights according to reasoning performance, so the learning process responds to how effective those samples actually are; a simplified loop is sketched after this summary [11][19]
- Two complementary architectures, Instance Table and Instance Net, implement the sample-level weighting (both are sketched in code below):
  - Instance Table assigns an independent weight parameter to each training sample; it suits smaller datasets but becomes unwieldy on larger ones, since the parameter count grows with dataset size [10]
  - Instance Net instead predicts weights with a small MLP, keeping the parameter count fixed and making it better suited to large-scale training [10]

Performance and Results
- On the MMMU benchmark, DreamPRM-1.5 achieved 84.6% accuracy with the Instance Table and 83.6% with the Instance Net, significantly outperforming baseline models [15][16]
- The model surpassed other top-performing systems, including GPT-5 (84.2%) and Gemini 2.5 Pro Deep-Think (84.0%), confirming its effectiveness on multimodal reasoning tasks [18][20]

Conclusion and Future Directions
- The success of instance-level reweighting in multimodal reasoning training underscores the importance of data quality, and of exploiting it in a fine-grained way, for future model research [19][20]
- Refined sample weighting and process scoring methods are expected to be key drivers of progress in multimodal reasoning [19]
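To make the two weighting architectures concrete, here is a minimal sketch of what an Instance Table and an Instance Net could look like. This is not the authors' implementation: it assumes PyTorch, and all class and variable names are hypothetical. It only illustrates the trade-off described above, namely a per-sample parameter table versus a fixed-size predictor network.

```python
import torch
import torch.nn as nn

class InstanceTable(nn.Module):
    """One learnable weight per training sample; the parameter count
    grows linearly with dataset size."""
    def __init__(self, num_samples: int):
        super().__init__()
        # Raw logits; a sigmoid keeps the effective weights in (0, 1).
        self.logits = nn.Parameter(torch.zeros(num_samples))

    def forward(self, sample_ids: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.logits[sample_ids])

class InstanceNet(nn.Module):
    """A small MLP that predicts a weight from per-sample features
    (e.g., an embedding or loss statistics); the parameter count stays
    fixed regardless of dataset size."""
    def __init__(self, feature_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

# Usage: weight a batch of per-sample PRM losses before reduction.
table = InstanceTable(num_samples=10_000)
ids = torch.tensor([3, 17, 42])
losses = torch.rand(3)                   # stand-in per-sample losses
weighted = (table(ids) * losses).mean()  # weighted training objective
```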
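And here is a toy sketch of the dual-layer loop itself. It uses a one-step-lookahead (first-order) approximation in the style of learning-to-reweight methods, rather than the exact bilevel gradient; the "PRM" is reduced to a linear step scorer and all data is synthetic, so this only conveys the structure: the lower level trains the PRM under the current sample weights, and the upper level moves the weights to reduce loss on a held-out meta set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

prm = nn.Linear(8, 1)                          # toy stand-in for the PRM
weight_logits = nn.Parameter(torch.zeros(100)) # one logit per training sample
lr_inner = 0.1

train_x, train_y = torch.randn(100, 8), torch.rand(100, 1)
meta_x, meta_y = torch.randn(20, 8), torch.rand(20, 1)  # held-out meta set

inner_opt = torch.optim.SGD(prm.parameters(), lr=lr_inner)
outer_opt = torch.optim.SGD([weight_logits], lr=0.5)

for step in range(100):
    # Upper level: take a *virtual* inner step, keep the graph, and ask
    # how the meta loss would change as a function of the sample weights.
    pred = torch.sigmoid(train_x @ prm.weight.t() + prm.bias)
    per_sample = ((pred - train_y) ** 2).squeeze(1)
    inner_loss = (torch.sigmoid(weight_logits) * per_sample).mean()
    gW, gb = torch.autograd.grad(inner_loss, (prm.weight, prm.bias),
                                 create_graph=True)
    W_fast, b_fast = prm.weight - lr_inner * gW, prm.bias - lr_inner * gb
    meta_pred = torch.sigmoid(meta_x @ W_fast.t() + b_fast)
    meta_loss = ((meta_pred - meta_y) ** 2).mean()
    outer_opt.zero_grad()
    meta_loss.backward()
    outer_opt.step()

    # Lower level: real PRM update under the (now fixed) sample weights.
    w = torch.sigmoid(weight_logits).detach()
    pred = torch.sigmoid(prm(train_x))
    loss = (w * ((pred - train_y) ** 2).squeeze(1)).mean()
    inner_opt.zero_grad()
    loss.backward()
    inner_opt.step()
```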
Farewell to Data "Noise": UCSD's New Large-Model Reasoning Method DreamPRM Acts as a "Signal Amplifier" and Tops the MathVista Leaderboard
机器之心 · 2025-07-10 10:49
Core Viewpoint
- DreamPRM, developed by a research team from the University of California, San Diego, has achieved the top position on the MathVista mathematical reasoning leaderboard, showcasing significant advances in multimodal reasoning capability [1][6][22]

Summary by Sections

Introduction
- DreamPRM uses a dual-layer (bi-level) optimization framework to enhance the reasoning abilities of multimodal large language models (MLLMs), addressing challenges such as data-quality imbalance and distribution shift [2][12]

Methodology
- The core innovation of DreamPRM is to cast the training of the process reward model (PRM) as a differentiable bi-level optimization problem, dynamically adjusting domain weights to mitigate these issues in multimodal reasoning [12][22]
- The lower-level phase trains the PRM parameters across 15 diverse training domains, assigning each domain a dynamic weight that reflects its contribution to the overall loss function (a weighted aggregation of this kind is sketched in code below) [13][14]
- The upper-level phase evaluates the PRM's generalization on a carefully constructed metadata set covering 30 disciplines and 183 subfields, and updates the domain weights accordingly [12][14]

Performance Results
- DreamPRM demonstrated superior performance across five benchmark tests, consistently outperforming other PRM methods, with gains of 2-3% over the original PRM trained without data selection [16][22]
- With only 8 billion parameters, the model outperformed larger closed-source models such as GPT-4v and Gemini-1.5 on most benchmarks, indicating strong reasoning capability [16][22]
- DreamPRM's accuracy improves as the number of candidate reasoning chains (CoTs) grows (see the best-of-N selection sketch below), and it also lifts performance when applied to stronger models such as GPT-4.1-mini and o4-mini [19][20]

Conclusion
- DreamPRM effectively addresses data-quality imbalance and distribution shift in training multimodal process reward models, achieving notable improvements, particularly on complex mathematical reasoning tasks [22]
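As a minimal illustration of the domain-level reweighting described in the methodology, the sketch below aggregates per-domain PRM losses under learnable weights; the upper-level loop would then tune those weights against the metadata set. This assumes PyTorch, the names are hypothetical, and it is not the authors' code.

```python
import torch
import torch.nn as nn

NUM_DOMAINS = 15  # number of training domains, per the article
domain_logits = nn.Parameter(torch.zeros(NUM_DOMAINS))

def aggregate(domain_losses: torch.Tensor) -> torch.Tensor:
    """domain_losses: shape (NUM_DOMAINS,), each domain's mean PRM loss
    for the current batch."""
    # Softmax keeps the weights positive and summing to one, so the
    # reweighting shifts emphasis between domains rather than rescaling
    # the whole loss.
    weights = torch.softmax(domain_logits, dim=0)
    return (weights * domain_losses).sum()

losses = torch.rand(NUM_DOMAINS)  # stand-in per-domain losses
total = aggregate(losses)         # differentiable w.r.t. domain_logits
```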
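Finally, here is a pure-Python sketch of PRM-guided best-of-N selection, the mechanism by which more candidate CoTs can translate into higher accuracy: every intermediate step is scored, scores are aggregated per chain, and the best chain is kept. The `score_step` function is a hypothetical stand-in for a real PRM call, not part of DreamPRM.

```python
from statistics import fmean

def score_step(question: str, step: str) -> float:
    """Stand-in PRM: returns a quality score in [0, 1] for one step.
    A real system would query a trained process reward model here."""
    return min(1.0, len(step) / 80)  # toy heuristic, not a real PRM

def select_best_chain(question: str, chains: list[list[str]]) -> list[str]:
    # Aggregate step scores per chain; mean is used here, while min
    # would be a stricter "weakest step" criterion.
    scored = [(fmean(score_step(question, s) for s in chain), chain)
              for chain in chains]
    return max(scored, key=lambda t: t[0])[1]

candidates = [
    ["Compute the area.", "A = pi * r**2 = 3.14 * 4 = 12.56."],
    ["Area of a circle is pi r squared.", "r = 2, so A = pi * 4 = 12.57."],
]
best = select_best_chain("Circle of radius 2: area?", candidates)
```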