Farewell to data "noise": DreamPRM, a new LLM reasoning method from UCSD, acts as a "signal amplifier" and tops the MathVista leaderboard

机器之心· 2025-07-10 10:49

Core Viewpoint - DreamPRM, developed by a research team at the University of California, San Diego, has taken first place on the MathVista mathematical reasoning leaderboard, demonstrating substantial progress in multimodal reasoning [1][6][22].

Summary by Sections

Introduction - DreamPRM uses a bi-level optimization framework to improve the reasoning of multimodal large language models (MLLMs), targeting two problems: imbalanced data quality across training sets and distribution shift [2][12].

Methodology
- The core innovation of DreamPRM is to formulate the training of the process reward model (PRM) as a differentiable bi-level optimization problem in which domain weights are adjusted dynamically to mitigate these problems [12][22].
- The lower-level optimization trains the PRM parameters on 15 diverse training domains, assigning each domain a dynamic weight that sets its contribution to the overall loss function [13][14].
- The upper-level optimization evaluates the PRM's generalization on a carefully constructed meta dataset covering 30 disciplines and 183 subfields, providing the signal used to adjust the domain weights [12][14].

Performance Results
- Across five benchmarks, DreamPRM consistently outperforms other PRM methods and improves by 2-3% over the original PRM trained without data selection [16][22].
- With only 8 billion parameters, the model outperforms larger closed-source models such as GPT-4V and Gemini-1.5 on most benchmarks [16][22].
- Accuracy improves as the number of candidate chains of thought (CoTs) increases, and further gains are observed when DreamPRM is applied to stronger models such as GPT-4.1-mini and o4-mini [19][20].
Conclusion - DreamPRM effectively addresses the challenges of data quality imbalance and distribution shift in training multimodal process reward models, achieving notable improvements in performance, particularly in complex mathematical reasoning tasks [22].
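
Illustrative sketch - The bi-level scheme described under Methodology can be illustrated on a toy problem. The code below is not DreamPRM's implementation: it swaps the PRM for a two-parameter linear regressor, uses two synthetic domains (one clean, one with flipped labels) in place of the 15 training domains, and approximates the upper-level gradient with a single unrolled lower-level step, which is a common choice for differentiable bi-level problems and is assumed here rather than taken from the source. All names (`bilevel_reweight`, `make_domain`, the learning rates) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mse_grad(theta, X, y):
    # Gradient of 0.5 * mean((X @ theta - y) ** 2) with respect to theta.
    return X.T @ (X @ theta - y) / len(y)

def bilevel_reweight(domains, meta, steps=500, lr_theta=0.1, lr_alpha=1.0):
    """Toy domain reweighting with a one-step unrolled hypergradient (assumption)."""
    X_meta, y_meta = meta
    theta = np.zeros(X_meta.shape[1])  # model parameters (stand-in for the PRM)
    alpha = np.zeros(len(domains))     # one logit per domain; weight = sigmoid(alpha)
    for _ in range(steps):
        w = sigmoid(alpha)
        grads = np.stack([mse_grad(theta, X, y) for X, y in domains])  # shape (K, d)
        # Lower level: one gradient step on the domain-weighted training loss.
        theta_next = theta - lr_theta * (w @ grads)
        # Upper level: gradient of the meta loss at theta_next, pushed back
        # through the step above by the chain rule:
        #   d theta_next / d alpha_k = -lr_theta * grads[k] * w_k * (1 - w_k)
        g_meta = mse_grad(theta_next, X_meta, y_meta)
        hypergrad = -lr_theta * (grads @ g_meta) * w * (1.0 - w)
        alpha = alpha - lr_alpha * hypergrad
        theta = theta_next
    return theta, sigmoid(alpha)

# Synthetic data: domain 0 is clean, domain 1 has sign-flipped targets.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])

def make_domain(n, corrupt):
    X = rng.normal(size=(n, 2))
    target = -true_theta if corrupt else true_theta
    y = X @ target + 0.05 * rng.normal(size=n)
    return X, y

domains = [make_domain(200, corrupt=False), make_domain(200, corrupt=True)]
meta = make_domain(100, corrupt=False)  # small clean set playing the meta dataset role

theta, weights = bilevel_reweight(domains, meta)
print("domain weights:", weights.round(3))
print("theta:", theta.round(3))
```

In this toy setting the clean domain should end up with the larger weight, because its gradients align with the meta-loss gradient while those of the corrupted domain oppose it. That is the "signal amplifier" idea in miniature: domains that help generalization on the meta set are up-weighted, and noisy ones are suppressed.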