Multimodal Reasoning
Perception Error Rate Down 30.5%: An Implicit Perception Loss Makes Models Actively "Open Their Eyes" | UIUC & Alibaba Tongyi
量子位· 2025-07-11 04:00
Core Viewpoint
- The article introduces PAPO (Perception-Aware Policy Optimization), a new reinforcement learning algorithm developed by the University of Illinois Urbana-Champaign and Alibaba's Tongyi Lab, which strengthens multimodal reasoning by integrating perception into the learning process [1][3].

Group 1: Introduction of PAPO
- PAPO addresses the limitations of existing reinforcement learning algorithms such as GRPO, which excel at text reasoning but struggle in multimodal scenarios because they make poor use of visual information [2][3].
- The algorithm introduces an innovative implicit perception loss that relies on internal supervisory signals, allowing multimodal models to learn perception alongside reasoning [3][6].

Group 2: Error Analysis and Findings
- A systematic error analysis revealed that the primary bottleneck in multimodal reasoning is the accuracy of visual perception, not logical reasoning capability [6][7].
- An analysis of 200 error cases from a Qwen2.5-VL-3B model trained with GRPO showed that 67% of errors stemmed from perception inaccuracies, while only 18% were reasoning errors [9][14].

Group 3: Technical Innovations of PAPO
- PAPO's core innovations are a perception information gain ratio and the maximization of a KL divergence term that encourages different output distributions for original and corrupted images [19][20].
- The complete PAPO objective is presented as a simple extension of GRPO that adds this KL divergence term [21].

Group 4: Experimental Validation
- Comprehensive evaluations on eight multimodal reasoning benchmarks showed PAPO consistently outperforming GRPO, with an overall average improvement of 4.4% and a significant 30.5% reduction in perception errors [26][28].
- PAPO exhibited faster convergence and more stable training dynamics than GRPO, showing improvements as early as 25 training steps [29][30].

Group 5: Visual Dependency Analysis
- An analysis of visual dependency in mainstream multimodal reasoning benchmarks indicated that many tasks can be answered correctly without the visual input, i.e. their genuine visual dependency is limited [50][51].
- PAPO showed its largest gains on high-visual-dependency tasks, nearly an 8% improvement, while maintaining consistent gains on medium- and low-dependency tasks [53][54].

Group 6: Practical Applications
- Several application cases illustrate PAPO's effectiveness on complex geometry problems, such as accurately computing relationships in right triangles and distinguishing between different objects [55][63][64].
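The implicit perception loss described in Group 3 can be pictured in a few lines. The sketch below is a minimal NumPy version under our own naming (the paper's actual loss form and coefficients may differ): it computes the KL divergence between the policy's next-token distributions on the original and on a corrupted image, the quantity PAPO maximizes so that masking the image must visibly change the model's output.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def implicit_perception_bonus(logits_orig, logits_masked):
    """Per-token KL(p_orig || p_masked), averaged over the sequence.

    If corrupting the image barely shifts the output distribution, the
    model is ignoring its visual input, so PAPO rewards a large gap here.
    """
    p = softmax(logits_orig)
    q = softmax(logits_masked)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return kl.mean()

# The full objective is then a simple extension of GRPO (gamma assumed):
#   L_PAPO = L_GRPO - gamma * implicit_perception_bonus(...)
rng = np.random.default_rng(0)
same = rng.normal(size=(6, 32))                    # 6 tokens, 32-way vocab
print(implicit_perception_bonus(same, same))       # 0.0: identical inputs
print(implicit_perception_bonus(same, rng.normal(size=(6, 32))) > 0)  # True
```

Because the bonus enters the loss with a negative sign, gradient descent pushes the two distributions apart, which is the "open its eyes" effect the headline refers to.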
Farewell to Data "Noise": UCSD's New Large-Model Reasoning Method DreamPRM Acts as a "Signal Amplifier" and Tops the MathVista Leaderboard
机器之心· 2025-07-10 10:49
Core Viewpoint
- DreamPRM, developed by a research team from the University of California, San Diego, has reached the top of the MathVista mathematical reasoning leaderboard, demonstrating significant advances in multimodal reasoning [1][6][22].

Summary by Sections

Introduction
- DreamPRM uses a dual-layer (bilevel) optimization framework to enhance the reasoning abilities of multimodal large language models (MLLMs), addressing challenges such as data quality imbalance and distribution shift [2][12].

Methodology
- DreamPRM's core innovation is to cast the training of the process reward model (PRM) as a differentiable bilevel optimization problem, dynamically adjusting domain weights to mitigate these issues in multimodal reasoning [12][22].
- The lower-level optimization trains PRM parameters across 15 diverse training domains, assigning each domain a dynamic weight that reflects its contribution to the overall loss [13][14].
- The upper-level optimization uses a carefully constructed metadata set covering 30 disciplines and 183 subfields to evaluate the PRM's generalization ability [12][14].

Performance Results
- DreamPRM demonstrated superior performance across five benchmark tests, consistently outperforming other PRM methods with gains of 2-3% over the original PRM trained without data selection [16][22].
- With only 8 billion parameters, the model outperformed larger closed-source models such as GPT-4V and Gemini-1.5 on most benchmarks, indicating strong reasoning capability [16][22].
- DreamPRM's accuracy improves as the number of candidate reasoning chains (CoTs) increases, and further gains were observed when it was applied to stronger models such as GPT-4.1-mini and o4-mini [19][20].

Conclusion
- DreamPRM effectively addresses data quality imbalance and distribution shift in training multimodal process reward models, achieving notable performance improvements, particularly on complex mathematical reasoning tasks [22].
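The dual-layer optimization in the Methodology section can be illustrated with a toy version. The sketch below is our own construction, not the paper's code: a linear scorer stands in for the PRM and is trained on 15 synthetic domains (the first 5 deliberately noisy), while domain-weight logits are nudged toward domains whose gradients agree with the gradient on a clean metadata set, a cheap proxy for differentiating through the inner update as DreamPRM's bilevel formulation does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all shapes, data, and step sizes are illustrative): 15
# training domains, the first 5 with noisy labels; a small clean
# metadata set drives the upper-level domain weights.
D, N = 15, 40
X = rng.normal(size=(D, N, 2))
true_theta = np.array([1.0, -2.0])
noise = np.where(np.arange(D) < 5, 2.0, 0.1)
y = X @ true_theta + noise[:, None] * rng.normal(size=(D, N))
X_meta = rng.normal(size=(200, 2))
y_meta = X_meta @ true_theta

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def meta_loss(th):
    return np.mean((X_meta @ th - y_meta) ** 2)

theta = np.zeros(2)      # lower-level "PRM" parameters
alpha = np.zeros(D)      # upper-level domain-weight logits
initial = meta_loss(theta)

for _ in range(300):
    w = softmax(alpha)
    # Lower level: one gradient step on the domain-weighted training loss
    grads = np.stack([2 * Xd.T @ (Xd @ theta - yd) / N
                      for Xd, yd in zip(X, y)])
    theta = theta - 0.05 * (w[:, None] * grads).sum(axis=0)
    # Upper level: upweight domains whose gradient aligns with the
    # metadata-set gradient (stepping on them also reduces the meta loss)
    g_meta = 2 * X_meta.T @ (X_meta @ theta - y_meta) / len(y_meta)
    alpha = alpha + 0.01 * (grads @ g_meta)

print(meta_loss(theta) < initial)  # True: reweighted training generalizes better
```

The "signal amplifier" intuition is visible in the upper-level step: domains whose gradients consistently point the same way as the clean metadata gradient accumulate weight, while noisy domains are damped.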
Trained Only on Math, Yet Beats o1 in Physics, Chemistry, and Biology! A New Reinforcement Learning Algorithm Brings Significant Performance Gains and Alleviates Training Collapse
量子位· 2025-06-23 04:45
Core Viewpoint
- The article introduces a new reinforcement learning algorithm, CPGD (Clipped Policy Gradient Optimization with Policy Drift), which significantly improves model stability and performance on multimodal reasoning tasks, outperforming traditional algorithms such as GRPO and RLOO [1][6][11].

Group 1: Algorithm Development
- CPGD alleviates training instability and improves performance, achieving an average gain of 11% over models trained with GRPO [1][14].
- The MM-Eureka-CPGD-7B model improves on the MMK12 test set by 21.8% over the base model QwenVL2.5-7B, demonstrating superior generalization [1][14].
- The algorithm applies a logarithmic treatment to policy ratios and adds a policy drift term to stabilize training and control policy change, proving more effective than existing methods [8][11].

Group 2: Model Performance
- The MM-Eureka-CPGD-32B model surpasses the o1 model across multiple subjects despite being trained solely on mathematical datasets [2][14].
- The MM-Eureka series has attracted significant attention, with over 10,000 downloads and nearly 100 citations since release [3][14].
- Performance metrics indicate that MM-Eureka-CPGD-7B outperforms leading models such as OpenAI o1 and GPT-4o across multiple datasets [13][15].

Group 3: Data and Framework
- The MMK12 dataset, containing over 15,000 multimodal math reasoning questions, addresses problems of single-type questions and inaccurate answers, and has become a key benchmark for multimodal reasoning tasks [16][17].
- The multimodal reinforcement learning framework built on OpenRLHF supports various models and algorithms, improving scalability and stability for large-scale training [4][5].
- The MM-PRM (Multimodal Process Reward Model) focuses on the reasoning process, providing a structured way to evaluate and guide model inference [18][21].

Group 4: Future Directions
- Combining PRM with reinforcement learning is seen as a promising direction for improving model robustness and interpretability on complex reasoning tasks [22][24].
- The team plans to continue advancing multimodal reasoning training and systematic optimization, inviting community participation in the development [25].
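The two stabilizers described in Group 1, the logarithmic treatment of the policy ratio and the policy drift term, can be sketched as follows. This is our own minimal reading of the summary, not the released CPGD implementation; the exact clipping form, coefficients, and drift definition in the paper may differ.

```python
import numpy as np

def cpgd_style_loss(logp_new, logp_old, advantages,
                    clip_eps=0.2, drift_coef=0.1):
    """Toy per-token objective combining the two ideas above.

    1) Policy-gradient term on the *log* ratio, clipped: extreme
       importance weights grow linearly rather than exponentially.
    2) 'Policy drift' penalty: a non-negative KL-style distance
       (exp(x) - 1 - x >= 0) that keeps the new policy near the old one.
    """
    log_ratio = logp_new - logp_old
    clipped = np.clip(log_ratio, -clip_eps, clip_eps)
    # Pessimistic (PPO-style) choice between unclipped and clipped terms
    pg = -np.minimum(log_ratio * advantages, clipped * advantages)
    drift = np.exp(log_ratio) - 1.0 - log_ratio
    return (pg + drift_coef * drift).mean()

logp_old = np.log(np.array([0.2, 0.5, 0.1]))
adv = np.array([1.0, -1.0, 0.5])
print(cpgd_style_loss(logp_old, logp_old, adv))  # 0.0: no pressure at the start
```

Note how the drift term stays positive whenever the new policy moves away from the old one, even on zero-advantage tokens, which is the mechanism the summary credits with suppressing training collapse.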
Embodied Multimodal Reasoning under a Unified Framework: 自变量机器人 Lets AI Put Down Heidegger's Hammer
机器之心· 2025-06-18 06:09
Core Viewpoint
- The article argues for a paradigm shift in robotics from modular systems to a unified architecture for embodied intelligence, allowing robots to process perception, reasoning, and action simultaneously, akin to human cognition [4][10][34].

Current Paradigm Limitations
- Existing mainstream methods treat different modalities as independent modules, leading to inherent flaws in information processing and understanding [6][7].
- The representation bottleneck causes unavoidable compression losses when information passes between modality-specific encoders, hindering deep cross-modal understanding of the physical world [7].
- The structural disconnection prevents models from learning intuitive causal relationships across modalities, which is essential for true physical intelligence [8].

Unified Architecture: From Division to Integration
- The proposed unified modality architecture eliminates artificial boundaries between the visual, linguistic, and action modalities, processing them as a single information flow [4][10].
- At its core is unified representation learning, which converts information from all modalities into a shared high-dimensional token sequence [11].
- A multi-task, multimodal generation mechanism serves as supervision, compelling the model to establish deep cross-modal correspondences [12].

Emergent Capabilities: Embodied Multimodal Reasoning
- The unified architecture unlocks comprehensive embodied multimodal reasoning capabilities that current modular systems cannot achieve [16].
- Symbol-space reasoning allows robots to deconstruct abstract shapes into concrete representations and perform physical operations based on that understanding [17].
- Physical-space reasoning enables robots to understand how actions affect structural stability and to articulate their reasoning process [19][20].
- The system can autonomously explore complex environments by integrating visual observations, spatial memory, and common knowledge into coherent reasoning chains [22].

Conclusion
- The transition to a unified architecture is crucial for robots to interact seamlessly with the physical world, integrating perception, understanding, and action without the delays and losses of modular systems [30][31].
- This shift is not merely incremental but a fundamental evolution required for embodied intelligence capable of cross-modal causal reasoning and spatial logic [34].
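The unified representation learning described above can be pictured with a toy tokenization step: each modality is mapped into the same embedding space and concatenated into one sequence for a single model to attend over, with no hand-built bridges between encoders. Every dimension, projection, and function name here is an illustrative assumption, not the company's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared embedding width (illustrative)

# Stand-ins for modality-specific tokenizers: each maps raw input into
# tokens in the SAME d-dimensional space.
def embed_image(patches):          # (n_patches, patch_dim) -> (n_patches, d)
    W = rng.normal(size=(patches.shape[1], d)) * 0.02
    return patches @ W

def embed_text(token_ids, vocab=1000):  # (n_tokens,) -> (n_tokens, d)
    E = rng.normal(size=(vocab, d)) * 0.02
    return E[token_ids]

def embed_actions(joint_states):   # (n_steps, n_joints) -> (n_steps, d)
    W = rng.normal(size=(joint_states.shape[1], d)) * 0.02
    return joint_states @ W

img = embed_image(rng.normal(size=(16, 48)))   # 16 image patches
txt = embed_text(np.array([3, 17, 256]))       # 3 language tokens
act = embed_actions(rng.normal(size=(5, 7)))   # 5 steps of a 7-joint arm

# One unified sequence: [vision | language | action]
sequence = np.concatenate([img, txt, act], axis=0)
print(sequence.shape)  # (24, 64)
```

Once everything lives in one sequence, a single transformer can be supervised to generate any modality from any other, which is the multi-task generation mechanism the summary describes.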
139 Points on the Gaokao Math Exam! Xiaomi's 7B Model Rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article reviews how various AI models performed on the 2025 Gaokao mathematics exam, highlighting the competitive landscape of AI model capabilities and focusing on Xiaomi's MiMo-VL model, which performed impressively despite its small parameter count [2][4][20].

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 with 144 points [2].
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and finishing only one point behind OpenAI's o3 [4].
- MiMo-VL outscored Qwen2.5-VL-7B, a model of the same size, by 56 points [5].

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated on uploaded screenshots of the questions, while the other models received text input [6].
- The evaluation comprised 14 objective questions (73 points total) and 5 free-response questions (77 points total) [7].

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 of 40 on the single-choice questions and earned full marks on the multiple-choice and fill-in-the-blank questions [8][10][11].
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12].

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of its first inference-focused large model, MiMo, which shows significant improvements in reasoning capability [14].
- MiMo-VL, the successor to MiMo-7B, demonstrates substantial advances on multimodal reasoning tasks, outperforming much larger models such as Qwen2.5-VL-72B [20].
- Its performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29].

Group 5: Open Source and Accessibility
- MiMo-VL-7B's technical report, model weights, and evaluation framework have been open-sourced, promoting transparency and accessibility in AI development [32].
Interview with 张祥雨: Multimodal Reasoning and Autonomous Learning Are the Two Future "GPT-4 Moments"
海外独角兽· 2025-06-09 04:23
This issue is 拾象 CEO 李广密's interview with 张祥雨, chief scientist of the large-model company 阶跃星辰, first published on the podcast 「张小珺商业访谈录」.

张祥雨 focuses on the multimodal field. He proposed DreamLLM, a multimodal large-model framework that was among the industry's earliest architectures to unify image-text generation and understanding; building on it, 阶跃星辰 released Step-1V, China's first natively multimodal large model with over a hundred billion parameters. His academic influence is also remarkable: his papers have been cited more than 370,000 times in total.

The industry has long awaited a multimodal model that unifies understanding and generation, but to this day such a model has not appeared. How can the multimodal field reach its GPT-4 moment? In this conversation, 祥雨 draws on his own research and practice in multimodality to share, from a purely technical perspective, new thinking on the field's key problems. In his view, although language models have progressed extremely fast, the difficulty of multimodal generation and understanding has been underestimated:

• In the next 2-3 years, the multimodal field will see two GPT-4 moments: multimodal reasoning and autonomous learning;
• Unified multimodal generation and understanding is hard to achieve because language has weak control over vision, image-text alignment is imprecise, data quality is limited, and the generation module often cannot feed back into the understanding module;
• After models scale to a trillion parameters, text generation and knowledge Q&A improve, but reasoning ability, especially in mathematics, ...
Interview with 张祥雨: Multimodal Reasoning and Autonomous Learning Are the Two Future "GPT-4 Moments"
海外独角兽· 2025-06-08 04:51
This issue is 拾象 CEO 李广密's interview with 张祥雨, chief scientist of the large-model company 阶跃星辰.

张祥雨 focuses on the multimodal field. He proposed DreamLLM, a multimodal large-model framework that was among the industry's earliest architectures to unify image-text generation and understanding; building on it, 阶跃星辰 released Step-1V, China's first natively multimodal large model with over a hundred billion parameters. His academic influence is also remarkable: his papers have been cited more than 370,000 times in total.

The industry has long awaited a multimodal model that unifies understanding and generation, but to this day such a model has not appeared. How can the multimodal field reach its GPT-4 moment? In this conversation, 祥雨 draws on his own research and practice in multimodality to share, from a purely technical perspective, new thinking on the field's key problems. In his view, although language models have progressed extremely fast, the difficulty of multimodal generation and understanding has been underestimated:

• In the next 2-3 years, the multimodal field will see two GPT-4 moments: multimodal reasoning and autonomous learning;
• The technical essence of the o1 paradigm is eliciting a Meta CoT chain of thought: the model is allowed to backtrack, retry, and choose different branches at key nodes, turning the reasoning process from a single line into a graph structure.

Contents

01 Research Main Line: Returning to Large Models
• Unified multimodal generation and understanding is hard to achieve because language has weak control over vision, image-text alignment is imprecise, ...
A New Benchmark for Multimodal Reasoning! Even the Strongest Model, Gemini 2.5 Pro, Scores Only 60; Produced by Fudan, CUHK, Shanghai AI Lab, and Others
量子位· 2025-06-06 13:45
Contributed by the MME team to 量子位 | QbitAI

Logical reasoning is a core capability of human intelligence and a key capability for multimodal large language models (MLLMs). With the emergence of LLMs with strong reasoning abilities such as DeepSeek-R1, researchers have begun exploring how to bring reasoning capability into multimodal large models (MLLMs).

However, most existing benchmarks lack an explicit taxonomy of logical reasoning types and a clear understanding of logical reasoning itself, often conflating perceptual ability or breadth of knowledge with reasoning ability.

Against this backdrop, Fudan University and the MMLab of the Chinese University of Hong Kong, together with the Shanghai AI Laboratory and several other institutions, propose MME-Reasoning to comprehensively evaluate the reasoning capability of multimodal large models. The results show that even the best model scores only around 60%.

MME-Reasoning: Comprehensively Evaluating Multimodal Reasoning Capability

Following Charles Sanders Peirce's taxonomy, reasoning falls into three categories: deductive, inductive, and abductive. MME-Reasoning uses this taxonomy as its standard for a comprehensive evaluation of multimodal models' reasoning ability.

Deductive reasoning uses rules and premises to derive a conclusion. Inductive reasoning ...
The First Multimodal-Specific Slow-Thinking Framework! Nearly 7 Points Above GPT-o1; Reinforcement Learning Teaches VLMs to "Think Twice Before Acting"
量子位· 2025-06-06 13:45
Core Insights
- The article discusses the limitations of "slow thinking" models such as GPT-o1 and DeepSeek-R1 in multimodal reasoning scenarios compared with "fast thinking" models such as GPT-4o, noting that the slow-thinking models perform similarly or worse on certain benchmarks [1][2].

Group 1: Challenges in Multimodal Reasoning
- The research identifies two main challenges in developing slow-thinking capabilities in vision-language models (VLMs): "vanishing advantages" and "reflective inertia" [2][3].
- "Vanishing advantages" occurs when all responses to a query receive the same reward, so the share of zero-advantage samples grows sharply during training and hampers the model's learning [3][4].
- Reflective inertia in VLMs is attributed to their reliance on visual perception and the scarcity of diverse reflective patterns in pre-training data, leaving them less able to engage in deep reasoning [5][6].

Group 2: VL-Rethinker Framework
- To address the shortage of high-quality training data, the team built the ViRL39K dataset, comprising 38,870 high-quality multimodal reasoning questions across eight themes [7][8][9].
- The VL-Rethinker framework incorporates two key innovations: Selective Sample Replay (SSR) and Forced Rethinking [17].
- SSR dynamically stores and replays high-value training samples to mitigate the vanishing-advantages issue and improve training efficiency [19][20].
- Forced Rethinking triggers a second round of reasoning after the model generates an initial response, promoting diverse reflective behaviors [21][25].

Group 3: Experimental Results
- VL-Rethinker achieved significant gains on multimodal reasoning tasks, outperforming GPT-o1 on MathVista (80.4% vs. 73.4%) and MathVerse (63.5% vs. 57.0%) [27].
- On multi-disciplinary understanding tests, VL-Rethinker reached 55.9% on MMMU-Pro and 38.5% on EMMA, setting new state-of-the-art results [28].
- The model's iterative improvements demonstrate the effectiveness of SSR and the potential of slow thinking in multimodal settings, with notable gains over the base model Qwen2.5-VL-72B [29].
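Selective Sample Replay, as summarized in Group 2, can be sketched as a small buffer that keeps non-zero-advantage rollouts and tops up batches dominated by zero-advantage samples. Buffer size, thresholds, and method names below are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque

class SelectiveSampleReplay:
    """Toy sketch of SSR: keep rollouts whose group-relative advantage is
    non-zero and replay them when a fresh batch is dominated by
    zero-advantage samples (the 'vanishing advantages' failure mode).
    """
    def __init__(self, capacity=512, min_effective=8):
        self.buffer = deque(maxlen=capacity)
        self.min_effective = min_effective

    def filter_and_store(self, batch):
        # Zero-advantage samples contribute no gradient; drop them,
        # keep the rest for later replay.
        effective = [s for s in batch if abs(s["advantage"]) > 1e-8]
        self.buffer.extend(effective)
        return effective

    def build_batch(self, fresh_batch):
        effective = self.filter_and_store(fresh_batch)
        # Top up with replayed high-value samples if too few survive
        deficit = self.min_effective - len(effective)
        if deficit > 0 and self.buffer:
            effective += random.sample(list(self.buffer),
                                       min(deficit, len(self.buffer)))
        return effective

ssr = SelectiveSampleReplay(min_effective=4)
batch = [{"advantage": 1.0}, {"advantage": 0.0},
         {"advantage": 0.0}, {"advantage": -0.5}]
out = ssr.build_batch(batch)
print(len(out))                                   # 4: two survivors + two replays
print(all(abs(s["advantage"]) > 0 for s in out))  # True
```

Every sample the optimizer sees therefore carries signal, which is how SSR keeps training efficient even when most fresh rollouts receive identical rewards.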
Brokerage Morning Meeting Highlights: Low-Valuation Embodied-Intelligence Names and Dividend Assets Remain in Favor
Xin Lang Cai Jing· 2025-06-03 00:49
Group 1
- The market fluctuated, with the ChiNext index leading the decline; pork, innovative drugs, banks, and CROs gained, while gold, glyphosate, controllable nuclear fusion, humanoid robots, environmental equipment, and consumer electronics fell [1]
- CITIC Securities noted that low-valuation embodied-intelligence names and dividend assets continue to attract market interest, suggesting a focus on "AI + robotics" investment opportunities beyond humanoid robots [2]
- CICC emphasized that multimodal reasoning is crucial for advancing intelligent driving capabilities, with significant algorithm progress expected from leading smart-driving companies [2]

Group 2
- Huatai Securities pointed out that core assets such as the A50 and major financial sectors are likely to shift from resilience revaluation to growth revaluation, showing strong fundamentals through the real-estate investment-cycle adjustment [3]
- A50 non-financial ROE is expected to stabilize and recover ahead of the overall non-financial sector, driven by cost improvements and shareholder returns [3]
- Current valuations imply a cost of equity above the market average, indicating potential for a significant reduction in risk premium if investors reassess the overlooked growth resilience [3]