Multimodal Reasoning

Trained only on math, yet beats o1 in physics, chemistry, and biology! A new reinforcement learning algorithm delivers notable performance gains and alleviates training collapse
量子位· 2025-06-23 04:45
Core Viewpoint - The article introduces CPGD (Clipped Policy Gradient Optimization with Policy Drift), a new reinforcement learning algorithm that significantly enhances model stability and performance on multimodal reasoning tasks, outperforming traditional algorithms like GRPO and RLOO [1][6][11].

Group 1: Algorithm Development
- CPGD alleviates training instability and improves performance, achieving an average gain of 11% over models trained with GRPO [1][14].
- The MM-Eureka-CPGD-7B model shows a 21.8% improvement on the MMK12 test set over its base model QwenVL2.5-7B, demonstrating superior generalization [1][14].
- The algorithm introduces a logarithmic treatment of policy ratios and a policy drift term to stabilize training and control policy changes, proving more effective than existing methods [8][11].

Group 2: Model Performance
- The MM-Eureka-CPGD-32B model surpasses the o1 model across various subjects, despite being trained solely on mathematical datasets [2][14].
- The MM-Eureka series has gained significant attention, with over 10,000 downloads and nearly 100 citations since its release [3][14].
- Performance metrics indicate that MM-Eureka-CPGD-7B outperforms leading models like OpenAI-o1 and GPT-4o across multiple datasets [13][15].

Group 3: Data and Framework
- The MMK12 dataset, containing over 15,000 multimodal math reasoning questions, addresses the problems of single-type questions and inaccurate answers and has become a key benchmark for multimodal reasoning [16][17].
- The multimodal reinforcement learning framework built on OpenRLHF supports various models and algorithms, enhancing scalability and stability for large-scale training [4][5].
- MM-PRM (Multi-modal Process Reward Model) focuses on the reasoning process, providing a structured approach to evaluating and guiding model inference [18][21].

Group 4: Future Directions
- The combination of PRM and reinforcement learning is seen as a promising area for further exploration, aiming to enhance model robustness and interpretability on complex reasoning tasks [22][24].
- The company plans to continue advancing multimodal reasoning training and systematic optimization, inviting community participation in the development [25].
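The summary names only CPGD's ingredients: a policy ratio handled in log space, clipping, and a "policy drift" term that restrains how far the policy moves per update. A minimal sketch under those assumptions follows; the function name, the clip rule, the quadratic drift penalty, and all coefficients are illustrative guesses, not the paper's actual objective.

```python
def cpgd_loss(logp_new, logp_old, advantages, clip_eps=0.2, drift_coef=0.1):
    """Hedged sketch of a CPGD-style objective: the policy ratio is treated
    in log space and clipped, and a drift penalty discourages large moves
    away from the old policy. All names/coefficients are assumptions."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = lp_new - lp_old                         # log(pi_new / pi_old)
        clipped = max(-clip_eps, min(clip_eps, log_ratio))  # clip in log space
        pg = min(log_ratio * adv, clipped * adv)            # pessimistic (clipped) term
        drift = 0.5 * log_ratio ** 2                        # quadratic "policy drift" penalty
        total += -(pg - drift_coef * drift)                 # minimize negative objective
    return total / len(advantages)

loss = cpgd_loss([-1.0, -0.5], [-1.2, -0.7], [1.0, -0.5])
```

Working in log space keeps the update bounded even when the raw ratio explodes, which matches the article's claim that CPGD mitigates training collapse.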
Embodied multimodal reasoning under a unified framework: 自变量机器人 lets AI put down Heidegger's hammer
机器之心· 2025-06-18 06:09
Core Viewpoint - The article emphasizes the need for a paradigm shift in robotics from modular systems to a unified architecture that enables embodied intelligence, allowing robots to process perception, reasoning, and action simultaneously, akin to human cognition [4][10][34].

Current Paradigm Limitations
- Existing mainstream methods treat different modalities as independent modules, leading to inherent flaws in information processing and understanding [6][7].
- The representation bottleneck results in unavoidable compression losses when transferring information between modality encoders, hindering deep cross-modal understanding of the physical world [7].
- The structural disconnection prevents models from learning intuitive causal relationships across modalities, which is essential for true physical intelligence [8].

Unified Architecture: From Division to Integration
- The proposed unified-modality architecture aims to eliminate artificial boundaries between the visual, linguistic, and action modalities, processing them as a single information flow [4][10].
- The core of this architecture is unified representation learning, converting all modality information into a shared high-dimensional token sequence [11].
- A multi-task, multimodal generation mechanism serves as supervision, compelling the model to establish deep cross-modal correspondences [12].

Emergent Capabilities: Embodied Multi-Modal Reasoning
- The unified architecture unlocks comprehensive embodied multimodal reasoning capabilities that current modular systems cannot achieve [16].
- Symbol-space reasoning allows robots to deconstruct abstract shapes into concrete representations and perform physical operations based on this understanding [17].
- Physical-space reasoning enables robots to understand the implications of actions for structural stability and to articulate their reasoning processes [19][20].
- The system can autonomously explore complex environments by integrating visual observations, spatial memory, and common knowledge into coherent reasoning chains [22].

Conclusion
- The transition to a unified architecture is crucial for enabling robots to interact seamlessly with the physical world, integrating perception, understanding, and action without the delays and losses associated with modular systems [30][31].
- This shift is not merely incremental but a fundamental evolution necessary for achieving embodied intelligence capable of cross-modal causal reasoning and spatial logic [34].
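The "shared high-dimensional token sequence" idea above can be illustrated with a toy sketch: each modality's features pass through a per-modality projection into one common embedding width, then everything is concatenated into a single sequence. The dimension, the padding-based stand-in for a learned projection, and all example values are assumptions for illustration only.

```python
DIM = 4  # shared embedding width (illustrative)

def embed(vectors, dim=DIM):
    """Stand-in for a learned per-modality projection: pad or trim each
    feature vector to the shared width so all tokens look alike downstream."""
    return [(v + [0.0] * dim)[:dim] for v in vectors]

vision_tokens = embed([[0.2, 0.7], [0.9, 0.1, 0.5]])   # e.g. image-patch features
text_tokens   = embed([[1.0], [0.0, 1.0]])             # e.g. subword embeddings
action_tokens = embed([[0.3, 0.3, 0.3, 0.3, 0.9]])     # e.g. discretized joint targets

# One flat sequence: attention over it sees all modalities jointly,
# with no lossy hand-off between separate modality encoders.
sequence = vision_tokens + text_tokens + action_tokens
```

The point of the sketch is structural: once vision, language, and action share one token space, there is no inter-module interface where the "representation bottleneck" losses described above can occur.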
Scoring 139 on the gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint - The article reviews how various AI models performed on the 2025 gaokao mathematics exam, highlighting the competitive landscape of AI model capabilities and, in particular, Xiaomi's MiMo-VL model, which performed impressively despite its small parameter count [2][4][20].

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 with 144 points [2].
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and only one point below OpenAI's o3 [4].
- MiMo-VL outperformed Qwen2.5-VL-7B, a model of the same parameter size, by 56 points, showcasing its superior capabilities [5].

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated from uploaded screenshots of the questions, while the other models received text input [6].
- The evaluation comprised 14 objective questions (73 points total) and 5 free-response questions (77 points total) [7].

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 out of 40 on single-choice questions and achieved full marks on the multiple-choice and fill-in-the-blank questions [8][10][11].
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12].

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of MiMo, its first reasoning-focused large model, which has shown significant improvements in reasoning capability [14].
- MiMo-VL, the successor to MiMo-7B, has demonstrated substantial advances on multimodal reasoning tasks, outperforming much larger models such as Qwen-2.5-VL-72B [20].
- The model's performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29].

Group 5: Open Source and Accessibility
- MiMo-VL-7B's technical report, model weights, and evaluation framework have been open-sourced, promoting transparency and accessibility in AI development [32].
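As a quick consistency check, the reported sub-scores do add up to the 139-point total. The 33-point figure below is inferred from the stated 73-point objective total minus the 40-point single-choice section; it is not given explicitly in the article.

```python
# Reported breakdown: 14 objective questions worth 73 points,
# 5 free-response questions worth 77 points.
single_choice = 35          # out of 40
other_objective = 73 - 40   # full marks reported on multiple-choice + fill-in-the-blank
free_response = 71          # out of 77
total = single_choice + other_objective + free_response
```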
Interview with Zhang Xiangyu: multimodal reasoning and autonomous learning are the next two "GPT-4 moments"
海外独角兽· 2025-06-09 04:23
This installment is an interview by Li Guangmi (李广密), CEO of Shixiang (拾象), with Zhang Xiangyu (张祥雨), chief scientist of the large-model company StepFun (阶跃星辰), first published on the podcast 张小珺商业访谈录.

Zhang Xiangyu focuses on the multimodal field. He proposed DreamLLM, one of the industry's earliest multimodal large-model architectures to unify image-text generation and understanding; building on this framework, StepFun released Step-1V, China's first native multimodal large model with over a hundred billion parameters. His academic influence is also remarkable, with more than 370,000 total paper citations.

The industry has long anticipated a multimodal model that unifies understanding and generation, but to this day such a model has not appeared. How can the multimodal field reach its GPT-4 moment? In this conversation, Xiangyu draws on his research and practice in multimodality to share, from a purely technical perspective, fresh thinking on the field's key problems. In his view, although progress in language models has been extremely fast, the difficulty of multimodal generation and understanding has been underestimated:

• In the next 2-3 years, the multimodal field will see two GPT-4 moments: multimodal reasoning and autonomous learning;

• Unifying multimodal generation and understanding is hard because language has weak control over vision, image-text alignment is imprecise, data quality is limited, and the generation module usually cannot feed back into the understanding module;

• After models scale to trillions of parameters, while text generation and knowledge Q&A improve, reasoning ability, especially in mathematics, ...
A new benchmark for multimodal reasoning! Top model Gemini 2.5 Pro scores only 60; from Fudan, CUHK, Shanghai AI Lab, and others
量子位· 2025-06-06 13:45
Contributed by the MME team — 量子位 | WeChat account QbitAI

Logical reasoning is a core capability of human intelligence and a key capability for multimodal large language models (MLLMs). With the emergence of LLMs with strong reasoning ability such as DeepSeek-R1, researchers have begun exploring how to bring reasoning into multimodal large models (MLLMs).

However, most existing benchmarks lack an explicit taxonomy of logical reasoning types and a clear understanding of logical reasoning, often conflating perceptual ability or breadth of knowledge with reasoning ability.

Against this background, Fudan University and MMLab at The Chinese University of Hong Kong, together with Shanghai AI Laboratory and other institutions, propose MME-Reasoning, which aims to comprehensively evaluate the reasoning ability of multimodal large models.

The results show that the best model scores only around 60%.

MME-Reasoning: comprehensively evaluating multimodal reasoning

Following Charles Sanders Peirce's classification, reasoning falls into three types: deductive, inductive, and abductive. MME-Reasoning uses this taxonomy as the standard for a comprehensive evaluation of multimodal large models' reasoning abilities.

Deductive reasoning uses rules and premises to derive conclusions. Inductive reas ...
The first slow-thinking framework dedicated to multimodality! Beats GPT-o1 by nearly 7 percentage points as reinforcement learning teaches VLMs to "think twice before acting"
量子位· 2025-06-06 13:45
Core Insights - The article discusses the limitations of "slow thinking" models like GPT-o1 and DeepSeek-R1 in multimodal reasoning scenarios compared with "fast thinking" models like GPT-4o, highlighting that the slow-thinking models perform similarly or worse on certain benchmarks [1][2].

Group 1: Challenges in Multi-Modal Reasoning
- The research identifies two main challenges in developing slow-thinking capabilities in vision-language models (VLMs): "vanishing advantages" and "reflective inertia" [2][3].
- "Vanishing advantages" occurs when all responses to a query receive the same reward, leading to a significant increase in zero-advantage samples during training, which hampers the model's learning [3][4].
- Reflective inertia in VLMs is attributed to their reliance on visual perception and a lack of diverse reflective patterns in pre-training data, making them less capable of engaging in deep reasoning [5][6].

Group 2: VL-Rethinker Framework
- To address the shortage of high-quality training data, the research team developed the ViRL39K dataset, which includes 38,870 high-quality multimodal reasoning questions across eight themes [7][8][9].
- The VL-Rethinker framework incorporates two key innovations: Selective Sample Replay (SSR) and Forced Rethinking [17].
- SSR dynamically stores and replays high-value training samples to mitigate the vanishing-advantages issue and enhance training efficiency [19][20].
- Forced Rethinking triggers a second round of reasoning after the model generates an initial response, promoting diverse reflective behaviors [21][25].

Group 3: Experimental Results
- The VL-Rethinker model achieved significant improvements on multimodal reasoning tasks, outperforming GPT-o1 on MathVista (80.4% vs. 73.4%) and MathVerse (63.5% vs. 57.0%) [27].
- On multi-disciplinary understanding tests, VL-Rethinker achieved 55.9% on MMMU-Pro and 38.5% on EMMA, setting new state-of-the-art performance levels [28].
- The model's iterative improvements demonstrated the effectiveness of SSR and the potential of slow thinking in multimodal contexts, with notable gains over the base model Qwen2.5-VL-72B [29].
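The summary describes SSR only as "dynamically storing and replaying high-value training samples" to counter vanishing advantages. A toy sketch under that description follows: zero-advantage samples (which carry no policy gradient) are filtered out, the buffer keeps the samples with the largest |advantage|, and replay is weighted toward them. The class name, capacity rule, and sampling scheme are illustrative assumptions, not the paper's exact mechanism.

```python
import random

class SelectiveReplayBuffer:
    """Hedged sketch of Selective Sample Replay (SSR): keep only samples
    with non-zero advantage and bias replay toward larger |advantage|."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.samples = []  # list of (|advantage|, sample) pairs, largest first

    def add(self, sample, advantage, eps=1e-6):
        if abs(advantage) < eps:      # zero-advantage samples carry no gradient
            return
        self.samples.append((abs(advantage), sample))
        self.samples.sort(key=lambda pair: pair[0], reverse=True)
        del self.samples[self.capacity:]   # keep only the highest-value samples

    def replay(self, k, rng=random):
        """Draw k samples with probability proportional to |advantage|."""
        weights = [w for w, _ in self.samples]
        picked = rng.choices(self.samples, weights=weights,
                             k=min(k, len(self.samples)))
        return [s for _, s in picked]
```

The design intuition matches the article's claim: when most of a batch has zero advantage, replaying the rare informative samples keeps the effective gradient signal from collapsing.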
Brokerage morning-meeting highlights: low-valuation embodied-intelligence names and dividend assets remain in favor
Xin Lang Cai Jing· 2025-06-03 00:49
Group 1
- The market experienced fluctuations with the ChiNext index leading the decline; sectors such as pork, innovative drugs, banks, and CROs saw gains, while gold, glyphosate, controllable nuclear fusion, humanoid robots, environmental equipment, and consumer electronics faced losses [1]
- CITIC Securities highlighted that low-valuation embodied-intelligence application targets and dividend assets continue to attract market interest, suggesting a focus on "AI + robotics" investment opportunities beyond humanoid robots [2]
- CICC emphasized that multimodal reasoning is crucial for enhancing intelligent-driving capabilities, with significant advancements expected in the algorithms of leading smart-driving companies [2]

Group 2
- Huatai Securities pointed out that core assets like the A50 and major financial sectors are likely to shift from resilience revaluation to growth revaluation, showing strong fundamentals through the real-estate investment-cycle adjustment [3]
- A50 non-financial ROE is expected to stabilize and recover ahead of the overall non-financial sector, driven by cost improvements and shareholder returns [3]
- The current valuation of these companies reflects a higher implied cost of equity than the market average, indicating potential for a significant reduction in risk premium if investors reassess the overlooked growth resilience [3]
CICC: multimodal reasoning helps elevate intelligent-driving capabilities; the related theme deserves attention
news flash· 2025-06-03 00:32
Core Insights
- Google's Gemini 2.5, released in March, enables multimodal fusion reasoning [1]
- Companies such as Starry Sky, SenseTime, and MiniMax launched multimodal reasoning results between April and May, indicating significant technological progress [1]
- The integration of multimodal chains of thought is driving multimodal and reasoning models toward a unified architecture, enhancing multimodal understanding capabilities [1]

Industry Developments
- The recent advances in multimodal reasoning are expected to extend application scenarios, particularly in the automotive sector, with companies like Li Auto and NIO applying multimodal reasoning in user interactions [1]
- Ongoing innovation in technical architecture is likely to continue driving the expansion of application scenarios across the industry [1]
- Multimodal reasoning is becoming an increasingly important main line of development [1]
CICC | AI Insights (9): multimodal reasoning breakthroughs extend toward in-vehicle scenarios
中金点睛· 2025-06-02 23:45
By 于钟海, 魏鹳霏, 肖楷, 赵丽萍 — CICC Research

Taking MiniMax's new V-Triune framework as an example, a unified reasoning-perception framework has been preliminarily validated for scalability and generalization. V-Triune unifies visual reasoning and perception tasks under a reinforcement learning framework through a three-layer component architecture: 1) multimodal sample data formatting; 2) verifier-based reward computation, using an asynchronous client-server architecture that decouples reward computation from the main training loop; 3) data-source-level metric monitoring, which aids traceability and improves stability. Combined with engineering optimizations such as a dynamic IoU reward mechanism and frozen ViT parameters, the Orsta model series (32B parameters) achieved up to a 14.1% performance gain on the MEGA-Bench Core benchmark.

Multimodal reasoning helps elevate intelligent-driving capabilities. In intelligent driving, multimodal reasoning is an important path to strengthening the recognition and judgment of road traffic signs and improving generalization in complex scenarios, and it is becoming a focus of algorithm evolution at leading intelligent-driving companies. On May 30, 2025, the first version of NIO's world model NVM officially began rolling out; it offers full-scene understanding, imaginative reconstruction, and reasoning, can understand and extrapolate real-time multimodal environmental information, and shows notable performance gains in scenarios such as choosing the optimal ETC lane and autonomous wayfinding in parking lots. In addition, Li Auto's self-developed VLA large model also has chain-of-thought reasoning capability, using multimodal reasoning to simulate how a human driver thinks.

Figure 1: MiniMax multimodal RL ...
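The decoupling described for V-Triune's second layer (the trainer submits rollouts to a reward server and keeps training while verifier rewards are computed asynchronously) can be sketched with a queue and a worker thread. The names, the trivial exact-match "verifier", and the single-worker setup are illustrative assumptions; the real system is a distributed client-server service.

```python
import queue
import threading

requests, results = queue.Queue(), queue.Queue()

def reward_worker():
    """Stand-in reward server: consumes rollouts, emits (id, reward) pairs."""
    while True:
        item = requests.get()
        if item is None:                                  # shutdown signal
            break
        rollout_id, prediction, target = item
        reward = 1.0 if prediction == target else 0.0     # toy exact-match verifier
        results.put((rollout_id, reward))

worker = threading.Thread(target=reward_worker, daemon=True)
worker.start()

# Trainer side: enqueue rollouts without blocking the training loop,
# then collect rewards when they are ready.
requests.put((0, "A", "A"))
requests.put((1, "B", "C"))
requests.put(None)
worker.join()

rewards = dict(results.get() for _ in range(2))
```

The benefit mirrored here is the one the text credits to the design: slow reward computation never stalls the main training loop, because the two sides communicate only through queues.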