Multimodal Reasoning

ICCV 2025 | ECD: A High-Quality Synthetic Chart Dataset That Improves Open-Source MLLMs' Chart Understanding
机器之心· 2025-08-21 13:08
The first author, Yuwei Yang, is from the Australian National University; collaborators include Zeyu Zhang (Australian National University), Yunzhong Hou (Australian National University), Zhuowan Li (Johns Hopkins University), Gaowen Liu (Cisco), Ali Payani (Cisco), Yuan-Sen Ting (Ohio State University), and Liang Zheng (Australian National University).

Background and Motivation
- In scientific research, news reporting, and data analysis, charts are a core vehicle for conveying information. For multimodal large language models (MLLMs) to genuinely serve scientific research, they must have two capabilities:
  1. Accurately recognizing and understanding chart elements (axes, legends, data points, titles, etc.);
  2. Performing deep reasoning over chart data (computing differences, comparing trends, reasoning across subplots, etc.).
- However, even the most advanced open-source MLLMs still hover at 30%–50% accuracy on challenging scientific chart understanding benchmarks. Although synthetic datasets are easy to generate, they typically suffer from the following problems:
  - Monotonous style: lacking visual and content diversity;
  - Limited realism: large distribution gaps from real charts;
  - Restricted data patterns: generated chart data is too simple to simulate complex scenarios.

Dataset Highlights
- Paper title: Effective Training Data Synthesis for Improving MLLM Chart Understanding
- Paper link: h ...
4o-mini's Chinese team lead has also departed, and this time Zuckerberg isn't to blame
量子位· 2025-08-19 01:17
Core Viewpoint
- OpenAI's former key researcher Kevin Lu has left to join Thinking Machine Lab, a new AI startup co-founded by former OpenAI CTO Mira Murati, which has reached a valuation of $12 billion [3][19].

Group 1: Kevin Lu's Background and Contributions
- Kevin Lu has a strong background in reinforcement learning and small model development, having previously worked at Hudson River Trading, Meta, and OpenAI [5][6].
- At OpenAI, he led the development of the 4o-mini model, a small multimodal reasoning model that supports text and image input and is designed for complex tasks at higher speed and lower cost [7][9].
- His most cited paper, "Decision Transformer: Reinforcement Learning via Sequence Modeling," has been cited 2,254 times and presents a framework for treating reinforcement learning as conditional sequence modeling (a minimal sketch of this framing follows below) [10][11].

Group 2: Thinking Machine Lab
- Thinking Machine Lab has attracted several former core researchers from OpenAI, including John Schulman and Barrett Zoph, and has recently completed a record-breaking $2 billion seed funding round [4][17].
- The startup has not yet publicly disclosed any results, which has generated significant anticipation within the AI community [21].
- Despite competitive offers from other tech giants, the team members at Thinking Machine Lab have chosen to remain, indicating strong confidence in the startup's potential [20].
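The Decision Transformer framing cited above treats RL as conditional sequence modeling: a trajectory is serialized as interleaved (return-to-go, state, action) tokens, and a causal transformer is trained to predict the action at each step conditioned on the desired return. Below is a minimal, hedged sketch of that serialization and prediction flow in Python; the module sizes, embedding scheme, and class name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the Decision Transformer framing: RL as conditional
# sequence modeling over (return-to-go, state, action) tokens.
# All sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    def __init__(self, state_dim=4, act_dim=2, d_model=64, n_layers=2, max_len=20):
        super().__init__()
        # Separate embeddings for return-to-go, state, and action tokens.
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                  # interleave (R_t, s_t, a_t) per step
        tokens = tokens + self.pos(torch.arange(3 * T))
        causal = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
        hidden = self.backbone(tokens, mask=causal)
        # Predict each action from the hidden state at its state-token position,
        # so the prediction is conditioned only on the target return and history.
        return self.predict_action(hidden[:, 1::3])      # (B, T, act_dim)

# Usage: condition on a high desired return and read off predicted actions.
model = TinyDecisionTransformer()
rtg = torch.full((1, 5, 1), 10.0)                        # target return-to-go
states, actions = torch.randn(1, 5, 4), torch.zeros(1, 5, 2)
print(model(rtg, states, actions).shape)                 # torch.Size([1, 5, 2])
```

At inference time, conditioning on a high target return-to-go and reading the prediction at each state-token position is what lets the same sequence model behave like a return-conditioned policy.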
Zhipu launches GLM-4.5V, the world's strongest 100B-class open-source multimodal model: SOTA on 41 leaderboards
IPO早知道· 2025-08-12 01:52
Core Viewpoint
- The article discusses the launch of GLM-4.5V, a state-of-the-art open-source visual reasoning model by Zhipu, which is a significant step towards achieving Artificial General Intelligence (AGI) [3][4].

Group 1: Model Overview
- GLM-4.5V features a total of 106 billion parameters, with 12 billion activation parameters, and is designed for multi-modal reasoning, which is essential for AGI [3][4].
- The model builds on the previous GLM-4.1V-Thinking, showcasing enhanced performance across various visual tasks, including image, video, and document understanding [4][6].

Group 2: Performance Metrics
- In 41 public multi-modal benchmarks, GLM-4.5V achieved state-of-the-art (SOTA) performance, outperforming other models in tasks such as general visual question answering (VQA) and visual grounding [5][6].
- Specific performance metrics include a general VQA score of 88.2 on MMBench v1.1 and 91.3 on RefCOCO-avg for visual grounding tasks [5].

Group 3: Technical Features
- The model incorporates a visual encoder, MLP adapter, and language decoder, supporting 64K multi-modal long contexts and enhancing video processing efficiency through 3D convolution (a schematic sketch of this encoder-adapter-decoder flow follows below) [6][8].
- It utilizes a three-stage training strategy: pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL), which collectively improve its multi-modal understanding and reasoning capabilities [8].

Group 4: Practical Applications
- Zhipu has developed a desktop assistant application that leverages GLM-4.5V for real-time screen capture and processing various visual reasoning tasks, enhancing user interaction and productivity [8][9].
- The company aims to empower developers through model open-sourcing and API services, encouraging innovative applications of multi-modal models [9].
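The encoder-adapter-decoder layout mentioned under Group 3 is a common MLLM pattern: a vision encoder turns image patches into visual tokens, a small MLP adapter projects them into the language model's embedding space, and the language decoder consumes the combined multimodal sequence. The sketch below is a hedged, minimal illustration of that data flow only; all dimensions, module choices, and names are assumptions and do not reflect GLM-4.5V's actual implementation (which additionally uses 3D convolution for video and a 64K context).

```python
# Hedged sketch of a generic vision-encoder -> MLP-adapter -> language-decoder
# pipeline. Dimensions, modules, and names are illustrative assumptions only.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Stand-in for a ViT: maps flattened image patches to visual tokens."""
    def __init__(self, patch_pixels=3 * 14 * 14, vis_dim=1024):
        super().__init__()
        self.proj = nn.Linear(patch_pixels, vis_dim)

    def forward(self, patches):                  # (B, num_patches, patch_pixels)
        return self.proj(patches)                # (B, num_patches, vis_dim)

class MLPAdapter(nn.Module):
    """Projects visual tokens into the language model's embedding space."""
    def __init__(self, vis_dim=1024, lm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(),
                                 nn.Linear(lm_dim, lm_dim))

    def forward(self, vis_tokens):
        return self.net(vis_tokens)              # (B, num_patches, lm_dim)

# Stand-in for the language decoder; a real MLLM would use a causal LM here.
lm_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True), 2)

encoder, adapter = VisionEncoder(), MLPAdapter()
image_patches = torch.randn(1, 256, 3 * 14 * 14)   # one image as 256 patches
text_embeds = torch.randn(1, 32, 2048)             # embedded text prompt

visual_embeds = adapter(encoder(image_patches))
# Concatenate visual and text tokens into one multimodal context sequence.
multimodal_context = torch.cat([visual_embeds, text_embeds], dim=1)
hidden = lm_backbone(multimodal_context)           # (1, 288, 2048)
print(hidden.shape)
```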
GPT-5
小熊跑的快· 2025-08-07 22:41
Core Viewpoint
- The launch of GPT-5 represents a significant advancement in artificial intelligence, showcasing improvements in various applications such as coding, health, and visual perception, while reducing the model's hallucination rate and enhancing reasoning capabilities [1][2].

Group 1: Model Capabilities
- GPT-5 is a unified system that can efficiently respond to a wide range of queries, utilizing a more advanced reasoning model to tackle complex problems [2].
- The model has shown significant improvements in coding, particularly in generating and debugging complex front-end applications, websites, and games [3].
- In health-related applications, GPT-5 outperforms previous models, providing more accurate and context-aware responses, and acting as a supportive partner for users [4].

Group 2: Performance Metrics
- GPT-5 has demonstrated a notable reduction in hallucination rates, with a 45% lower chance of factual errors compared to GPT-4o and an 80% reduction compared to OpenAI o3 during reasoning tasks [11].
- The model's honesty in responses has improved, with a significant decrease in the rate of misleading answers, dropping from 4.8% in OpenAI o3 to 2.1% in GPT-5 [13].

Group 3: Accessibility and User Experience
- GPT-5 is being rolled out to all Plus, Pro, Team, and Free users, with Enterprise and Edu access expected shortly [14].
- Professional subscribers enjoy unlimited access to GPT-5 and its Pro version, while free users will transition to a mini version upon reaching usage limits [14].
量子位智库 (QbitAI Think Tank): Report on Core AI Achievements and Trends for the First Half of 2025
Sou Hu Cai Jing· 2025-08-02 23:06
Application Trends
- General-purpose Agents are becoming mainstream, integrating tool usage to complete diverse deep research tasks, with a focus on visual operations [1][11].
- Vertical Agents are emerging in various scenarios such as travel and design, with natural language control becoming part of workflows [1][15].
- AI programming is growing rapidly, with leading applications experiencing significant revenue growth and continuous product evolution [1][16].

Model Trends
- Reasoning capabilities are continuously improving, especially on mathematical and coding problems, with large models becoming more agentic and enhancing their tool usage capabilities [1][24].
- Multi-modal reasoning is being integrated, enhancing the ability to generate images and videos, while small models are gaining popularity at an accelerating pace [1][25].
- The Model Context Protocol (MCP) is gaining attention, providing a standardized interface for efficient and secure external data calls (a minimal request sketch follows below), although it has not yet reached large-scale production [1][19][21].

Technology Trends
- Training resources are shifting towards post-training and reinforcement learning, with the importance of reinforcement learning increasing [1][27].
- Multi-agent systems are becoming a frontier paradigm, and online learning is emerging as a core breakthrough direction [1][27].
- The Transformer model architecture is iterating rapidly, with hybrid architectures emerging, and code verification is becoming a forefront direction for AI programming automation [1][27].

Industry Trends
- The gap between leading players in model capabilities is narrowing, with OpenAI's competitive advantage weakening as Google and xAI catch up [2].
- The competition gap between the US and China in large models is shrinking, with China showing strong performance in multi-modal areas [2].
- AI programming is becoming a critical battleground, with major players both domestically and internationally intensifying their efforts [2].
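MCP, noted under Model Trends, standardizes how models call external tools and data sources over JSON-RPC 2.0. As a hedged illustration, the snippet below builds one plausible client-side tool-invocation request as a Python dictionary; the tool name and arguments are hypothetical, and a real client would send this over a transport such as stdio or HTTP and match the server's response by id.

```python
# Hedged sketch of an MCP-style tool call. MCP messages follow JSON-RPC 2.0;
# the specific tool name and arguments below are hypothetical examples.
import json

tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",          # MCP method for invoking a server-side tool
    "params": {
        "name": "search_documents",  # hypothetical tool exposed by an MCP server
        "arguments": {"query": "multimodal reasoning", "limit": 5},
    },
}

# A client would serialize this, send it over the chosen transport,
# then pair the reply to this request via the "id" field.
print(json.dumps(tool_call_request, indent=2))
```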
What makes a reasoning model truly useful? StepFun's Step 3: open-source, multimodal, low-cost, and adapted to domestic chips
量子位· 2025-07-27 11:57
Core Viewpoint
- The article emphasizes the significance of Step 3, the new multi-modal reasoning model released by StepFun, which fills a gap in the current AI landscape by combining strong reasoning capabilities with efficiency and open-source accessibility [4][5][25].

Group 1: Model Features
- Step 3 is a 321-billion-parameter MoE model with multi-modal reasoning capabilities, set to be officially open-sourced on July 31 [5][24].
- It achieved new state-of-the-art (SOTA) results on open-source multi-modal reasoning benchmarks [6].
- The model's decoding cost at inference time is only one-third of DeepSeek's, demonstrating superior efficiency [8].

Group 2: Market Trends
- The industry is shifting towards multi-modal models, with reasoning capabilities becoming a focal point as generative AI enters a reasoning era [10].
- Efficiency, cost, and deployment friendliness are now critical factors in evaluating model performance, beyond leaderboard ranking alone [11][12].

Group 3: Innovations in Step 3
- Step 3 incorporates two key innovations: the AFD distributed inference system and the MFA attention mechanism, which together enhance decoding efficiency and reduce inference costs [31][35].
- AFD separates Attention and FFN workloads to optimize resource usage, while MFA improves KV-cache and computation efficiency [32][36].

Group 4: Cost Efficiency
- Step 3's design allows it to operate at significantly lower cost than competitors, reaching only 30% of DeepSeek-V3's cost on certain hardware [42].
- The model supports FP8 quantization, further reducing memory access and latency (a minimal quantization sketch follows after this summary) [41].

Group 5: Industry Collaboration
- StepFun has initiated the "Model-Chip Ecological Innovation Alliance" with nearly ten chip and infrastructure partners to enhance model adaptability and computational efficiency [54][55].
- This collaboration aims to ensure that the model can run effectively on various hardware, including domestic chips [51][52].

Group 6: Application and Market Potential
- StepFun's multi-modal reasoning models have been successfully integrated into various applications, including automotive and mobile devices, with significant interest from top manufacturers [60][69].
- The company anticipates revenue of nearly 1 billion RMB in 2025, indicating a clear commercialization path for its technology [74].
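Group 4 mentions FP8 quantization as one of Step 3's cost levers. As a hedged illustration of the general idea only (not StepFun's actual kernels), the sketch below performs per-tensor symmetric quantization into the FP8 E4M3 range and the matching dequantization; the scaling scheme and tensor sizes are assumptions.

```python
# Hedged sketch of per-tensor FP8 (E4M3) quantization: scale values into the
# representable range, cast down, and keep the scale for dequantization.
# This illustrates the general idea only, not Step 3's actual implementation.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def quantize_fp8(x: torch.Tensor):
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) * scale

weights = torch.randn(4, 4)
w_fp8, s = quantize_fp8(weights)
w_restored = dequantize_fp8(w_fp8, s)
print((weights - w_restored).abs().max())  # small quantization error
```

Storing weights or activations this way roughly halves memory traffic versus 16-bit formats, which is where the latency savings mentioned above come from.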
Zebra-CoT: A pioneering visual chain-of-thought dataset debuts, lifting multimodal reasoning accuracy by 13%
具身智能之心· 2025-07-24 09:53
Core Viewpoint
- The article discusses the development of Zebra-CoT, a large-scale and diverse dataset aimed at enhancing visual reasoning capabilities in multi-modal models, addressing the challenges of existing visual CoT performance and the lack of high-quality training data [3][4].

Dataset Construction
- Zebra-CoT consists of 182,384 samples, providing logically interleaved text-image reasoning trajectories across four main task categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic and strategy games (a sketch of one such sample follows after this summary) [6][12].
- The dataset overcomes limitations of existing datasets by offering a diverse range of tasks and ensuring high-quality text reasoning data, unlike previous datasets that focused on single tasks or lacked clear reasoning structures [6][18].

Task Coverage
- The dataset covers four major task categories:
  - Scientific reasoning includes geometry, physics, chemistry, and algorithm problems [9].
  - 2D visual reasoning encompasses visual search and visual puzzles [9].
  - 3D visual reasoning involves multi-hop object counting and robot planning [9].
  - Visual logic and strategy games feature chess, checkers, mazes, and more [9].

Data Sources and Processing
- Real-world data is sourced from online resources, ensuring high-quality problem extraction and addressing issues of logical connections between modalities [10].
- Synthetic data is generated using templates and vision-language models (VLMs) to enhance reasoning diversity and expressiveness [10].

Model Fine-tuning and Performance
- Fine-tuning the Anole-7B model on Zebra-CoT improved accuracy from 4.2% to 16.9%, a fourfold increase, with notable improvements on visual logic benchmarks [14].
- The Bagel-7B model, after fine-tuning, demonstrated the ability to generate high-quality interleaved visual reasoning chains, showcasing the dataset's effectiveness in developing multi-modal reasoning capabilities [14].

Limitations
- Despite its strengths, the dataset relies on template generation for synthetic data, which may limit the diversity and expressiveness of text reasoning [18].
- Some sub-tasks within the dataset have a small sample size, potentially affecting model performance in those areas [18].
- Model fine-tuning results may vary, with some tasks showing insignificant or even decreased performance, indicating a need for further optimization [18].
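To make the "interleaved text-image reasoning trajectory" format concrete, here is a hedged sketch of what a single training sample could look like as a Python data structure; the field names, class names, and example content are assumptions for illustration and are not Zebra-CoT's actual schema.

```python
# Hedged sketch of an interleaved text-image reasoning sample, in the spirit
# of Zebra-CoT. Field names and contents are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ImageStep:
    path: str          # e.g., an intermediate diagram drawn during reasoning
    caption: str

@dataclass
class TextStep:
    text: str

@dataclass
class ZebraCoTSample:
    task_category: str                              # e.g., "2D visual reasoning"
    question: str
    trajectory: List[Union[TextStep, ImageStep]] = field(default_factory=list)
    answer: str = ""

sample = ZebraCoTSample(
    task_category="visual logic and strategy games",
    question="Which path reaches the maze exit fastest?",
    trajectory=[
        TextStep("Mark the current position and candidate paths."),
        ImageStep("step_1_annotated_maze.png", "Maze with candidate paths highlighted"),
        TextStep("The left path dead-ends; the right path reaches the exit in 4 moves."),
    ],
    answer="Take the right path.",
)
print(len(sample.trajectory))  # 3 interleaved reasoning steps
```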
Meituan proposes a new multimodal reasoning paradigm: a non-traditional RL+SFT ordering breaks through conventional training bottlenecks
量子位· 2025-07-21 04:23
Core Viewpoint
- The article discusses the Metis-RISE framework developed by researchers from Meituan, which combines Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) in a novel way to enhance the reasoning capabilities of Multimodal Large Language Models (MLLMs) [1][2].

Summary by Sections

Introduction of Metis-RISE Framework
- The Metis-RISE framework integrates RL and SFT in a non-traditional sequence to effectively improve MLLMs' reasoning abilities [2][3].

Training Methodology
- The training process consists of two phases (a schematic of this two-phase loop follows after this summary):
  - Phase 1 focuses on RL incentives, allowing the model to explore freely and activate its potential [6].
  - Phase 2 employs SFT to address specific weaknesses identified during the RL phase [7][8].

Performance Results
- The models developed, Metis-RISE-7B and Metis-RISE-72B, achieved impressive scores on the OpenCompass multimodal reasoning leaderboard, with the 72B model ranking fourth overall [3][14].
- Metis-RISE-72B achieved an average score of 56.6, outperforming several proprietary models and demonstrating its competitive edge [13][14].

Comparative Analysis
- The performance of Metis-RISE models was compared against proprietary models and open-source models, showing superior results, particularly in the >10B parameter category [11][12][13].

Ablation Studies
- Detailed ablation studies indicated that the RL phase significantly improved the model's performance, with average scores increasing from 39.2 to 44.0 after applying RL [15][16].

Qualitative Analysis
- Observations during the RL phase revealed a consistent increase in accuracy rewards and response lengths, indicating improved reasoning clarity as training progressed [17].

Future Directions
- The team plans to continue exploring iterative applications of RL and SFT to further enhance reasoning capabilities and develop model-based validators for more complex reasoning scenarios [18].
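The two-phase ordering described under Training Methodology (RL-driven exploration first, then SFT targeted at the weaknesses the RL phase exposes) can be summarized as a small training loop. The toy sketch below is a hedged schematic of that control flow only; the task setup, reward, and "model" are deliberately simplified stand-ins and not Metis-RISE's actual training code.

```python
# Hedged schematic of an RL-first, SFT-second ordering: phase 1 reinforces
# sampled answers that earn a verifiable reward; phase 2 applies SFT only to
# tasks the RL phase left unsolved. Everything here is a toy stand-in.
import random

random.seed(0)
tasks = {f"q{i}": i % 5 for i in range(20)}        # task -> ground-truth label
model = {q: None for q in tasks}                   # model's current answer per task

def sample_answers(model, q, k=8):
    """Stand-in rollout: guess near the remembered answer, or randomly."""
    base = model[q] if model[q] is not None else random.randint(0, 4)
    return [max(0, base + random.choice([-1, 0, 1])) for _ in range(k)]

# Phase 1 (RL incentive): keep any sampled answer that earns reward.
for q, label in tasks.items():
    for a in sample_answers(model, q):
        if a == label:                             # verifiable reward signal
            model[q] = a
            break

# Identify weaknesses: tasks where exploration never found a rewarded answer.
weak = [q for q in tasks if model[q] != tasks[q]]

# Phase 2 (SFT): supervise directly on expert labels for the weak tasks only.
for q in weak:
    model[q] = tasks[q]

print(f"solved by RL: {len(tasks) - len(weak)}, patched by SFT: {len(weak)}")
```

The design point the ordering illustrates is that RL exploration both improves the model and reveals exactly where supervised data is still needed, so the later SFT pass can be focused rather than blanket.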
Perception error rate cut by 30.5%: an implicit perception loss makes the model actively "open its eyes" | UIUC & Alibaba Tongyi
量子位· 2025-07-11 04:00
Core Viewpoint
- The article discusses the introduction of a new reinforcement learning algorithm called PAPO (Perception-Aware Policy Optimization), developed by the University of Illinois Urbana-Champaign and Alibaba's Tongyi Laboratory, which focuses on enhancing multimodal reasoning by integrating perception into the learning process [1][3].

Group 1: Introduction of PAPO
- PAPO aims to address the limitations of existing reinforcement learning algorithms like GRPO, which excel at text reasoning but struggle in multimodal scenarios because they make inadequate use of visual information [2][3].
- The algorithm introduces an innovative implicit perception loss that relies on internal supervisory signals, allowing multimodal models to learn perception alongside reasoning [3][6].

Group 2: Error Analysis and Findings
- A systematic error analysis revealed that the primary issue in multimodal reasoning is the accuracy of visual perception, rather than logical reasoning capabilities [6][7].
- Analysis of 200 error cases from the Qwen2.5-VL-3B model trained with GRPO showed that 67% of errors were due to perception inaccuracies, while only 18% were due to reasoning errors [9][14].

Group 3: Technical Innovations of PAPO
- PAPO's core innovations include a perception information gain ratio and the maximization of a KL divergence that encourages different output distributions for original and damaged images [19][20].
- The complete objective function for PAPO is presented as a simple extension of GRPO, integrating the KL divergence term (a sketch of this objective follows after this summary) [21].

Group 4: Experimental Validation
- Comprehensive evaluations on eight multimodal reasoning benchmarks demonstrated that PAPO consistently outperformed GRPO, achieving an overall average improvement of 4.4% and a significant 30.5% reduction in perception errors [26][28].
- PAPO exhibited faster convergence and more stable training dynamics compared to GRPO, starting to show improvements as early as 25 training steps [29][30].

Group 5: Visual Dependency Analysis
- Analysis of visual dependency in mainstream multimodal reasoning benchmarks indicated that many tasks can be answered correctly without visual input, i.e., they have low visual dependency [50][51].
- PAPO showed the most significant improvements on high-visual-dependency tasks, with nearly an 8% enhancement, while maintaining consistent improvements across medium- and low-dependency tasks [53][54].

Group 6: Practical Applications
- Several practical application cases illustrate PAPO's effectiveness on complex geometric problems, such as accurately calculating relationships in right triangles and distinguishing between different objects [55][63][64].
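Group 3 describes PAPO's objective as GRPO plus an implicit perception term that maximizes the divergence between the policy's output distribution on the original image and on a corrupted (e.g., masked) copy. A hedged sketch of that form is given below; the weighting coefficient, the mask operator, and the exact notation are assumptions rather than the paper's precise formulation.

```latex
% Hedged sketch: GRPO objective plus an implicit perception (KL) term.
% \gamma, \mathrm{mask}(\cdot), and the notation are assumptions.
\mathcal{J}_{\mathrm{PAPO}}(\theta)
  = \mathcal{J}_{\mathrm{GRPO}}(\theta)
  + \gamma \,\mathbb{E}_{(q, I)}\!\left[
      D_{\mathrm{KL}}\!\big(
        \pi_\theta(\cdot \mid q, I)
        \;\Vert\;
        \pi_\theta(\cdot \mid q, \mathrm{mask}(I))
      \big)
    \right]
```

Here $q$ is the text query and $I$ the input image; rewarding a large divergence between the two conditionals penalizes the policy for ignoring the image, which is the "implicit perception loss" described above.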
Farewell to data "noise": UCSD's new large-model reasoning method DreamPRM acts as a "signal amplifier" and tops the MathVista leaderboard
机器之心· 2025-07-10 10:49
Core Viewpoint
- DreamPRM, developed by a research team from the University of California, San Diego, has achieved the top position on the MathVista mathematical reasoning leaderboard, showcasing significant advances in multimodal reasoning capabilities [1][6][22].

Summary by Sections

Introduction
- DreamPRM utilizes a bi-level optimization framework to enhance the reasoning abilities of multimodal large language models (MLLMs) by addressing challenges such as data quality imbalance and distribution shift [2][12].

Methodology
- The core innovation of DreamPRM lies in casting the training of the process reward model (PRM) as a differentiable bi-level optimization problem, dynamically adjusting domain weights to mitigate issues in multimodal reasoning (a sketch of this formulation follows after this summary) [12][22].
- The lower-level optimization trains PRM parameters across 15 diverse training domains, assigning dynamic weights that reflect each domain's contribution to the overall loss function [13][14].
- The upper-level optimization employs a carefully constructed metadata set covering 30 disciplines and 183 subfields to evaluate the generalization capability of the PRM [12][14].

Performance Results
- DreamPRM demonstrated superior performance across five benchmark tests, consistently outperforming other PRM methods by 2-3% compared to the original PRM without data selection [16][22].
- The model, with only 8 billion parameters, outperformed larger closed-source models such as GPT-4v and Gemini-1.5 on most benchmarks, indicating strong reasoning capability [16][22].
- The accuracy of DreamPRM improves as the number of candidate reasoning chains (CoTs) increases, with further gains observed when it is applied to stronger models such as GPT-4.1-mini and o4-mini [19][20].

Conclusion
- DreamPRM effectively addresses the challenges of data quality imbalance and distribution shift in training multimodal process reward models, achieving notable improvements in performance, particularly on complex mathematical reasoning tasks [22].
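The Methodology bullets describe a bi-level structure: the lower level fits PRM parameters on 15 training domains whose losses are mixed by learnable domain weights, and the upper level adjusts those weights so the resulting PRM generalizes to the held-out metadata set. A hedged sketch in standard bi-level notation (symbols are assumptions, not the paper's exact definitions) is:

```latex
% Hedged bi-level sketch: domain weights w are tuned against a meta set,
% while PRM parameters \theta are trained on weighted domain losses.
\min_{w \ge 0} \ \mathcal{L}_{\mathrm{meta}}\big(\theta^{*}(w)\big)
\quad \text{s.t.} \quad
\theta^{*}(w) = \arg\min_{\theta} \sum_{d=1}^{15} w_d \, \mathcal{L}_d(\theta)
```

Here $\mathcal{L}_d$ denotes the PRM training loss on domain $d$ and $\mathcal{L}_{\mathrm{meta}}$ is evaluated on the metadata set; differentiating through the lower-level solution is what lets noisy or low-value domains be down-weighted automatically, the "signal amplifier" effect the title refers to.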