Multimodal Large Language Models (MLLMs)
ICCV 2025 | ECD: a high-quality synthetic chart dataset that improves chart understanding in open-source MLLMs
机器之心· 2025-08-21 13:08
The paper's first author is 杨昱威 of the Australian National University; collaborators include 章泽宇 (Australian National University), 侯云钟 (Australian National University), 李卓婉 (Johns Hopkins University), Gaowen Liu (Cisco), Ali Payani (Cisco), 丁源森 (Ohio State University), and 郑良 (Australian National University).

Background and motivation
In scientific research, news reporting, and data analysis, charts are a core vehicle of information. For multimodal large language models (MLLMs) to genuinely serve scientific research, they must have two capabilities:
1. Accurate recognition and understanding of chart elements (such as axes, legends, data points, and titles);
2. Deep reasoning over chart data (such as computing differences, comparing trends, and reasoning across subplots).
However, even the strongest open-source MLLMs still hover at 30%–50% accuracy on challenging scientific chart understanding benchmarks. And although synthetic datasets are easy to generate, they typically suffer from:
- Monotonous style: a lack of visual and content diversity;
- Limited realism: a large distribution gap from real charts;
- Restricted data patterns: the generated chart data is too simple to reflect complex scenarios.
(A minimal chart-synthesis sketch is given after this summary.)

Dataset highlights
Paper title: Effective Training Data Synthesis for Improving MLLM Chart Understanding
Paper link: h ...
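The article does not reproduce the ECD generation code, but the core idea it describes, synthesizing charts with randomized styles while keeping exact ground-truth annotations for question answering, can be sketched as follows. This is a minimal illustration, not the authors' pipeline; every function name, style choice, and the Q&A template are assumptions.

```python
# Minimal sketch of style-randomized synthetic chart generation with an exact
# Q&A annotation derived from the underlying data. Illustrative only -- this is
# not the ECD pipeline; all names and heuristics here are assumptions.
import json
import random

import matplotlib.pyplot as plt
import numpy as np


def synthesize_chart(path: str, rng: np.random.Generator) -> dict:
    """Render one randomized line chart and return a simple Q&A annotation."""
    n_series = int(rng.integers(2, 5))
    n_points = int(rng.integers(5, 12))
    x = np.arange(n_points)
    ys = rng.uniform(0, 100, size=(n_series, n_points))

    fig, ax = plt.subplots(figsize=rng.uniform(4, 8, size=2))
    for i, y in enumerate(ys):
        ax.plot(x, y,
                marker=random.choice(["o", "s", "^", None]),
                linestyle=random.choice(["-", "--", ":"]),
                label=f"Series {i + 1}")
    ax.set_title("Synthetic chart")
    ax.set_xlabel("Step")
    ax.set_ylabel("Value")
    ax.legend()
    fig.savefig(path, dpi=100)
    plt.close(fig)

    # The answer is known exactly because we generated the data ourselves.
    peak_series = int(ys.max(axis=1).argmax())
    return {
        "image": path,
        "question": "Which series reaches the highest value?",
        "answer": f"Series {peak_series + 1}",
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    annotations = [synthesize_chart(f"chart_{i}.png", rng) for i in range(3)]
    print(json.dumps(annotations, indent=2))
```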
X-SAM: from "segment anything" to "any segmentation": a unified multimodal image segmentation model that reaches SoTA on 20+ image segmentation datasets
机器之心· 2025-08-19 06:33
This work was jointly completed by Sun Yat-sen University, Peng Cheng Laboratory, and Meituan. The first author, 王豪, is a PhD student at Sun Yat-sen University whose research covers image and video segmentation, open-world visual perception, and multimodal large models. The co-corresponding authors are Professor 梁小丹 and Associate Researcher 蓝湘源.

Background and motivation
Segment Anything Model (SAM), as a foundation segmentation model, excels at generating dense segmentation masks, but its reliance on visual prompts as the sole input mode limits its applicability across the broader range of image segmentation tasks (the short SAM snippet after this summary illustrates this prompt dependence). Multimodal large language models (MLLMs), while strong at image captioning and visual question answering, are limited to text output and cannot directly handle pixel-level visual tasks, a fundamental limitation that hinders the development of general-purpose models.

Sun Yat-sen University, Peng Cheng Laboratory, and Meituan propose X-SAM, a unified multimodal image segmentation model that extends the segmentation paradigm from "segment anything" to "any segmentation". X-SAM introduces a unified framework that equips MLLMs with advanced pixel-level perception and understanding. The team also proposes a new task, Visual Grounded Segmentation (VGS), which segments all instance objects via interactive visual prompts, giving MLLMs grounded pixel-level understanding. To support effective training on diverse data sources, X-SAM adopts a unified training strategy that supports cross- ...
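For context on the "single input mode" limitation described above, the snippet below shows how vanilla SAM is driven by an explicit visual prompt (a single foreground click) via the public segment-anything API. It illustrates the baseline behavior only, not X-SAM; the checkpoint path and the zero-filled stand-in image are placeholders.

```python
# Vanilla SAM needs an explicit visual prompt (here, one click) to produce
# masks -- the single input mode that X-SAM is said to generalize beyond.
# The checkpoint path and the blank image below are placeholders.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # one foreground click (x, y)
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,
)
print(masks.shape, scores)  # (3, 480, 640) candidate masks with confidence scores
```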
Latest from 穆尧's team! RoboTwin 2.0: a scalable data benchmark for robust bimanual manipulation
自动驾驶之心· 2025-06-24 12:41
Core Insights
- The article discusses the development of RoboTwin 2.0, a scalable data generation framework aimed at enhancing bimanual robotic manipulation through robust domain randomization and automated expert data generation [2][6][18].

Group 1: Motivation and Challenges
- Existing synthetic datasets for bimanual robotic manipulation are insufficient, facing challenges such as a lack of efficient data generation methods for new tasks and overly simplified simulation environments [2][5].
- RoboTwin 2.0 addresses these challenges by providing a scalable simulation framework that supports automatic, large-scale generation of diverse and realistic data [2][6].

Group 2: Key Components of RoboTwin 2.0
- RoboTwin 2.0 integrates three key components: an automated expert data generation pipeline, comprehensive domain randomization, and entity-aware adaptation for diverse robotic platforms [6][18].
- The automated expert data generation pipeline utilizes multimodal large language models (MLLMs) and simulation feedback to iteratively optimize task-execution code [10][12].

Group 3: Domain Randomization
- Domain randomization is applied across five dimensions: clutter, background texture, lighting conditions, desktop height, and diverse language instructions, enhancing the robustness of strategies against environmental variability [12][13] (see the configuration-sampling sketch after this summary).
- The framework generates a large object library (RoboTwin-OD) with 731 instances across 147 categories, each annotated with semantic and operational labels [3][18].

Group 4: Data Collection and Benchmarking
- Over 100,000 dual-arm operation trajectories were collected across 50 tasks, supporting extensive benchmarking and evaluation of robotic strategies [24][22].
- The framework allows for flexible entity configurations, ensuring compatibility with diverse hardware setups and promoting scalability for future robotic platforms [20][22].

Group 5: Experimental Analysis
- Evaluations demonstrated that RoboTwin 2.0 significantly improves the success rates of robotic tasks, particularly for low-degree-of-freedom platforms, with average increases of 8.3% in task success rates [29][31].
- The framework's data enhances the generalization capabilities of models, showing substantial improvements in performance when tested in unseen scenarios [32][34].
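The domain-randomization dimensions listed in Group 3 can be pictured as sampling a fresh scene configuration per episode. The sketch below is a schematic illustration under assumed field names and value ranges; it is not RoboTwin 2.0's actual configuration schema.

```python
# Minimal sketch of sampling a domain-randomized scene configuration over the
# five dimensions the summary lists (clutter, background texture, lighting,
# tabletop height, language instruction). All field names and value ranges are
# illustrative assumptions, not RoboTwin 2.0's real schema.
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    num_clutter_objects: int   # distractor objects placed around the targets
    background_texture: str    # texture asset applied to walls / floor
    light_intensity: float     # overall illumination scale
    table_height_m: float      # tabletop height in meters
    instruction: str           # natural-language task instruction variant


INSTRUCTION_TEMPLATES = [
    "Pick up the {obj} with the left arm and hand it to the right arm.",
    "Use both arms to move the {obj} onto the tray.",
    "Grasp the {obj}, then place it in the container.",
]


def sample_scene(target_object: str, rng: random.Random) -> SceneConfig:
    """Draw one randomized configuration for a single training episode."""
    return SceneConfig(
        num_clutter_objects=rng.randint(0, 8),
        background_texture=rng.choice(["wood", "marble", "fabric", "metal"]),
        light_intensity=rng.uniform(0.4, 1.6),
        table_height_m=rng.uniform(0.70, 0.85),
        instruction=rng.choice(INSTRUCTION_TEMPLATES).format(obj=target_object),
    )


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(sample_scene("mug", rng))
```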
Fine-grained visual reasoning chains come to mathematics: accuracy jumps 32% as CUHK MMLab breaks the multimodal math reasoning bottleneck
量子位· 2025-06-16 10:30
Core Viewpoint
- The article discusses the introduction of MINT-CoT, a new visual reasoning framework designed to enhance multimodal mathematical reasoning by addressing the limitations of traditional Chain-of-Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning
- Traditional CoT methods struggle to integrate visual information in mathematical contexts due to three main bottlenecks:
  1. Coarse-grained image region selection: most methods rely on bounding boxes that may include irrelevant information [4].
  2. Visual encoders that are not trained to understand mathematical images, leading to poor perception of mathematical content [5].
  3. Over-reliance on external functionality, which increases training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework
- MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight visually interleaved CoT reasoning method designed specifically for mathematical reasoning. It introduces an Interleave Token that dynamically selects relevant visual tokens during the reasoning process, enabling true "text-visual joint reasoning" [9] (a schematic sketch of this selection step follows this summary).
- The MINT-CoT dataset consists of 54,000 visually interleaved reasoning samples, providing alignment between reasoning steps and the corresponding image tokens [11].

Group 3: Training Strategy
- A three-stage training strategy strengthens the framework's visually interleaved reasoning:
  1. Text-only CoT fine-tuning to establish a foundation of general reasoning formats.
  2. Interleaved-modality CoT fine-tuning to teach the model to insert visual content at appropriate points during reasoning.
  3. Interleaved-modality CoT reinforcement learning to optimize visual content selection and reasoning strategies [13].

Group 4: Experimental Results
- The MINT-CoT-7B model, built on the multimodal large model Qwen-VL-7B, delivers superior performance on mathematical visual reasoning tasks, improving over baseline models by +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, establishing a new benchmark in the field [16].
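The Interleave Token idea described in Group 2, dynamically picking the image-patch tokens relevant to the current reasoning step, can be sketched at the tensor level as follows. This is a schematic re-implementation of the idea only; the cosine-similarity criterion and the threshold are assumptions, not MINT-CoT's actual selection rule.

```python
# Schematic sketch of token-level visual selection at an interleave position:
# when the decoder emits the interleave token, pick the image-patch tokens most
# similar to the current hidden state and splice them into the reasoning
# context. Not MINT-CoT's code; the similarity rule and threshold are assumed.
import torch


def select_visual_tokens(hidden_state: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         threshold: float = 0.5) -> torch.Tensor:
    """hidden_state: (d,) decoder state at the interleave token.
    visual_tokens: (num_patches, d) projected image-patch embeddings.
    Returns the subset of patch embeddings to append to the context."""
    sims = torch.nn.functional.cosine_similarity(
        visual_tokens, hidden_state.unsqueeze(0), dim=-1)  # (num_patches,)
    keep = sims > threshold
    if not keep.any():                  # always keep at least the best patch
        keep = sims == sims.max()
    return visual_tokens[keep]


# Toy usage with random tensors standing in for real model states.
d, num_patches = 64, 196
hidden = torch.randn(d)
patches = torch.randn(num_patches, d)
selected = select_visual_tokens(hidden, patches)
print(selected.shape)  # (k, 64): k patches interleaved into the CoT
```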
ICML 2025 Spotlight | Professor 陶大程's team at Nanyang Technological University and collaborators propose a RAG-based high-resolution image perception framework, improving accuracy by 20%
机器之心· 2025-05-16 16:31
Core Viewpoint
- The article discusses the development of Retrieval-Augmented Perception (RAP), a method that enhances multimodal large language models (MLLMs) for high-resolution image perception without requiring training [3][29].

Group 1: Challenges in High-Resolution Image Processing
- Traditional MLLMs struggle with high-resolution images, often losing visual information because of fixed input resolutions [1][2].
- Current solutions include cropping high-resolution images into smaller segments, using visual encoders that handle higher resolutions, and search-based methods that construct tree structures for image retrieval [2][3].

Group 2: Introduction of RAP
- RAP leverages retrieval-augmented generation (RAG) techniques to improve MLLMs' perception of high-resolution images [3][29].
- The method has been accepted at ICML 2025 as a Spotlight paper, indicating its significance in the field [3].

Group 3: Experimental Findings
- The research examines the layout of retrieved image segments, the impact of the number of segments on performance, and how to apply RAG effectively in MLLMs [6][11].
- Maintaining the relative positions of retrieved image segments is crucial, especially for tasks requiring spatial awareness [10][15].
- The number of retrieved segments affects performance differently across tasks: fewer segments benefit single-instance perception tasks (FSP), while multi-instance perception tasks (FCP) need more [14][24].

Group 4: Methodology of RAP
- RAP employs a Spatial-Awareness Layout algorithm to preserve the relative positions of image segments while reducing resolution [16][19] (a sketch of the retrieval-and-layout idea follows this summary).
- The RE-Search component adapts the number of retained segments based on similarity scores and model confidence, enhancing overall performance [20][22].

Group 5: Performance Results
- Experimental results show that RAP significantly improves performance on high-resolution image perception tasks, achieving up to a 21% accuracy improvement on the HR-Bench datasets [25][26].
- Compared with other methods, RAP demonstrates superior throughput and accuracy, outperforming existing search-based approaches [27].
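The retrieval-plus-layout idea in Groups 3 and 4, ranking high-resolution image tiles against the question and keeping the top-k while preserving their relative grid positions, can be sketched as follows. The tiling scheme and the placeholder embed_tile / embed_text encoders are assumptions for illustration; this is not the RAP or RE-Search implementation.

```python
# Sketch of retrieval-augmented perception on a high-resolution image: tile the
# image, rank tiles by similarity to the question, keep the top-k, and preserve
# their relative grid positions when composing the reduced input. embed_tile /
# embed_text are placeholders for a CLIP-style encoder; not the RAP code.
from typing import Callable, List, Tuple

import numpy as np


def retrieve_tiles(image: np.ndarray,
                   question: str,
                   embed_tile: Callable[[np.ndarray], np.ndarray],
                   embed_text: Callable[[str], np.ndarray],
                   tile: int = 336,
                   top_k: int = 4) -> List[Tuple[int, int, np.ndarray]]:
    """Return the top_k tiles as (row, col, pixels), ordered by grid position
    so that the relative layout of the retained crops is preserved."""
    h, w, _ = image.shape
    q = embed_text(question)
    scored = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            crop = image[r:r + tile, c:c + tile]
            v = embed_tile(crop)
            sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((sim, r // tile, c // tile, crop))
    scored.sort(key=lambda t: t[0], reverse=True)
    kept = scored[:top_k]
    # Spatial-awareness: order the survivors by (row, col), not by score.
    kept.sort(key=lambda t: (t[1], t[2]))
    return [(r, c, crop) for _, r, c, crop in kept]


# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
fake_embed = lambda _: rng.standard_normal(512)
tiles = retrieve_tiles(rng.integers(0, 255, (1344, 1344, 3), dtype=np.uint8),
                       "Where is the red car?", fake_embed, fake_embed)
print([(r, c) for r, c, _ in tiles])
```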
Submission deadline approaching! CVPR 2025 workshop to discuss the robustness challenges of "Foundation Models + X"
量子位· 2025-03-08 03:35
Core Viewpoint
- The article discusses the upcoming CVPR 2025 conference, focusing on the fifth workshop on adversarial machine learning, which will explore the robustness challenges of foundation models and their applications in specific fields [1][2].

Group 1: Workshop Details
- The fifth workshop on adversarial machine learning will be held from June 11 to June 15, 2025, in Tennessee, USA, organized by institutions including Beihang University and Nanyang Technological University [1].
- The workshop's theme is "Foundation Models + X," emphasizing the robustness challenges of foundation models (FM) and their domain-specific applications (XFM) [2].

Group 2: Research Focus
- Foundation models have transformed multiple fields, including computer vision, but their domain-specific adaptations (XFM) remain vulnerable to adversarial attacks, which can lead to critical failures in safety-sensitive applications such as autonomous driving and medical diagnostics [2][4].
- The workshop invites submissions on topics including the robustness of X-domain-specific foundation models and adversarial attacks for social good [3][4].

Group 3: Competition Announcement
- A competition held during the workshop will focus on adversarial attacks against multimodal large language models (MLLMs), encouraging participants to design harmful image-text pairs [7].
- The competition will consist of preliminary and final rounds, with participants tasked to create adversarial pairs that trigger unsafe outputs from MLLMs [7].