Multimodal Large Language Models (MLLMs)

Writing Captions for Geometry Images Makes AI Smarter: UIUC Releases a High-Quality, Generalizable Geometry Dataset
机器之心· 2025-09-25 23:54
Core Viewpoint
- The article discusses advances in multimodal large language models (MLLMs) and introduces Geo-Image-Textualization, a new framework that addresses limitations in geometric reasoning tasks by ensuring complete alignment between visual and textual information [1][21].

Group 1: Framework and Dataset
- A research team from UIUC has proposed Geo-Image-Textualization, a reinforcement learning-based data generation and optimization framework, together with GeoReasoning-10K, the first fully aligned, high-quality geometric image-text dataset, containing 10,000 carefully constructed image-description pairs [2][3].
- The GeoReasoning-10K dataset and related code have been released publicly to support community development [3][5].

Group 2: Innovations and Performance
- The framework's core innovation is a generation process for image-caption-question/answer pairs, which improves the model's performance on geometric reasoning tasks (a hedged sketch of such aligned records appears after this summary) [6][8].
- The trained model generalizes well, performing strongly not only on geometric tasks but also on arithmetic, algebra, and numerical reasoning, even with non-geometric image inputs [8].
- Models trained on GeoReasoning outperform those trained on comparable datasets in downstream tasks and scale well with more data [8][12].

Group 3: Experimental Results
- On the authoritative mathematical reasoning benchmarks MathVista and MathVerse, GeoReasoning-10K achieved the best results among geometric captioning datasets, demonstrating superior data quality and extensibility [12][14].
- The article presents specific examples from the MathVista benchmark illustrating the model's ability to solve complex geometric problems [16][21].

Group 4: Future Implications
- The Geo-Image-Textualization framework and the GeoReasoning-10K dataset offer a new route past the bottlenecks in geometric reasoning, strengthening the overall mathematical reasoning capabilities of AI models and paving the way for applications in education and scientific computation [21][22].
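The summary above centers on image-caption-question/answer records in which text and figure must stay fully aligned. As a rough illustration of that alignment idea, not the paper's actual pipeline, here is a minimal Python sketch where the caption, question, and answer are all derived from the same symbolic figure specification; the record fields and the right-triangle example are assumptions.

```python
# Minimal sketch (assumption: record layout, field names, and the right-triangle
# example are illustrative; this is not the GeoReasoning-10K generation pipeline).
import math
import random
from dataclasses import dataclass

@dataclass
class GeoRecord:
    image_spec: dict      # symbolic description used to render the figure
    caption: str          # caption fully aligned with the rendered figure
    question: str
    answer: str

def make_triangle_record(rng: random.Random) -> GeoRecord:
    # Sample a right triangle by its two legs; caption and answer come from the
    # same symbolic parameters, so the text cannot drift from the image.
    a, b = rng.randint(3, 12), rng.randint(3, 12)
    c = math.hypot(a, b)
    spec = {"shape": "right_triangle", "legs": [a, b]}
    caption = f"A right triangle with legs of length {a} and {b}."
    question = "What is the length of the hypotenuse?"
    answer = f"{c:.2f}"
    return GeoRecord(spec, caption, question, answer)

if __name__ == "__main__":
    rng = random.Random(0)
    for rec in (make_triangle_record(rng) for _ in range(3)):
        print(rec.caption, "|", rec.question, "->", rec.answer)
```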
Latest from XJTLU & HKUST: A Survey of Foundation Models for Trajectory Prediction
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The article surveys the application of large language models (LLMs) and multimodal large language models (MLLMs) to autonomous-driving trajectory prediction, a paradigm shift that deepens the understanding of complex traffic scenarios to improve safety and efficiency [1][20].

Summary by Sections

Introduction and Overview
- Integrating LLMs into autonomous driving systems allows a deeper understanding of traffic scenarios, marking a transition from traditional methods to approaches based on large foundation models (LFMs) [1].
- Trajectory prediction is identified as a core technology in autonomous driving, using historical data and contextual information to infer the future movements of traffic participants [5].

Traditional Methods and Challenges
- Traditional vehicle trajectory prediction methods include physics-based approaches (e.g., Kalman filters) and machine learning methods (e.g., Gaussian processes), both of which struggle with complex interactions [8].
- Deep learning methods improve long-term prediction accuracy but face high computational demands and poor interpretability [9].
- Reinforcement learning methods excel at modeling interactive scenes but are complex to train and unstable [9].

LLM-Based Vehicle Trajectory Prediction
- LFMs introduce a paradigm shift by discretizing continuous motion states into symbolic sequences, leveraging the semantic modeling capabilities of LLMs [11].
- Key applications of LLMs include trajectory-language mapping, multimodal fusion, and constraint-based reasoning, which improve interpretability and robustness in long-tail scenarios [11][13].

Evaluation Metrics and Datasets
- The article categorizes datasets for pedestrian and vehicle trajectory prediction, highlighting the importance of datasets such as Waymo and ETH/UCY for evaluating model performance [16].
- Evaluation metrics for vehicles include L2 distance and collision rate, while pedestrian metrics focus on minADE and minFDE (see the sketch after this summary) [17].

Performance Comparison
- A performance comparison of various models on the nuScenes dataset shows that LLM-based methods significantly reduce collision rates and improve long-term prediction accuracy [18].

Discussion and Future Directions
- The widespread adoption of LFMs marks a shift from local pattern matching to global semantic understanding, improving the safety and compliance of generated trajectories [20].
- Future research should focus on low-latency inference techniques, motion-oriented foundation models, and world-perception and causal-reasoning models [21].
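Since the summary names minADE and minFDE as the key pedestrian metrics (and L2 distance for vehicles), a short sketch of how they are typically computed may help; the array shapes and the best-of-K convention below are assumptions rather than any particular benchmark's API.

```python
# Minimal sketch of minADE / minFDE. Assumption: predictions have shape (K, T, 2)
# for K sampled future trajectories over T steps; ground truth has shape (T, 2).
import numpy as np

def min_ade_fde(preds: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    # Per-step Euclidean (L2) distance between each predicted trajectory and the ground truth.
    dists = np.linalg.norm(preds - gt[None], axis=-1)   # (K, T)
    ade = dists.mean(axis=1)                            # average displacement per sample
    fde = dists[:, -1]                                  # final displacement per sample
    return float(ade.min()), float(fde.min())           # best-of-K, as in minADE / minFDE

if __name__ == "__main__":
    gt = np.stack([np.linspace(0, 10, 12), np.zeros(12)], axis=-1)   # straight 12-step path
    preds = gt[None] + np.random.default_rng(0).normal(0, 0.5, (6, 12, 2))
    print(min_ade_fde(preds, gt))
```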
ICCV 2025 | ECD: A High-Quality Synthetic Chart Dataset That Improves Open-Source MLLMs' Chart Understanding
机器之心· 2025-08-21 13:08
Core Viewpoint
- The article presents the Effective Chart Dataset (ECD), a high-quality synthetic chart dataset designed to improve chart understanding in multimodal large language models (MLLMs) [4][6][25].

Background and Motivation
- In fields such as scientific research and data analysis, charts are essential carriers of information. MLLMs must accurately identify and understand chart elements and perform deep reasoning over chart data, yet current MLLMs struggle with high-difficulty scientific chart understanding, reaching only 30%-50% accuracy [4][6].

Dataset Highlights
- ECD is a large-scale, high-quality synthetic chart dataset built with a modular data synthesis pipeline and accompanied by a comprehensive evaluation benchmark, ECDBench [6][10].
- ECD includes over 10,500 charts covering 25 themes and 29 chart types, with 252 subplot combinations, making it the most extensive dataset in its category [12][10].

Quality and Diversity
- The dataset contains over 300,000 question-answer pairs generated by GPT-4o, with confidence filtering used to ensure quality. Examples include both descriptive and reasoning questions about the charts [10][11].
- ECD achieves the lowest Fréchet Inception Distance (FID), indicating high visual similarity to real scientific charts, and a higher average pixel entropy than other synthetic datasets, suggesting greater complexity and information content (see the sketch after this summary) [13][10].

Data Synthesis Process
- The five-stage modular data synthesis pipeline covers single-chart generation, multi-subplot combination, visual diversity enhancement, image quality filtering, and question-answer pair generation [15][16].

Model Performance Comparison
- Fine-tuning on ECD significantly improves the performance of various open-source MLLMs. For instance, LLaVA-Next-Llama3-8B showed substantial gains across multiple test sets after training with ECD [17][23].

Evaluation Benchmark
- ECDBench is established as a high-quality benchmark for assessing MLLM performance before and after fine-tuning on ECD, with comprehensive statistics for model evaluation [21][25].

Conclusion
- ECD and ECDBench provide a solid foundation for advancing multimodal reasoning, scientific AI assistants, and automated chart generation, enhancing the ability of MLLMs to understand complex chart data [25].
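The "higher average pixel entropy" claim can be made concrete with a small sketch. Assuming pixel entropy here means the Shannon entropy of an 8-bit grayscale intensity histogram (one common reading, not necessarily the paper's exact definition):

```python
# Minimal sketch of per-image pixel entropy and its dataset average.
# Assumption: images are 8-bit grayscale numpy arrays; entropy is computed over the
# 256-bin intensity histogram.
import numpy as np

def pixel_entropy(img_gray_u8: np.ndarray) -> float:
    hist = np.bincount(img_gray_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())   # bits per pixel

def average_entropy(images: list[np.ndarray]) -> float:
    return float(np.mean([pixel_entropy(im) for im in images]))

if __name__ == "__main__":
    flat = np.full((64, 64), 128, dtype=np.uint8)                          # low-information image
    busy = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
    print(pixel_entropy(flat), pixel_entropy(busy), average_entropy([flat, busy]))
```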
X-SAM: From "Segment Anything" to "Any Segmentation": A Unified Multimodal Large Model for Image Segmentation, Reaching SoTA on 20+ Segmentation Datasets
机器之心· 2025-08-19 06:33
Core Viewpoint
- The article presents X-SAM, a unified multimodal large language model for image segmentation that extends existing models with pixel-level understanding and interaction through visual prompts [4][26].

Background and Motivation
- The Segment Anything Model (SAM) excels at dense segmentation mask generation but relies on a single input mode, which limits its applicability across segmentation tasks [4].
- Multimodal large language models (MLLMs) have shown promise in tasks such as image description and visual question answering, but they are fundamentally limited in handling pixel-level visual tasks, which constrains the development of generalized models [4].

Method Design
- X-SAM introduces a unified framework that extends the segmentation paradigm from "segment anything" to "any segmentation" by incorporating the visual grounded segmentation (VGS) task [4].
- The model employs a dual-projector architecture to enhance image understanding and a segmentation connector that supplies rich multi-scale features for segmentation (a hedged sketch of this idea follows this summary) [11][12].
- X-SAM uses a three-stage progressive training strategy, comprising segmentor fine-tuning, alignment pre-training, and mixed fine-tuning, to optimize performance across diverse image segmentation tasks [16][22].

Experimental Results
- X-SAM was evaluated on more than 20 segmentation datasets, achieving state-of-the-art performance across seven different image segmentation tasks [19].
- Its metrics show significant improvements over existing models on a range of segmentation tasks, demonstrating versatility and effectiveness [20][21].

Summary and Outlook
- X-SAM represents a significant advance in image segmentation and lays a foundation for future research on video segmentation and the integration of temporal information [26].
- Future directions include extending the model to video segmentation tasks, which could further advance video understanding [26].
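To make the dual-projector plus segmentation-connector design more tangible, here is a hedged PyTorch sketch: one projector maps vision features into the LLM token space, a second exposes multi-scale features for a mask decoder. All dimensions, layer choices, and names are illustrative assumptions, not X-SAM's actual implementation.

```python
# Rough sketch of a "dual projector" idea: language-side projection + segmentation connector.
import torch
import torch.nn as nn

class DualProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, seg_dim=256):
        super().__init__()
        # Projector 1: vision features -> LLM embedding space (language-side understanding).
        self.to_llm = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                    nn.Linear(llm_dim, llm_dim))
        # Projector 2 ("segmentation connector"): per-scale projections for a mask decoder.
        self.to_seg = nn.ModuleList(nn.Linear(vision_dim, seg_dim) for _ in range(3))

    def forward(self, feats_per_scale: list[torch.Tensor]):
        # feats_per_scale: list of (B, N_i, vision_dim) tensors from shallow to deep scales.
        llm_tokens = self.to_llm(feats_per_scale[-1])                 # (B, N, llm_dim)
        seg_feats = [proj(f) for proj, f in zip(self.to_seg, feats_per_scale)]
        return llm_tokens, seg_feats

if __name__ == "__main__":
    feats = [torch.randn(1, n, 1024) for n in (1024, 256, 64)]
    tokens, seg = DualProjector()(feats)
    print(tokens.shape, [s.shape for s in seg])
```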
Latest from Yao Mu's Team: RoboTwin 2.0, a Scalable Data Benchmark for Robust Bimanual Manipulation
自动驾驶之心· 2025-06-24 12:41
Core Insights
- The article presents RoboTwin 2.0, a scalable data generation framework for bimanual robotic manipulation that combines robust domain randomization with automated expert data generation [2][6][18].

Group 1: Motivation and Challenges
- Existing synthetic datasets for bimanual robotic manipulation are insufficient: they lack efficient data generation methods for new tasks and rely on overly simplified simulation environments [2][5].
- RoboTwin 2.0 addresses these challenges with a scalable simulation framework that supports the automatic, large-scale generation of diverse and realistic data [2][6].

Group 2: Key Components of RoboTwin 2.0
- RoboTwin 2.0 integrates three key components: an automated expert data generation pipeline, comprehensive domain randomization, and embodiment-aware adaptation for diverse robotic platforms [6][18].
- The automated expert data generation pipeline uses multimodal large language models (MLLMs) and simulation feedback to iteratively refine task execution code [10][12].

Group 3: Domain Randomization
- Domain randomization is applied across five dimensions: clutter, background texture, lighting conditions, tabletop height, and diverse language instructions, improving the robustness of policies to environmental variability (see the sketch after this summary) [12][13].
- The framework provides a large object library (RoboTwin-OD) with 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels [3][18].

Group 4: Data Collection and Benchmarking
- Over 100,000 dual-arm manipulation trajectories were collected across 50 tasks, supporting extensive benchmarking and evaluation of robotic policies [24][22].
- The framework allows flexible embodiment configurations, ensuring compatibility with diverse hardware setups and scalability to future robotic platforms [20][22].

Group 5: Experimental Analysis
- Evaluations show that RoboTwin 2.0 significantly improves task success rates, particularly for low-degree-of-freedom platforms, with an average increase of 8.3% [29][31].
- Data from the framework also improves model generalization, yielding substantial performance gains in unseen scenarios [32][34].
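A minimal sketch of sampling a randomized episode configuration over the five dimensions listed above; the value ranges, texture names, and instruction templates are placeholders, not RoboTwin 2.0's actual parameters or API.

```python
# Sketch of per-episode domain randomization across five dimensions
# (clutter, background texture, lighting, tabletop height, language instruction).
import random
from dataclasses import dataclass

@dataclass
class EpisodeConfig:
    num_clutter_objects: int
    background_texture: str
    light_intensity: float
    table_height_m: float
    instruction: str

TEXTURES = ["wood", "marble", "fabric", "metal"]
TEMPLATES = ["pick up the {obj} and hand it over",
             "grasp the {obj} with both arms",
             "move the {obj} to the left tray"]

def sample_config(rng: random.Random, target_object: str) -> EpisodeConfig:
    return EpisodeConfig(
        num_clutter_objects=rng.randint(0, 8),
        background_texture=rng.choice(TEXTURES),
        light_intensity=rng.uniform(0.3, 1.5),
        table_height_m=rng.uniform(0.70, 0.85),
        instruction=rng.choice(TEMPLATES).format(obj=target_object),
    )

if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_config(rng, "mug"))
```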
Fine-Grained Visual Reasoning Chains Brought to Mathematics: Accuracy Surges 32% as CUHK MMLab Breaks the Multimodal Math Reasoning Bottleneck
量子位· 2025-06-16 10:30
Core Viewpoint
- The article introduces MINT-CoT, a new visual reasoning framework designed to improve multimodal mathematical reasoning by addressing the limitations of traditional Chain-of-Thought (CoT) methods in handling visual information [1][3].

Group 1: Challenges in Mathematical Visual Reasoning
- Traditional CoT methods struggle to integrate visual information in mathematical settings because of three main bottlenecks:
  1. Coarse-grained image region selection: most methods rely on bounding boxes that may include irrelevant information [4].
  2. Visual encoders that are not trained on mathematical images, leading to poor perception of mathematical content [5].
  3. Over-reliance on external tools, which raises training and inference costs and reduces generalizability [6].

Group 2: MINT-CoT Framework
- MINT-CoT (Multimodal Interleaved Chain-of-Thought) is a fine-grained, lightweight, visually interleaved CoT method designed specifically for mathematical reasoning. Its key innovation is an Interleave Token that dynamically selects relevant visual tokens during reasoning, enabling genuine joint text-visual reasoning (see the sketch after this summary) [9].
- The MINT-CoT dataset consists of 54,000 visually interleaved reasoning samples that align each reasoning step with the corresponding image tokens [11].

Group 3: Training Strategy
- A three-stage training strategy builds up the framework's visually interleaved reasoning capability:
  1. Text-only CoT fine-tuning to establish a general reasoning format.
  2. Interleaved-modality CoT fine-tuning to teach the model to insert visual content at appropriate points during reasoning.
  3. Interleaved-modality CoT reinforcement learning to optimize visual content selection and reasoning strategies [13].

Group 4: Experimental Results
- MINT-CoT-7B, built on the multimodal large model Qwen-VL-7B, delivers superior performance on mathematical visual reasoning tasks, with significant gains over baselines: +32.59% on MathVista, +26.92% on GeoQA, and +23.2% on MMStar, setting a new benchmark in the field [16].
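The Interleave Token idea (at a given reasoning step, pick the most relevant visual tokens and splice them into the chain) can be sketched as follows; the cosine-similarity rule, top-k size, and tensor shapes are assumptions, not MINT-CoT's trained selection mechanism.

```python
# Rough sketch of per-step visual token selection for interleaved reasoning.
import torch
import torch.nn.functional as F

def select_visual_tokens(reasoning_state: torch.Tensor,
                         visual_tokens: torch.Tensor,
                         top_k: int = 8) -> torch.Tensor:
    # reasoning_state: (D,) hidden state at the Interleave Token position.
    # visual_tokens:   (N, D) patch-level tokens from the math image.
    sims = F.cosine_similarity(visual_tokens, reasoning_state[None], dim=-1)   # (N,)
    idx = sims.topk(min(top_k, visual_tokens.shape[0])).indices
    return visual_tokens[idx]          # tokens to interleave into the reasoning chain

if __name__ == "__main__":
    state = torch.randn(512)
    patches = torch.randn(196, 512)    # e.g., a 14x14 grid of patch tokens
    print(select_visual_tokens(state, patches).shape)
```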
ICML 2025 Spotlight | Prof. Dacheng Tao's Team at Nanyang Technological University and Collaborators Propose a RAG-Based High-Resolution Image Perception Framework, Raising Accuracy by 20%
机器之心· 2025-05-16 16:31
Core Viewpoint
- The article presents Retrieval-Augmented Perception (RAP), a training-free method that improves the high-resolution image perception of multimodal large language models (MLLMs) [3][29].

Group 1: Challenges in High-Resolution Image Processing
- Traditional MLLMs struggle with high-resolution images: fixed input resolutions often cause loss of visual information [1][2].
- Current workarounds include cropping high-resolution images into smaller segments, using visual encoders that handle higher resolutions, and search-based methods that build tree structures for image retrieval [2][3].

Group 2: Introduction of RAP
- RAP applies retrieval-augmented generation (RAG) techniques to improve MLLM perception of high-resolution images [3][29].
- The method has been accepted at ICML 2025 as a Spotlight paper, indicating its significance in the field [3].

Group 3: Experimental Findings
- The research examines how retrieved image segments should be laid out, how the number of segments affects performance, and how RAG can be applied effectively in MLLMs [6][11].
- Maintaining the relative positions of retrieved image segments is crucial, especially for tasks that require spatial awareness [10][15].
- The number of retrieved segments affects performance differently across tasks: fewer segments benefit fine-grained single-instance perception (FSP) tasks, while more segments are needed for fine-grained cross-instance perception (FCP) tasks [14][24].

Group 4: Methodology of RAP
- RAP uses a Spatial-Awareness Layout algorithm to preserve the relative positions of image segments while reducing resolution (see the sketch after this summary) [16][19].
- The RE-Search component adapts the number of retained segments based on similarity scores and model confidence, further improving performance [20][22].

Group 5: Performance Results
- Experimental results show that RAP substantially improves high-resolution image perception, achieving up to a 21% accuracy improvement on the HR-Bench datasets [25][26].
- Compared with other methods, RAP delivers superior throughput and accuracy, outperforming existing search-based approaches [27].
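The layout finding (keep only the highest-scoring crops but preserve their relative positions when reassembling the reduced image) can be illustrated with a short numpy sketch; the uniform crop grid, scoring interface, and stitching details are assumptions, not the paper's Spatial-Awareness Layout or RE-Search implementation.

```python
# Sketch: keep the top-k retrieved crops while preserving their relative grid positions.
import numpy as np

def keep_topk_preserve_layout(crops: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    # crops:  (R, C, h, w, 3) grid of image crops; scores: (R, C) retrieval scores.
    R, C, h, w, _ = crops.shape
    flat = np.argsort(scores.ravel())[::-1][:k]
    rows, cols = np.unravel_index(flat, (R, C))
    r0, r1, c0, c1 = rows.min(), rows.max(), cols.min(), cols.max()
    keep = np.zeros((R, C), dtype=bool)
    keep[rows, cols] = True
    # Blank out non-retrieved crops, then stitch the bounding sub-grid back together so the
    # retained crops keep their original relative positions at reduced overall resolution.
    canvas = np.where(keep[..., None, None, None], crops, 0)
    sub = canvas[r0:r1 + 1, c0:c1 + 1]
    return sub.transpose(0, 2, 1, 3, 4).reshape((r1 - r0 + 1) * h, (c1 - c0 + 1) * w, 3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    crops = rng.integers(0, 256, (4, 4, 32, 32, 3), dtype=np.uint8)
    scores = rng.random((4, 4))
    print(keep_topk_preserve_layout(crops, scores, k=5).shape)
```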
Call-for-Papers Countdown! CVPR 2025 Workshop on the Robustness Challenges of "Foundation Models + X"
量子位· 2025-03-08 03:35
Core Viewpoint
- The article previews the upcoming CVPR 2025 conference, focusing on the 5th Workshop on Adversarial Machine Learning, which will examine the robustness challenges of foundation models and their domain-specific applications [1][2].

Group 1: Workshop Details
- The 5th Workshop on Adversarial Machine Learning will be held during CVPR 2025, which runs June 11-15, 2025, in Tennessee, USA, and is organized by institutions including Beihang University and Nanyang Technological University [1].
- The workshop's theme is "Foundation Models + X," emphasizing the robustness challenges of foundation models (FMs) and their domain-specific adaptations (XFMs) [2].

Group 2: Research Focus
- Foundation models have transformed multiple fields, including computer vision, but their domain-specific adaptations (XFMs) remain vulnerable to adversarial attacks, which can cause critical failures in safety-sensitive applications such as autonomous driving and medical diagnosis [2][4].
- The workshop invites submissions on topics including the robustness of X-domain-specific foundation models and adversarial attacks for social good [3][4].

Group 3: Competition Announcement
- A competition held alongside the workshop focuses on adversarial attacks against multimodal large language models (MLLMs), challenging participants to design harmful image-text pairs [7].
- The competition consists of preliminary and final rounds, in which participants craft adversarial image-text pairs intended to trigger unsafe MLLM outputs [7].