Multimodal Large Language Models (MLLMs)
Audio-Visual Omni-Modal Future Prediction: FutureOmni Delivers the First Answer
机器之心· 2026-01-24 01:53
Fudan University, the Shanghai Innovation Institute, and the National University of Singapore jointly introduce FutureOmni, the first omni-modal future-prediction benchmark. It requires models to predict future events from audio-visual cues, exercising cross-modal causal and temporal reasoning. The benchmark contains 919 videos and 1,034 multiple-choice question-answer pairs; evaluation of 13 omni-modal models and 7 video-only models shows that current systems struggle markedly with future-event prediction, with a best accuracy of only 64.8%.

In everyday life, humans not only understand "what happened" but, more importantly, can predict "what will happen next." Seeing dark clouds gather and hearing thunder approach, we close the windows and bring in the laundry; seeing a teacher frown and hearing a knowledge point stressed again and again, we know a question is probably coming; seeing a player leap and hearing the crowd gasp, we can anticipate a spectacular dunk.

Yet while existing multimodal large language models (MLLMs) show strong all-round perception, their ability to predict future events from audio-visual cues remains largely unexplored. Existing audio-visual benchmarks focus mainly on retrospective understanding ("what happened in the video") rather than forward-looking prediction ("what happens next").

Now this gap has finally been filled. Fudan University, the Shanghai Innovation Institute, and the National University of Singapore have released FutureOmni, which not only redefines "future prediction" evaluation for multimodal models ...
Starting from Plane Geometry: How Formal Verification Drives a Leap in MLLM Reasoning
机器之心· 2026-01-20 10:19
On the road to artificial general intelligence (AGI), multimodal large language models (MLLMs) have shown astonishing ability in visual understanding and text generation, yet they still face a stubborn gap: overcoming inherent hallucinations and logical breaks in complex mathematical and geometric reasoning. Existing outcome-oriented training often masks the fragility of the reasoning process, so models frequently "guess the right answer" while "getting the reasoning wrong." This black-box style of learning keeps models from acquiring genuinely robust reasoning ability.

Facing this challenge, a team from Shanghai Jiao Tong University, Fudan University, The Chinese University of Hong Kong (Shenzhen), the Shanghai AI Laboratory, and other institutions proposes a new, systematic solution: "Formal Enhance Informal Reasoning." Its core insight is that rigorously verifiable in-domain formal logic can serve as a strong supervision signal to regularize and guide a model's reasoning behavior in informal settings. Going further, the study finds that the logical discipline acquired in this rigorous mathematical environment is not limited to geometry problems: it acts as a general key that unlocks out-of-distribution (OOD) generalization to general mathematics and broader reasoning tasks.

Building on this idea, the team worked through three stages to construct a complete closed loop from the data layer up to the model layer ...
Perception-Level Image Understanding in Depth: UniPercept Unifies Image Aesthetics, Quality, and Structure-Texture Perception
机器之心· 2026-01-08 02:06
Core Insights
- The article discusses the development of UniPercept, a novel framework for perceptual image understanding that integrates aesthetics, quality, and structure & texture dimensions, addressing the limitations of existing multimodal large language models in understanding visual perception [3][5].

Group 1: Framework Overview
- UniPercept is the first framework to unify three perceptual dimensions: aesthetics, quality, and structure & texture, enhancing the understanding of how images look beyond mere object recognition [3][5].
- The framework includes a hierarchical definition system and a large-scale benchmark dataset called UniPercept-Bench, which allows for comprehensive evaluation of image attributes [5][10].

Group 2: Evaluation System
- UniPercept-Bench features a three-tiered evaluation system comprising 3 domains, 17 categories, and 44 criteria, providing detailed expert-level definitions that surpass previous image evaluation benchmarks [10][11].
- The evaluation dimensions include Image Aesthetics Assessment (IAA), Image Quality Assessment (IQA), and Image Structure & Texture Assessment (ISTA), each focusing on a different aspect of image perception [11][12].

Group 3: Model Development
- The model employs domain-adaptive pre-training on a dataset of approximately 800,000 samples, which helps it learn low-level visual features across domains [22].
- Task-aligned reinforcement learning is utilized to enhance the model's perceptual consistency, with specific reward functions designed for visual rating (VR) and visual question answering (VQA) tasks [23][25].

Group 4: Performance Metrics
- UniPercept outperforms existing top models in various tasks, achieving the highest Spearman and Pearson correlation coefficients in aesthetics, quality, and structure assessments [29][30].
- In visual question answering tasks, UniPercept shows a significant accuracy improvement over leading models, particularly in identifying subtle damage in images [31].

Group 5: Applications
- UniPercept demonstrates potential as a reward model for generative models, optimizing image generation by enhancing composition balance, detail sharpness, and structural richness [33][36].
- The framework's multi-dimensional reward signals work synergistically to improve both the visual appeal and the technical fidelity of generated images [37].
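The article's headline numbers are Spearman and Pearson correlations between model ratings and human ratings. As a minimal pure-Python sketch of both metrics (the toy score lists are invented; real evaluations typically use a library such as scipy.stats, and this rank helper does not average tied ranks):

```python
import math

def pearson(x, y):
    # Pearson: linear correlation between the raw scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman: Pearson computed on ranks, i.e. monotonic agreement.
    # Simplification: ties are not averaged (fine for this toy data).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Toy example: model aesthetic ratings vs. human ratings (hypothetical)
model = [3.1, 4.5, 2.0, 4.9, 3.8]
human = [3.0, 4.0, 2.5, 5.0, 3.5]
print(round(pearson(model, human), 3), round(spearman(model, human), 3))
```

Here the two lists rank every image identically, so Spearman is exactly 1.0 even though the raw scores differ and Pearson is slightly below 1.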
The Most Robust MLLM: HKUST Open-Sources a New "Degradation-Aware Reasoning" Paradigm
36Ke· 2025-12-24 07:47
Core Insights
- The article discusses the breakthrough of Robust-R1, a new approach to multimodal large language models (MLLMs) that addresses the critical issue of visual degradation in real-world applications, such as autonomous driving and medical imaging [1][2][23].

Group 1: Problem Identification
- Visual degradation, including blurriness, noise, and occlusion, poses a significant challenge for advanced models like GPT-4V and Qwen-VL, hindering their deployment in key sectors [2][4].
- Existing methods rely on "implicit adaptation" strategies, which attempt to make models resistant to interference but fail to provide a comprehensive understanding of the degradation itself [2][3].

Group 2: Robust-R1 Solution
- Robust-R1 introduces a paradigm shift by transforming the perception of visual degradation into an explicit structured reasoning task, allowing models to not only resist but also diagnose interference [2][3][24].
- The core idea of Robust-R1 is to construct a "degradation-aware reasoning system" that follows a three-step diagnostic process: degradation diagnosis, semantic-impact analysis, and robust conclusion generation [3][5].

Group 3: Technical Implementation
- The first phase involves supervised fine-tuning with a structured reasoning chain, enabling the model to learn a "diagnose first, reason later" approach [9].
- The second phase introduces a degradation-aware reward function to optimize the model's accuracy in identifying degradation types and intensities [10].
- The third phase employs a dynamic reasoning-depth adjustment mechanism, allowing the model to adapt its reasoning to the severity of the degradation [10][11].

Group 4: Performance Validation
- Robust-R1 has been tested against various benchmarks, achieving superior performance in understanding real-world degradation compared to existing models, with a comprehensive performance score of 0.5017 on the R-Bench benchmark [14][15].
- In stress tests with varying levels of synthetic degradation, Robust-R1 demonstrated significantly better robustness, maintaining usable accuracy even under extreme conditions [18].

Group 5: Implications and Future Directions
- The development of Robust-R1 marks a significant transition in multimodal models from striving for perfection in clear environments to making reliable decisions in complex realities [23][24].
- This innovation not only enhances the transparency and trustworthiness of AI models but also sets a new direction for robust MLLM research [24].
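The degradation-aware reward from phase two can be illustrated with a small sketch. The weights, field names, and scoring terms below are hypothetical stand-ins, not Robust-R1's actual implementation; the point is only the shape of the signal, which credits naming the degradation type, estimating its intensity, and still answering correctly:

```python
def degradation_reward(pred, truth,
                       w_type=0.4, w_intensity=0.3, w_answer=0.3):
    """Toy reward combining three terms (weights are made up):
    degradation-type match, intensity estimate, and final-answer match."""
    r_type = 1.0 if pred["type"] == truth["type"] else 0.0
    # Intensity credit decays linearly with the estimation error (both in [0, 1])
    r_intensity = max(0.0, 1.0 - abs(pred["intensity"] - truth["intensity"]))
    r_answer = 1.0 if pred["answer"] == truth["answer"] else 0.0
    return w_type * r_type + w_intensity * r_intensity + w_answer * r_answer

pred = {"type": "motion_blur", "intensity": 0.6, "answer": "pedestrian"}
truth = {"type": "motion_blur", "intensity": 0.8, "answer": "pedestrian"}
print(degradation_reward(pred, truth))  # ≈ 0.4 + 0.3*0.8 + 0.3 = 0.94
```

Separating the reward into diagnosis terms and an answer term is what lets an RL phase push the model toward explicit degradation reasoning rather than answer accuracy alone.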
New SOTA in Complex Spatial Reasoning with a 55% Performance Gain: SpatialDreamer from Sun Yat-sen University
36Ke· 2025-12-22 10:12
Core Insights
- SpatialDreamer, developed by institutions including Sun Yat-sen University, significantly enhances performance in complex spatial tasks through active mental imagery and spatial reasoning [1][4].

Group 1: Model Development
- SpatialDreamer addresses limitations of existing models in perspective-transformation tasks by simulating human-like active exploration and reasoning [1][4].
- The model transitions from passive observation to active goal-directed imagination, allowing it to autonomously decide what to observe and how to reason in a 3D environment [4].

Group 2: Methodology
- The closed-loop reasoning process of SpatialDreamer consists of three steps: exploration, imagination, and reasoning [4].
- GeoPO, a policy optimization method, combines tree sampling and geometric consistency constraints to enhance model performance and accelerate training convergence [4].

Group 3: Dataset and Learning
- The SpatialDreamer-SFT dataset includes single-pass reasoning and reflective reasoning data, promoting a "think-imagine-answer" learning pattern [6].

Group 4: Experimental Results
- SpatialDreamer achieved state-of-the-art (SOTA) accuracy of 93.9% and 92.5% on real and synthetic images in the SAT benchmark [7].
- It improved overall accuracy to 84.9% on the MindCube-Tiny benchmark, surpassing the baseline Qwen2.5-VL-7B by over 55% [7].
- In VSI-Bench, it led in tasks such as object counting and path planning with an average accuracy of 62.2% [7].
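The three-step explore-imagine-reason cycle can be sketched as a minimal closed loop. Every function body below is a stand-in stub with invented names, not SpatialDreamer's actual components; the sketch only shows how the three stages chain together:

```python
def explore(scene, step):
    # Stub: choose the next viewpoint to inspect (a learned policy in the paper)
    return f"viewpoint_{step}"

def imagine(scene, viewpoint):
    # Stub: form a "mental image" of the scene from that viewpoint
    return f"imagined({scene}@{viewpoint})"

def reason(observations):
    # Stub: answer once enough imagined views have been gathered
    return f"answer from {len(observations)} views"

def closed_loop(scene, max_steps=3):
    observations = []
    for step in range(max_steps):
        viewpoint = explore(scene, step)              # 1. decide what to observe
        observations.append(imagine(scene, viewpoint))  # 2. imagine that view
    return reason(observations)                        # 3. reason over the views

print(closed_loop("kitchen"))  # → answer from 3 views
```

The loop structure is what distinguishes this "active" paradigm from single-pass VQA: observation targets are chosen by the model itself rather than fixed by the input.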
HKU-Led DrivePI: A Spatially Intelligent 4D MLLM Unifying Understanding, Perception, Prediction, and Planning for Autonomous Driving
自动驾驶之心· 2025-12-22 09:20
Core Viewpoint
- DrivePI is introduced as a novel unified spatial-aware 4D multimodal large language model (MLLM) framework that integrates coarse-grained language understanding with fine-grained 3D perception, bridging the gap between vision-based and VLA paradigms in autonomous driving [2][38].

Group 1: Project Overview
- DrivePI is led by the University of Hong Kong, with contributions from companies such as Huawei and universities including Tianjin University and Huazhong University of Science and Technology [2].
- The model performs spatial understanding, 3D perception, prediction, and planning tasks through end-to-end optimization, showcasing its capability to handle complex autonomous-driving scenarios [4][6].

Group 2: Technical Innovations
- DrivePI adopts a multimodal perception approach, utilizing LiDAR alongside camera images to enhance spatial understanding and provide accurate 3D geometric information [11].
- The model generates intermediate fine-grained 3D perception and prediction representations, ensuring reliable spatial awareness and enhancing the interpretability and safety of autonomous-driving systems [11].
- A rich data engine seamlessly integrates 3D occupancy and flow representations into natural-language scene descriptions, allowing the model to understand complex spatiotemporal dynamics [11].

Group 3: Performance Metrics
- DrivePI outperforms existing VLA models, achieving 2.5% higher average accuracy on nuScenes-QA than OpenDriveVLA-7B and reducing the collision rate by 70%, from 0.37% to 0.11% [5][16].
- In 3D occupancy and flow prediction, DrivePI achieved 49.3% OccScore and 49.3% RayIoU, surpassing the FB-OCC method by 10.3 percentage points [15][21].
- The model demonstrated a 32% reduction in L2 error for trajectory planning compared to VAD, showcasing its effectiveness in planning tasks [16].

Group 4: Data Engine and Annotation
- The data engine for DrivePI operates in three main stages, focusing on generating diverse question-answer pairs for 4D spatial understanding and planning reasoning [12][18].
- Scene-understanding annotations are generated to avoid confusion in distinguishing different views, enhancing the model's ability to interpret various perspectives [18].

Group 5: Ablation Studies and Insights
- Ablation studies indicate that combining text and visual heads improves performance across most tasks, demonstrating the effectiveness of unifying text understanding with 3D perception, prediction, and planning [23].
- Exploring different text-data scales revealed significant improvements in occupancy-state prediction accuracy as the training data size increased [26].

Group 6: Future Prospects
- DrivePI is expected to inspire future research directions in autonomous driving by enhancing the interpretability and decision-making capabilities of systems through language reasoning and detailed 3D outputs [38].
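The reported 70% collision-rate reduction follows directly from the two rates the article gives (0.37% before, 0.11% after). As a quick arithmetic check:

```python
# Collision rates (%) reported in the article
before, after = 0.37, 0.11

# Relative reduction: (0.37 - 0.11) / 0.37 ≈ 0.703
reduction = (before - after) / before
print(f"{reduction:.0%}")  # → 70%
```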
Surpassing NVIDIA's Describe Anything: CAS and ByteDance Jointly Propose "GAR", Building on DeepSeek-OCR
36Ke· 2025-10-28 07:26
Core Insights
- DeepSeek-OCR has introduced a new concept called "Vision as Context Compression," focusing on using OCR capabilities to compress documents through images. The collaboration between the Chinese Academy of Sciences and ByteDance has proposed "Grasp Any Region" (GAR) as a new approach to explore whether natural images can also serve as text compression [1].

Group 1: GAR Capabilities
- GAR achieves precise region captioning, providing a potential pathway for constructing dense captions for natural images [2].
- GAR possesses three main capabilities: accurate description of user-specified regions, modeling relationships between multiple regions, and performing complex combinatorial reasoning [5][6].

Group 2: Comparison with Existing Models
- GAR demonstrates superior performance in accurately understanding user-specified regions compared to existing models like DAM, which often misidentify objects [9][40].
- GAR can accurately identify and describe very small objects, showcasing its detailed understanding capabilities [11][16].

Group 3: Technical Innovations
- The GAR model integrates fine-grained understanding of specified regions while retaining global context, achieved through a novel prompt-encoding scheme and Region of Interest (RoI)-aligned feature replay [25][28].
- The model's design allows it to focus on details without neglecting the overall context, which is crucial for accurate reasoning about complex relationships between objects [27][30].

Group 4: Data and Training
- GAR was trained on a large-scale, high-quality dataset, including 456,000 fine-grained descriptions and 414,000 samples for relational understanding [30][35].
- The training process leveraged the Panoptic Scene Graph dataset to enhance multi-region relational reasoning capabilities [32].

Group 5: Benchmark Performance
- GAR-8B achieved a score of 59.9 on the GAR-Bench-VQA test set, outperforming advanced models like GPT-4o and approaching the performance of top reasoning models [39].
- On the GAR-Bench-Cap test set, GAR-1B and GAR-8B scored 57.5 and 62.2, respectively, indicating their leading position in generating detailed and accurate local descriptions [41].

Group 6: Applications and Future Potential
- GAR can be utilized as a data engine for training multimodal understanding models, enhancing instruction-following capabilities in text-to-image or text-to-video models, and providing precise descriptions for editing tasks [47].
- The model's open-source nature and support for local deployment via Gradio make it accessible for various applications [48].
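The RoI-aligned "feature replay" idea, cropping a region's features out of the global feature map and handing the model both the crop and the full map, can be sketched on a toy 2D grid. This is pure Python over nested lists; a real system would apply RoIAlign with sub-pixel interpolation over CNN/ViT feature maps, and the names here are illustrative:

```python
def roi_crop(feature_map, box):
    """Toy RoI crop: slice a (row0, row1, col0, col1) box out of a 2D
    feature grid. Real RoIAlign samples at fractional positions instead."""
    r0, r1, c0, c1 = box
    return [row[c0:c1] for row in feature_map[r0:r1]]

# 4x4 "feature map" with one scalar feature per cell (toy data)
global_features = [[r * 4 + c for c in range(4)] for r in range(4)]

# Replay: the language model would receive the global map AND the
# region's own replayed features, so detail and context coexist.
region = roi_crop(global_features, (1, 3, 1, 3))
print(region)  # → [[5, 6], [9, 10]]
```

Keeping the global map alongside the replayed region is the point the article stresses: detail without losing the surrounding context.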
Did Large Models "Crash" on Embodied Reasoning? 4,496 Questions Comprehensively Expose Their Weaknesses
机器之心· 2025-10-28 00:41
Core Insights
- The article focuses on the evaluation of multimodal large language models (MLLMs) in embodied-intelligence tasks, providing detailed failure analysis and proposing an agent algorithm for improvement [25].

Group 1: Embodied Intelligence and MLLMs
- Embodied intelligence is a concept where an agent completes a closed loop of perception, understanding, and decision-making in an environment, relying on various skills [2].
- Many excellent works have deployed MLLMs in different applications of embodied intelligence, but evaluations have mainly focused on subfields like pointing and spatial reasoning [2][4].

Group 2: BEAR Benchmark
- The BEAR benchmark was proposed by Northeastern University in collaboration with other institutions to systematically evaluate MLLMs across sub-capabilities, providing detailed error analysis and algorithmic enhancements [4].
- BEAR includes 4,469 image-video-text VQA tasks and covers six major categories, including five foundational categories and a sixth long-range reasoning category, breaking down tasks into 14 different skills [8][9].

Group 3: Evaluation Results
- The evaluation measured 20 different MLLMs, revealing that the best-performing model, GPT-5, achieved only a 52% success rate on the BEAR benchmark [11].
- Closed-source models generally performed better than open-source models, although some open-source models like the InternVL series showed strong potential, outperforming models like GPT-4o and Claude [11].

Group 4: Error Analysis
- A fine-grained error analysis of GPT-4o revealed interesting findings, indicating that the model's visual capabilities are a major bottleneck across multiple categories, particularly in language grounding and trajectory understanding [19].
- The analysis showed that 88% of errors in long-range reasoning were attributed to lower-level perception and spatial-reasoning issues [19].

Group 5: BEAR-Agent Development
- The authors developed BEAR-Agent, a multimodal agent designed to enhance visual-reasoning capabilities by providing tools and drawing auxiliary lines, significantly improving performance on the BEAR benchmark [17].
- The performance of both the best open-source model (InternVL3-14B) and the closed-source model (GPT-5) improved significantly with the integration of BEAR-Agent [17].

Group 6: Simulation Testing
- Further experiments in a desktop-manipulation environment demonstrated that BEAR-Agent improved the performance of MOKA by 20.17%, indicating its potential for embodied agents [21].
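Per-category scores like BEAR's reduce to grouping question-level correctness by skill. A minimal aggregation sketch (the records and skill names below are invented, not BEAR data):

```python
from collections import defaultdict

def accuracy_by_skill(records):
    """records: iterable of (skill, is_correct) pairs from a benchmark run."""
    hits, totals = defaultdict(int), defaultdict(int)
    for skill, correct in records:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}

# Hypothetical per-question results for two of BEAR's 14 skills
results = [("pointing", True), ("pointing", False),
           ("spatial_reasoning", True), ("spatial_reasoning", True)]
print(accuracy_by_skill(results))  # → {'pointing': 0.5, 'spatial_reasoning': 1.0}
```

Splitting accuracy by skill rather than reporting one aggregate number is what makes the article's bottleneck analysis (e.g. 88% of long-range errors tracing back to perception) possible.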
NeurIPS 2025 | Breaking Closed-Source Multimodal Large Models: A New Adversarial Attack Method Based on Optimal Feature Alignment
机器之心· 2025-10-17 04:09
Core Insights
- The article discusses the advancements and security vulnerabilities of Multimodal Large Language Models (MLLMs), particularly their susceptibility to adversarial attacks [2][8].
- It introduces a novel attack framework called FOA-Attack, which enhances the transferability of adversarial samples across different models by optimizing feature alignment at both global and local levels [3][11].

Group 1: Background and Motivation
- MLLMs like GPT-4 and Claude-3 exhibit exceptional performance in tasks such as image understanding and visual question answering, but they inherit vulnerabilities from their visual encoders, making them prone to adversarial attacks [8][10].
- Adversarial attacks can be categorized into non-targeted (aiming to produce incorrect outputs) and targeted (aiming for specific outputs), with the latter being particularly challenging in black-box scenarios where model internals are inaccessible [10][11].

Group 2: FOA-Attack Framework
- FOA-Attack employs a dual-dimensional alignment strategy, focusing on both global features (using a cosine-similarity loss on [CLS] tokens) and local features (using clustering and optimal transport on patch tokens) to improve transferability [6][11].
- The framework includes a dynamic weight-integration strategy that adapts the influence of multiple models during attack generation, enhancing the overall effectiveness of the attack [6][11].

Group 3: Experimental Results
- FOA-Attack significantly outperforms existing state-of-the-art methods on both open-source and closed-source MLLMs, achieving remarkable success rates, particularly against commercial closed-source models like GPT-4 [4][19].
- In experiments, FOA-Attack achieved an attack success rate (ASR) of 75.1% against GPT-4, showcasing its effectiveness in real-world applications [19][24].

Group 4: Conclusion and Future Directions
- The findings highlight the vulnerabilities of current MLLMs in the visual-encoding phase and suggest new defensive strategies, particularly fortifying local-feature robustness [24][25].
- The authors have made the paper and code publicly available for further exploration and discussion, indicating a commitment to advancing research in this area [25][27].
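FOA-Attack's global term is a cosine-similarity loss between the [CLS] features of the adversarial image and the target image. That single term can be sketched in pure Python; the vectors below are toy stand-ins for encoder outputs, and the local optimal-transport term over patch tokens is omitted entirely:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def global_alignment_loss(cls_adv, cls_target):
    # Minimizing this pushes the adversarial image's [CLS] feature
    # toward the target image's, which is what transfers across encoders
    return 1.0 - cosine_similarity(cls_adv, cls_target)

cls_adv = [0.2, 0.9, -0.1]     # toy adversarial-image [CLS] feature
cls_target = [0.2, 0.9, -0.1]  # identical vector → loss is ~0 (perfectly aligned)
print(global_alignment_loss(cls_adv, cls_target))
```

In the full attack this loss would be summed with the local patch-token term and weighted per surrogate model by the dynamic integration strategy the article describes.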
The Scene Stands Still While the Viewer Moves: How Do MLLMs Handle a Real World of Shifting Views? OST-Bench Reveals Their Weakness in Online Spatio-Temporal Understanding
36Ke· 2025-10-14 08:54
Core Insights
- The introduction of OST-Bench presents a new challenge for multimodal large language models (MLLMs) by focusing on dynamic online scene understanding, in contrast with traditional offline benchmarks [1][3][12].
- OST-Bench emphasizes the necessity for models to perform real-time perception, memory maintenance, and spatiotemporal reasoning based on continuous local observations [3][4][12].

Benchmark Characteristics
- OST-Bench is designed to reflect real-world challenges more accurately than previous benchmarks, featuring two main characteristics: online settings requiring real-time processing, and cross-temporal understanding that integrates current and historical information [3][4][12].
- The benchmark categorizes dynamic scene understanding into three information types: agent spatial state, visible information, and agent-object spatial relationships, leading to the creation of 15 sub-tasks [7][12].

Experimental Results
- The performance of various models on OST-Bench reveals significant gaps between current MLLMs and human-level performance, particularly in complex spatiotemporal reasoning tasks [12][21].
- Models like Claude-3.5-Sonnet and GPT-4.1 show varying degrees of success across tasks, with human-level performance significantly higher than that of the models [9][10][12].

Model Limitations
- Current MLLMs exhibit a tendency to take shortcuts in reasoning, often relying on limited information rather than comprehensive spatiotemporal integration, a behavior termed the "spatio-temporal reasoning shortcut" [15][18].
- The study finds that the models struggle in long-sequence online settings, indicating a need for improved mechanisms for complex spatial reasoning and long-term memory retrieval [12][21].

Future Directions
- The findings from OST-Bench suggest that enhancing complex spatial-reasoning capabilities and long-term memory mechanisms will be crucial for the next generation of multimodal models to achieve real-world intelligence [22].
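The online setting OST-Bench stresses (incremental observations plus a persistent memory that questions must draw on) can be sketched as a minimal agent loop. All names here are illustrative, not from OST-Bench:

```python
class OnlineAgent:
    """Toy agent for the online setting: it accumulates observations one
    at a time and must answer from memory, instead of receiving the whole
    trajectory at once as in an offline benchmark."""

    def __init__(self):
        self.memory = []  # persistent history of local observations

    def observe(self, frame):
        # Memory-maintenance step: each new local view is stored
        self.memory.append(frame)

    def answer(self, question):
        # A real model must integrate current and historical information;
        # this stub just reports how much history is available.
        return f"{question}: using {len(self.memory)} remembered observations"

agent = OnlineAgent()
for frame in ["doorway", "hallway", "kitchen"]:
    agent.observe(frame)
print(agent.answer("Where is the agent?"))
```

The "spatio-temporal reasoning shortcut" the article identifies corresponds to answering from only the most recent entries of such a memory instead of integrating the whole history.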