Multimodal Large Language Models (MLLMs)
Toward Perception-Level Image Understanding: UniPercept Unifies Image Aesthetics, Quality, and Structure & Texture Perception
机器之心· 2026-01-08 02:06
Researchers from the Shanghai AI Laboratory, the University of Science and Technology of China (USTC), Peking University, Tsinghua University, and other institutions have jointly released UniPercept, the first perception-level image understanding framework to unify three dimensions: Aesthetics, Quality, and Structure & Texture.

Cao Shuo (操铄): a PhD student jointly trained by USTC and the Shanghai AI Laboratory, focusing on multimodal image understanding and generation. He led the development of ArtiMuse, UniPercept, and other works, with multiple papers published at top international conferences such as ECCV and ICCV. Li Jiayang (李佳阳): a master's student at Peking University focusing on multimodal image understanding and fusion. A core author of ArtiMuse, UniPercept, and other works, with multiple papers published in top international journals such as TIP and TPAMI.

Although multimodal large language models (MLLMs) have made great progress at the semantic level of recognizing "what is in the image," they remain weak at the perceptual level of understanding "how the image looks."

UniPercept-Bench: Project page: https://thunderbolt215.github.io/Unipercept-project/ Code repository: https://github.com/thunderbolt215/UniP ...
The Most Robust MLLM: HKUST Open-Sources a New "Degradation-Aware Reasoning" Paradigm
36Ke· 2025-12-24 07:47
[Lead] Multimodal large language models (MLLMs) have become the core engine of AI visual understanding, yet their performance collapse under real-world visual degradation (blur, noise, occlusion, etc.) has remained a fatal bottleneck for industrial deployment. Recently, Robust-R1, a paper accepted as an Oral at AAAI 2026, offered a new answer: teams from the Hong Kong University of Science and Technology, Northwestern Polytechnical University, and other institutions step outside the "implicit adaptation" mindset for the first time and reframe visual degradation as an explicit, structured reasoning task, so that the model not only "resists interference" but can also "diagnose interference," achieving gains in both quality and robustness on multiple authoritative benchmarks.

When multimodal large models (MLLMs) move from the lab into the real world, they hit a fatal bottleneck: visual degradation. Rain-speckled car windows, aging surveillance footage, heavily compressed web images, the inherent noise of medical scans... These degradations, ubiquitous in the real world, are enough to make even state-of-the-art models such as GPT-4V and Qwen-VL produce absurd outputs, and they have become the Achilles' heel of deployment in critical domains such as autonomous driving, medical imaging, and security surveillance.

The fundamental dilemma of existing methods is "implicit adaptation": using adversarial training, data augmentation, and similar techniques to make the model "tough out" the interference. This is like fitting the model with a thicker filter: it treats the symptom rather than the cause and is not interpretable. The model improves on specific degradations, yet it cannot understand the degradation itself, let alone generalize to unseen interference, and its decision process remains a black box. ...
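The reframing described above, diagnose the degradation first and only then answer, can be illustrated with a structured prompt and output schema. The sketch below is a hypothetical Python illustration: the prompt wording, the JSON fields, and the `mllm.chat` call in the usage comment are assumptions for demonstration, not the Robust-R1 paper's actual interface.

```python
# Hypothetical sketch of "explicit degradation reasoning" as a structured prompt/response
# schema, in the spirit of the Robust-R1 description above. Field names and prompt wording
# are illustrative assumptions, not the paper's actual format.
import json

DIAGNOSE_THEN_ANSWER_PROMPT = """You are given a possibly degraded image and a question.
Step 1 (diagnose): name the degradation type (blur / noise / occlusion / compression / none),
its severity (low / mid / high), and which regions it affects.
Step 2 (answer): answer the question, stating how the diagnosed degradation limits your confidence.
Return JSON with keys: degradation, severity, affected_regions, answer, confidence."""

def parse_structured_reply(reply_text: str) -> dict:
    """Parse the model's JSON reply; fall back to a low-confidence stub if parsing fails."""
    try:
        return json.loads(reply_text)
    except json.JSONDecodeError:
        return {"degradation": "unknown", "severity": "unknown",
                "affected_regions": [], "answer": reply_text.strip(), "confidence": 0.0}

# Usage (pseudocode, hypothetical MLLM client):
#   reply = mllm.chat(image=img, prompt=DIAGNOSE_THEN_ANSWER_PROMPT + "\n" + question)
#   result = parse_structured_reply(reply)
```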
New SOTA for Complex Spatial Reasoning with a 55% Performance Gain: SpatialDreamer, a New Work from Sun Yat-sen University
36Ke· 2025-12-22 10:12
[Lead] Sun Yat-sen University and collaborating institutions introduce SpatialDreamer, which substantially improves performance on complex spatial tasks through active mental imagination and spatial reasoning. By simulating the human process of active exploration, imagination, and reasoning, it addresses the limitations of existing models on tasks such as viewpoint changes and opens a new path for developing spatial intelligence in AI.

Paper link: https://arxiv.org/pdf/2512.07733

Although multimodal large language models (MLLMs) have made notable progress in scene understanding, their performance on complex spatial reasoning tasks that require mental simulation remains limited. Existing methods mostly rely on passive observation of spatial data and lack the ability, characteristic of human spatial cognition, to actively imagine and dynamically update internal representations. For example, on tasks that require switching viewpoints to determine the position of an occluded object, existing models often fail because they reason from a single viewpoint. To address this, a research team from MBZUAI and Sun Yat-sen University proposes SpatialDreamer, a reinforcement learning-based framework that endows MLLMs with human-like spatial mental simulation through a closed loop of active exploration, visual imagination, and evidence fusion. SpatialDreamer models human spatial cognition as a closed-loop reasoning pipeline with three steps (a minimal sketch follows this summary): 1) Explore: the model infers the optimal egocentric action for the current scene (e.g., "move forward 0.75 meters" or "turn left 45 degrees"); 2) Imagine: it invokes a world model (e.g., S ...
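To make the explore-imagine-fuse loop concrete, here is a minimal Python sketch of such a closed loop. The `policy_model`, `world_model`, and `answer_from_evidence` callables are hypothetical placeholders standing in for SpatialDreamer's learned components; this is a reading of the summary above, not the authors' code.

```python
# Minimal sketch of the explore -> imagine -> fuse closed loop described above.
# The policy_model, world_model, and answer_from_evidence interfaces are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    observations: list = field(default_factory=list)   # real + imagined views
    actions: list = field(default_factory=list)        # egocentric actions taken so far

def spatial_dream_loop(question, first_view, policy_model, world_model,
                       answer_from_evidence, max_steps=4):
    state = ReasoningState(observations=[first_view])
    for _ in range(max_steps):
        # 1) Explore: propose the next egocentric action (e.g. "move forward 0.75 m").
        action = policy_model(question, state.observations)
        if action == "STOP":
            break
        # 2) Imagine: a world model renders the view expected after taking that action.
        imagined_view = world_model(state.observations[-1], action)
        # 3) Fuse: accumulate the imagined evidence and continue reasoning.
        state.actions.append(action)
        state.observations.append(imagined_view)
    return answer_from_evidence(question, state.observations, state.actions)
```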
HKU-Led DrivePI: A Spatially Intelligent 4D MLLM Unifying Understanding, Perception, Prediction, and Planning for Autonomous Driving
自动驾驶之心· 2025-12-22 09:20
Paper authors | Zhe Liu et al.

Although multimodal large language models (MLLMs) have demonstrated strong capabilities across many domains, their use for producing fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. This paper proposes DrivePI, a novel spatially aware 4D MLLM that serves as a unified vision-language-action (VLA) framework while remaining compatible with vision-action (VA) models. The approach is optimized end to end and performs spatial understanding, 3D perception (e.g., 3D occupancy voxels), prediction (e.g., occupancy flow), and planning (e.g., action outputs) in parallel. To obtain precise geometric information and rich visual appearance, the method integrates point clouds, multi-view images, and language instructions within a single unified MLLM architecture. The authors also built a data engine that generates text-occupancy and text-flow question-answer pairs to enable 4D spatial understanding.

Notably, using only a 0.5B-parameter Qwen2.5 model as the MLLM backbone, DrivePI as a single unified model already matches or surpasses existing VLA models and specialized VA models. Specifically, compared with VLA models, DrivePI's average accuracy on nuScenes-QA is ...
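As a rough illustration of what a unified architecture like the one described above could look like, the sketch below fuses point-cloud, multi-view image, and text tokens in one transformer and decodes parallel heads for occupancy, occupancy flow, and planned actions. All module names, dimensions, and the token layout are assumptions for illustration, not DrivePI's actual design.

```python
# Illustrative sketch (not the DrivePI implementation) of a unified VLA forward pass that
# fuses point clouds, multi-view image features, and language tokens, then decodes parallel
# heads for occupancy, occupancy flow, and planning. Module names/shapes are assumptions.
import torch
import torch.nn as nn

class Unified4DMLLM(nn.Module):
    def __init__(self, d_model=896, vocab_size=32000):   # 896 ~ Qwen2.5-0.5B hidden size
        super().__init__()
        self.point_encoder = nn.Linear(4, d_model)        # (x, y, z, intensity) -> token
        self.image_encoder = nn.Linear(768, d_model)      # ViT patch features -> token
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.occ_head = nn.Linear(d_model, 18)            # per-point-token occupancy classes
        self.flow_head = nn.Linear(d_model, 3)            # per-point-token flow vector
        self.plan_head = nn.Linear(d_model, 2)            # per-waypoint (x, y) offsets

    def forward(self, points, image_feats, text_ids):
        # points: (B, N, 4), image_feats: (B, P, 768), text_ids: (B, T) long
        tokens = torch.cat([self.point_encoder(points),
                            self.image_encoder(image_feats),
                            self.text_embed(text_ids)], dim=1)
        h = self.backbone(tokens)
        n_pts = points.shape[1]
        # Parallel task heads read from the shared sequence (a simplification).
        return {"occupancy": self.occ_head(h[:, :n_pts]),
                "flow": self.flow_head(h[:, :n_pts]),
                "plan": self.plan_head(h[:, -6:])}        # last 6 tokens as waypoint queries
```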
Surpassing NVIDIA's Describe Anything: CAS and ByteDance Jointly Propose "GAR", Building on DeepSeek-OCR
36Ke· 2025-10-28 07:26
Core Insights
- DeepSeek-OCR has introduced a new concept called "Vision as Context Compression," focusing on using OCR capabilities to compress documents through images. The collaboration between the Chinese Academy of Sciences and ByteDance has proposed "Grasp Any Region" (GAR) as a new approach to explore whether natural images can also serve as text compression [1].
Group 1: GAR Capabilities
- GAR achieves precise region captioning, providing a potential pathway for constructing dense captions for natural images [2].
- GAR possesses three main capabilities: accurate description of user-specified regions, modeling relationships between multiple regions, and performing complex combinatorial reasoning [5][6].
Group 2: Comparison with Existing Models
- GAR demonstrates superior performance in accurately understanding user-specified regions compared to existing models like DAM, which often misidentify objects [9][40].
- GAR can accurately identify and describe very small objects, showcasing its detailed understanding capabilities [11][16].
Group 3: Technical Innovations
- The GAR model integrates fine-grained understanding of specified regions while retaining global context, achieved through a novel prompt encoding scheme and Region of Interest (RoI)-aligned feature replay technology (see the sketch after this summary) [25][28].
- The model's design allows it to focus on details without neglecting the overall context, which is crucial for accurate reasoning about complex relationships between objects [27][30].
Group 4: Data and Training
- GAR was trained using a large-scale, high-quality dataset, including 456,000 fine-grained descriptions and 414,000 samples for relational understanding [30][35].
- The training process involved leveraging the Panoptic Scene Graph dataset to enhance multi-region relational reasoning capabilities [32].
Group 5: Benchmark Performance
- GAR-8B achieved a score of 59.9 on the GAR-Bench-VQA test set, outperforming advanced models like GPT-4o and approaching the performance of top reasoning models [39].
- In the GAR-Bench-Cap test set, GAR-1B and GAR-8B scored 57.5 and 62.2, respectively, indicating their leading position in generating detailed and accurate local descriptions [41].
Group 6: Applications and Future Potential
- GAR can be utilized as a data engine for training multimodal understanding models, enhancing instruction-following capabilities in text-to-image or text-to-video models, and providing precise descriptions for editing tasks [47].
- The model's open-source nature and support for local deployment via Gradio make it accessible for various applications [48].
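The "RoI-aligned feature replay" idea mentioned above can be pictured as re-extracting the user-specified region from the global feature map and appending it to the global visual tokens, so the language model sees both context and detail. The sketch below shows one plausible way to do this with torchvision's `roi_align`; it is an assumption about the general mechanism, not GAR's released implementation.

```python
# Hedged sketch of the general idea behind "RoI-aligned feature replay": re-extract region
# features from the global feature map with RoIAlign and append them to the global visual
# tokens, so the LLM sees both context and the user-specified regions. Not GAR's actual code.
import torch
from torchvision.ops import roi_align

def build_visual_tokens(feature_map, boxes, spatial_scale, out_size=7):
    """
    feature_map: (1, C, H, W) global features from the vision encoder (batch size 1 for brevity).
    boxes: list with one (K, 4) float tensor of user-specified regions in image coordinates.
    Returns a (1, H*W + K*out_size*out_size, C) token sequence: global context + replayed RoIs.
    """
    b, c, h, w = feature_map.shape
    global_tokens = feature_map.flatten(2).transpose(1, 2)               # (1, H*W, C)
    region_feats = roi_align(feature_map, boxes, output_size=out_size,
                             spatial_scale=spatial_scale, aligned=True)  # (K, C, 7, 7)
    region_tokens = region_feats.flatten(2).transpose(1, 2).reshape(1, -1, c)  # (1, K*49, C)
    return torch.cat([global_tokens, region_tokens], dim=1)

# Example: feats = torch.randn(1, 256, 24, 24); box = [torch.tensor([[32., 32., 128., 128.]])]
#          tokens = build_visual_tokens(feats, box, spatial_scale=24 / 336)
```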
Do Large Models "Crash" on Embodied Reasoning? 4,496 Questions Comprehensively Expose the Shortcomings
机器之心· 2025-10-28 00:41
Core Insights
- The article focuses on the evaluation of multimodal large language models (MLLMs) in embodied intelligence tasks, providing detailed failure analysis and proposing an agent algorithm for improvement [25].
Group 1: Embodied Intelligence and MLLMs
- Embodied intelligence is a concept where an agent can complete a closed loop of perception, understanding, and decision-making in an environment, relying on various skills [2].
- Many excellent works have deployed MLLMs in different applications of embodied intelligence, but evaluations have mainly focused on subfields like pointing and spatial reasoning [2][4].
Group 2: BEAR Benchmark
- The BEAR benchmark was proposed by Northeastern University in collaboration with other institutions to systematically evaluate MLLMs across various sub-capabilities, providing detailed error analysis and algorithm enhancements [4].
- BEAR includes 4,469 image-video-text VQA tasks and covers six major categories, including five foundational categories and a sixth long-range reasoning category, breaking down tasks into 14 different skills [8][9].
Group 3: Evaluation Results
- The evaluation measured 20 different MLLMs, revealing that the best-performing model, GPT-5, only achieved a 52% success rate on the BEAR benchmark [11].
- Closed-source models generally performed better than open-source models, although some open-source models like the InternVL series showed strong potential, outperforming models like GPT-4o and Claude [11].
Group 4: Error Analysis
- A fine-grained error analysis of GPT-4o revealed interesting findings, indicating that the model's visual capabilities are a major bottleneck across multiple categories, particularly in language grounding and trajectory understanding [19].
- The analysis showed that 88% of errors in long-range reasoning were attributed to lower-level perception and spatial reasoning issues [19].
Group 5: BEAR-Agent Development
- The authors developed BEAR-Agent, a multimodal agent designed to enhance visual reasoning capabilities by providing tools and drawing auxiliary lines, significantly improving performance on the BEAR benchmark (see the sketch after this summary) [17].
- The performance of both the best open-source model (InternVL3-14B) and the closed-source model (GPT-5) improved significantly with the integration of BEAR-Agent [17].
Group 6: Simulation Testing
- Further experiments in a desktop manipulation environment demonstrated that BEAR-Agent improved the performance of MOKA by 20.17%, indicating its potential for embodied agents [21].
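To illustrate the tool-use pattern attributed to BEAR-Agent (providing tools such as drawing auxiliary lines before answering), here is a small hypothetical agent loop in Python. The tool name, argument format, and the `mllm_step` interface are illustrative assumptions, not the BEAR-Agent implementation.

```python
# Hypothetical sketch of a tool-augmented loop in the spirit of BEAR-Agent: the model may
# request a visual tool (here, drawing an auxiliary line on the image) before answering.
from PIL import Image, ImageDraw

def draw_auxiliary_line(image: Image.Image, p0, p1, color="red", width=3) -> Image.Image:
    """Return a copy of the image with an auxiliary line drawn from p0 to p1."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).line([tuple(p0), tuple(p1)], fill=color, width=width)
    return annotated

def agent_loop(image, question, mllm_step, max_tool_calls=3):
    """mllm_step(image, question) -> {"tool": "draw_line", "args": {...}} or {"answer": str}."""
    for _ in range(max_tool_calls):
        step = mllm_step(image, question)
        if "answer" in step:
            return step["answer"]
        if step.get("tool") == "draw_line":
            image = draw_auxiliary_line(image, **step["args"])   # feed annotated image back
    return mllm_step(image, question).get("answer")
```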
NeurIPS 2025 | Cracking Closed-Source Multimodal Large Models: A Novel Adversarial Attack Based on Optimal Feature Alignment
机器之心· 2025-10-17 04:09
Core Insights
- The article discusses the advancements and security vulnerabilities of Multimodal Large Language Models (MLLMs), particularly their susceptibility to adversarial attacks [2][8].
- It introduces a novel attack framework called FOA-Attack, which enhances the transferability of adversarial samples across different models by optimizing feature alignment at both global and local levels [3][11].
Group 1: Background and Motivation
- MLLMs like GPT-4 and Claude-3 exhibit exceptional performance in tasks such as image understanding and visual question answering, but they inherit vulnerabilities from their visual encoders, making them prone to adversarial attacks [8][10].
- Adversarial attacks can be categorized into non-targeted (aiming to produce incorrect outputs) and targeted (aiming for specific outputs), with the latter being particularly challenging in black-box scenarios where model internals are inaccessible [10][11].
Group 2: FOA-Attack Framework
- FOA-Attack employs a dual-dimensional alignment strategy, focusing on both global features (using cosine similarity loss for [CLS] tokens) and local features (using clustering and optimal transport for patch tokens) to improve transferability (a loss sketch follows this summary) [6][11].
- The framework includes a dynamic weight integration strategy that adapts the influence of multiple models during the attack generation process, enhancing the overall effectiveness of the attack [6][11].
Group 3: Experimental Results
- FOA-Attack significantly outperforms existing state-of-the-art methods on both open-source and closed-source MLLMs, achieving remarkable success rates, particularly against commercial closed-source models like GPT-4 [4][19].
- In experiments, FOA-Attack achieved an attack success rate (ASR) of 75.1% against GPT-4, showcasing its effectiveness in real-world applications [19][24].
Group 4: Conclusion and Future Directions
- The findings highlight the vulnerabilities of current MLLMs in the visual encoding phase and suggest new defensive strategies, particularly in fortifying local feature robustness [24][25].
- The authors have made the paper and code publicly available for further exploration and discussion, indicating a commitment to advancing research in this area [25][27].
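Based only on the description above, a dual-level alignment objective of this kind could combine a cosine loss on global [CLS] features with an entropic optimal-transport (Sinkhorn) cost between sets of local patch or cluster-center features. The following is a hedged sketch of such a loss, not the authors' released code; the dynamic multi-model weighting strategy is omitted.

```python
# Sketch of a dual-level feature-alignment objective in the spirit of FOA-Attack as summarized
# above: global cosine loss on [CLS] features plus a log-domain Sinkhorn optimal-transport cost
# between local (e.g. clustered patch) features. Follows the summary only, not the paper's code.
import math
import torch
import torch.nn.functional as F

def global_alignment_loss(cls_adv, cls_tgt):
    """1 - cosine similarity between global [CLS] features of shape (B, D)."""
    return 1.0 - F.cosine_similarity(cls_adv, cls_tgt, dim=-1).mean()

def sinkhorn_ot_loss(local_adv, local_tgt, eps=0.05, n_iters=50):
    """Entropic-OT cost between two local feature sets of shapes (N, D) and (M, D)."""
    cost = 1.0 - F.normalize(local_adv, dim=-1) @ F.normalize(local_tgt, dim=-1).T  # (N, M)
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))      # uniform source marginal, log domain
    log_nu = torch.full((m,), -math.log(m))      # uniform target marginal, log domain
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(n_iters):                     # log-domain Sinkhorn updates
        f = eps * (log_mu - torch.logsumexp((g.unsqueeze(0) - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f.unsqueeze(1) - cost) / eps, dim=0))
    plan = torch.exp((f.unsqueeze(1) + g.unsqueeze(0) - cost) / eps)
    return (plan * cost).sum()

def foa_style_loss(cls_adv, cls_tgt, local_adv, local_tgt, w_local=1.0):
    """Global + local alignment, mirroring the two-level description in the summary."""
    return global_alignment_loss(cls_adv, cls_tgt) + w_local * sinkhorn_ot_loss(local_adv, local_tgt)
```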
The Scene Stands Still, the Agent Moves: How Do MLLMs Cope with a Real World Where the View Changes at Every Step? OST-Bench Reveals Multimodal Large Models' Weaknesses in Online Spatio-Temporal Understanding
36Ke· 2025-10-14 08:54
Core Insights
- The introduction of OST-Bench presents a new challenge for multimodal large language models (MLLMs) by focusing on dynamic online scene understanding, contrasting with traditional offline benchmarks [1][3][12].
- OST-Bench emphasizes the necessity for models to perform real-time perception, memory maintenance, and spatiotemporal reasoning based on continuous local observations [3][4][12].
Benchmark Characteristics
- OST-Bench is designed to reflect real-world challenges more accurately than previous benchmarks, featuring two main characteristics: online settings requiring real-time processing and cross-temporal understanding that integrates current and historical information [3][4][12].
- The benchmark categorizes dynamic scene understanding into three information types: agent spatial state, visible information, and agent-object spatial relationships, leading to the creation of 15 sub-tasks [7][12].
Experimental Results
- The performance of various models on OST-Bench reveals significant gaps between current MLLMs and human-level performance, particularly in complex spatiotemporal reasoning tasks [12][21].
- Models like Claude-3.5-Sonnet and GPT-4.1 show varying degrees of success across different tasks, with human-level performance significantly higher than that of the models [9][10][12].
Model Limitations
- Current MLLMs exhibit a tendency to take shortcuts in reasoning, often relying on limited information rather than comprehensive spatiotemporal integration, which is termed a "spatio-temporal reasoning shortcut" [15][18].
- The study identifies that the models struggle with long-sequence online settings, indicating a need for improved mechanisms for complex spatial reasoning and long-term memory retrieval [12][21].
Future Directions
- The findings from OST-Bench suggest that enhancing complex spatial reasoning capabilities and long-term memory mechanisms will be crucial for the next generation of multimodal models to achieve real-world intelligence [22].
The Scene Stands Still, the Agent Moves: How Do MLLMs Cope with a Real World Where the View Changes at Every Step? OST-Bench Reveals Multimodal Large Models' Weaknesses in Online Spatio-Temporal Understanding
机器之心· 2025-10-14 06:33
Core Insights
- The article discusses the introduction of OST-Bench, a new benchmark for evaluating multimodal large language models (MLLMs) in dynamic online environments, emphasizing the challenges of real-world embodied perception and reasoning [2][24].
Group 1: Benchmark Characteristics
- OST-Bench reflects the core challenges of embodied perception in real-world settings, contrasting with traditional offline benchmarks that do not account for dynamic scene exploration [2][7].
- The benchmark is designed to assess models' abilities to perform real-time perception, memory maintenance, and spatiotemporal reasoning based on continuous local observations (see the sketch after this summary) [7][10].
- It includes 15 sub-tasks categorized into judgment, estimation, counting, and temporal localization, with a dataset comprising 10,000 test samples and 50,000 training samples [8][10].
Group 2: Model Performance and Challenges
- Current mainstream MLLMs show significant performance gaps compared to human capabilities, particularly in cross-temporal information reasoning [17].
- Models struggle with complex spatiotemporal reasoning tasks, often resorting to "spatio-temporal reasoning shortcuts," leading to superficial answers without adequate reasoning [18][21].
- Fine-tuning experiments indicate that while models can improve their scores by over 10% with additional training data, they still fail to achieve over 50% accuracy in complex reasoning tasks, highlighting the need for better model design and training strategies [23][24].
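As an illustration of the "online" protocol described above, where observations arrive incrementally and each question must be answered from the history seen so far, here is a tiny evaluation-loop sketch. The episode format and the `model.answer` interface are assumptions for illustration, not OST-Bench's actual harness.

```python
# Illustrative sketch of an online evaluation protocol like the one OST-Bench describes:
# views arrive one by one, the model answers each question using only the history so far.
def evaluate_online_episode(model, episode):
    """episode: list of {"view": image, "question": str or None, "answer": str or None} turns."""
    history, n_correct, n_questions = [], 0, 0
    for turn in episode:
        history.append(turn["view"])                    # the agent keeps exploring the scene
        if turn["question"] is None:
            continue
        pred = model.answer(history, turn["question"])  # must reason over all views seen so far
        n_questions += 1
        n_correct += int(pred.strip().lower() == turn["answer"].strip().lower())
    return n_correct / max(n_questions, 1)
```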
Writing Captions for Geometry Images Makes AI Smarter: UIUC Releases a High-Quality, Generalizable Geometry Dataset
机器之心· 2025-09-25 23:54
Core Viewpoint
- The article discusses advancements in multimodal large language models (MLLMs) and introduces a new framework called Geo-Image-Textualization, which addresses the limitations in geometric reasoning tasks by ensuring complete alignment between visual and textual information [1][21].
Group 1: Framework and Dataset
- A research team from UIUC has proposed a reinforcement learning-based data generation and optimization framework called Geo-Image-Textualization, along with the release of the first fully aligned high-quality geometric image-text dataset, GeoReasoning-10K, which contains 10,000 carefully constructed image-description pairs [2][3].
- The GeoReasoning-10K dataset and related code have been made publicly available to promote community development [3][5].
Group 2: Innovations and Performance
- The core innovations of the framework include a generation process for image-caption-question/answer pairs, which enhances the model's performance in geometric reasoning tasks (a toy illustration follows this summary) [6][8].
- The trained model demonstrates strong generalization, performing well not only on geometric tasks but also on arithmetic, algebra, and numerical reasoning, even with non-geometric image inputs [8].
- Models trained on GeoReasoning outperform those trained on other similar datasets in downstream tasks and exhibit good scalability [8][12].
Group 3: Experimental Results
- On the authoritative mathematical reasoning benchmarks MathVista and MathVerse, GeoReasoning-10K achieved the best results compared to other geometric captioning datasets, showcasing superior data quality and extensibility [12][14].
- The article presents specific examples from the MathVista benchmark, illustrating the model's ability to solve complex geometric problems effectively [16][21].
Group 4: Future Implications
- The Geo-Image-Textualization framework and the GeoReasoning-10K dataset provide a new approach to overcoming bottlenecks in geometric reasoning, enhancing the overall mathematical reasoning capabilities of AI models, and paving the way for applications in education and scientific computation [21][22].
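To give a flavor of the image-caption-question/answer construction idea, here is a toy sketch that derives a caption and a QA pair from a randomly sampled right triangle. The templates and fields are purely illustrative assumptions; the actual GeoReasoning-10K pipeline is reinforcement-learning-optimized and far more elaborate.

```python
# Toy sketch of the "figure -> caption -> question/answer" construction idea described above,
# for a synthetic right triangle. Templates are illustrative assumptions, not the real pipeline.
import math
import random

def sample_right_triangle():
    """Sample two legs and derive the quantities a caption/QA pair needs."""
    a, b = random.randint(3, 12), random.randint(3, 12)
    return {"leg_a": a, "leg_b": b, "hypotenuse": math.hypot(a, b)}

def caption_and_qa(tri):
    caption = (f"Right triangle ABC with the right angle at C, "
               f"AC = {tri['leg_a']} and BC = {tri['leg_b']}.")
    question = "What is the length of AB?"
    answer = f"{tri['hypotenuse']:.2f}"
    return {"caption": caption, "question": question, "answer": answer}

# Example: caption_and_qa(sample_right_triangle()) might yield a 5-12-13 instance,
# i.e. answer "13.00" for the caption "Right triangle ABC ... AC = 5 and BC = 12."
```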