Spatial Reasoning
With Geometric Constraints, VLMs Cross the Cognitive Gap in Spatial Reasoning
机器之心· 2026-01-12 06:35
Current vision-language models (VLMs) widely exhibit a "Semantic-to-Geometric Gap": they not only confuse basic directions but also struggle with precise spatial quantification. Asked, for example, "When you sit on the sofa, which side is the dining table on?", a VLM often answers incorrectly. The gap arises because the model's semantic space cannot carry high-fidelity geometric detail, so its spatial reasoning amounts to guessing: the model reads the semantics of the scene but stays stuck in the "world of language," lacking the geometric intuition on which the real world runs, and its spatial judgments are riddled with errors.

To address this pain point, a research team from Beihang University and the Shanghai AI Laboratory proposed the Geometrically-Constrained Agent (GCA), establishing a new spatial-reasoning paradigm of "formalize constraints first, then compute deterministically." Instead of relying on large-scale fine-tuning data, GCA constructs formal task constraints that force the VLM to move from fuzzy intuition to exact solving: it invokes visual tools and writes calculation code to perform parameterized computation, building a verifiable, deterministic geometric bridge for spatial reasoning. GCA directly lifts base models such as Qwen and Gemini to a new capability level. On the notoriously difficult MMSI-Bench, GCA raises the ...
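To make the "formalize constraints first, then compute deterministically" idea concrete, here is a minimal, hypothetical sketch of the kind of deterministic geometric check that replaces free-form guessing for the sofa/dining-table question: once visual tools have produced the observer's position, facing direction, and the object's position, the left/right judgment reduces to a signed cross product. The function name and interface are illustrative assumptions, not GCA's actual code.

```python
import numpy as np

def side_of_object(observer_pos, observer_facing, target_pos):
    """Deterministically decide whether target_pos lies to the observer's left
    or right, given the observer's 2D ground-plane position and facing direction."""
    forward = np.asarray(observer_facing, dtype=float)
    forward /= np.linalg.norm(forward)                       # unit facing vector
    to_target = np.asarray(target_pos, float) - np.asarray(observer_pos, float)
    # Signed 2D cross product: positive -> left of the facing direction, negative -> right.
    cross = forward[0] * to_target[1] - forward[1] * to_target[0]
    if abs(cross) < 1e-6:
        return "directly ahead or behind"
    return "left" if cross > 0 else "right"

# Example: sitting on the sofa at (0, 0) facing +x, dining table detected at (2, -1.5).
print(side_of_object((0, 0), (1, 0), (2, -1.5)))             # -> "right"
```

In GCA's paradigm, the VLM supplies the coordinates via visual tool calls and writes the calculation code itself, so the final spatial judgment comes from computation rather than intuition.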
New SOTA in Complex Spatial Reasoning with a 55% Performance Gain: SpatialDreamer from Sun Yat-sen University
36Kr· 2025-12-22 10:12
[Overview] Sun Yat-sen University and collaborating institutions introduce SpatialDreamer, which markedly improves performance on complex spatial tasks through active mental imagery and spatial reasoning. By simulating the human process of active exploration, imagination, and reasoning, it addresses the limitations of existing models on tasks such as viewpoint changes and opens a new path for spatial intelligence in AI.

Paper: https://arxiv.org/pdf/2512.07733

Although multimodal large language models (MLLMs) have made notable progress in scene understanding, their performance on complex spatial reasoning tasks that require mental simulation remains limited. Existing methods mostly rely on passive observation of spatial data and lack the ability, characteristic of human spatial cognition, to actively imagine and dynamically update internal representations. For instance, on tasks that require changing viewpoint to judge the position of an occluded object, current models often fail because they reason from a single view.

To this end, a team from MBZUAI and Sun Yat-sen University proposed SpatialDreamer, a reinforcement-learning-based framework that endows MLLMs with human-like spatial mental simulation through a closed loop of active exploration, visual imagination, and evidence fusion. SpatialDreamer mirrors the human spatial-cognition process with a closed-loop reasoning pipeline of three steps (a minimal sketch of the loop follows this summary): 1) Explore: the model infers the optimal egocentric action for the current scene (e.g., "move forward 0.75 m" or "turn left 45°"); 2) Imagine: it calls a world model (such as S ...
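As a rough illustration of the explore-imagine-fuse closed loop described above, the following sketch wires the three steps together. The callables `policy`, `world_model`, and `reasoner`, as well as all names and the stopping rule, are hypothetical placeholders standing in for the RL-trained MLLM policy, the generative world model, and the answer head; they are not SpatialDreamer's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DreamerState:
    observations: list = field(default_factory=list)   # accumulated real + imagined views
    answer: Optional[str] = None

def spatial_dreamer_loop(question, scene_image, policy, world_model, reasoner, max_steps=4):
    """Closed-loop explore -> imagine -> fuse cycle, in the spirit of SpatialDreamer."""
    state = DreamerState(observations=[scene_image])
    for _ in range(max_steps):
        # 1) Explore: propose an egocentric action such as "move forward 0.75 m" or "turn left 45 deg".
        action = policy(question, state.observations)
        # 2) Imagine: the world model renders the view expected after taking that action.
        imagined_view = world_model(state.observations[-1], action)
        state.observations.append(imagined_view)
        # 3) Fuse: combine real and imagined evidence; stop once the reasoner is confident.
        answer, confident = reasoner(question, state.observations)
        if confident:
            state.answer = answer
            break
    return state.answer
```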
Large Models Diagnosed with "Visual Illiteracy"! Multiple Universities Jointly Propose MILO to Implant Spatial Imagination
量子位· 2025-12-04 09:55
Core Insights
- The article discusses the limitations of multimodal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, leading to a phenomenon termed "visual illiteracy" [2][3].

Group 1: Challenges in Spatial Reasoning
- Spatial reasoning is identified as a core cognitive ability for humans to understand three-dimensional structures, and it poses a significant challenge for MLLMs in practical applications [2].
- Current methods primarily rely on "language description tuning," which fails to give models a true visual understanding of spatial concepts [2][3].

Group 2: Introduction of MILO
- A research team has proposed MILO (Implicit Spatial World Modeling) to address the spatial reasoning challenges faced by MLLMs by integrating visual generative feedback with symbolic reasoning [4].
- MILO employs a two-phase training process: the first phase is visual generative tuning, where the model learns spatial transformations through visual outputs; the second phase is language tuning on spatial instruction data [5].

Group 3: Enhancements in Geometric Perception
- To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets (a sketch of the relative-pose idea follows this summary) [8][9].

Group 4: GeoGen Dataset
- The research team constructed the GeoGen dataset, comprising approximately 2,241 videos and 267,000 "observation-action-result" triplets, aimed at enhancing geometric perception generation [10].
- The dataset draws on diverse sources such as scanned 3D scenes and internet videos, covering a wide range of realistic scenarios [11].

Group 5: Validation of MILO
- The effectiveness of MILO was validated across multiple baseline models and five categories of spatial understanding tasks, achieving the best performance on 3D scene understanding and spatial reasoning tasks [12][16].
- Notably, MILO improved accuracy by 3.2% on the ScanRefer task and reached an average accuracy of 61.7% on the VSI-Bench spatial reasoning task, surpassing the VG-LLM baseline by 2.2% [16].
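The RePE component, as summarized above, encodes relative transformations between adjacent frames rather than positions in a global coordinate system. Below is a minimal sketch of that relative-pose idea, computed from per-frame camera poses; the function and matrix convention are illustrative assumptions, not MILO's implementation.

```python
import numpy as np

def relative_transforms(cam_to_world):
    """Given a list of 4x4 camera-to-world pose matrices (one per frame), return the
    frame-to-frame relative transforms expressing frame i's camera in frame i+1's
    coordinates -- relative rather than global pose, the idea behind RePE-style encodings."""
    rels = []
    for prev, curr in zip(cam_to_world[:-1], cam_to_world[1:]):
        rels.append(np.linalg.inv(curr) @ prev)   # T_rel = inv(T_curr) @ T_prev
    return rels

# Example: two frames, the second translated 0.5 m along x in world coordinates.
frame0 = np.eye(4)
frame1 = np.eye(4)
frame1[0, 3] = 0.5
print(relative_transforms([frame0, frame1])[0][:3, 3])   # -> [-0.5  0.   0. ]
```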
NeurIPS 2025 | SURDS Dataset and GRPO Comprehensively Strengthen Spatial Reasoning for Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Core Insights
- The article discusses the challenges of achieving accurate spatial reasoning in autonomous driving scenarios using vision-language models (VLMs), highlighting the lack of large-scale benchmarks in this area [2][20].
- A new benchmark called SURDS has been introduced to systematically evaluate the spatial reasoning capabilities of VLMs, revealing significant shortcomings in current models [4][20].

Benchmark Overview
- SURDS is a large-scale benchmark built on the nuScenes dataset, consisting of 41,080 visual-question training instances and 9,250 evaluation samples, covering six spatial categories: direction recognition, pixel-level localization, depth estimation, distance comparison, left-right ordering, and front-back relationships [4][20].
- The dataset includes diverse multimodal information collected from urban environments in Boston and Singapore, ensuring realistic testing scenarios [6][20].

Model Training and Evaluation
- The research emphasizes the importance of data generation and introduces a novel automated process for generating high-quality reasoning chains, which enhances the model's spatial reasoning capabilities [8][10].
- A reinforcement learning framework combining spatial localization rewards with a logical-consistency objective was designed, leading to significant performance improvements across tasks (a sketch of such a combined reward follows this summary) [11][20].

Experimental Results
- The evaluation shows notable differences between models on spatial reasoning tasks, with the proposed model achieving a nearly 60% improvement in depth estimation accuracy over the second-best model [14][20].
- Most existing models struggle with single-object tasks, often performing close to random, indicating a need for better learning of absolute pose and metric information [16][20].

Training Strategy Insights
- Ablation studies indicate that combining localization and logical rewards significantly enhances model performance, underscoring the foundational role of localization ability in spatial reasoning [16][18].
- The research also highlights that model parameter scale does not directly correlate with spatial understanding capability, suggesting that simply increasing model size is insufficient [16][20].
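The training section above describes a reinforcement learning framework that combines a spatial localization reward with a logical-consistency objective. The sketch below shows one plausible shape for such a combined reward — an IoU-based localization term plus a correctness term — scored per rollout as a GRPO-style pipeline would. The weights, box format, and function names are assumptions for illustration, not SURDS's exact formulation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_reward(pred_box, gt_box, pred_answer, gt_answer, w_loc=0.5, w_logic=0.5):
    """Combined reward: a localization term (how well the cited region matches the
    ground-truth box) plus a logical-consistency term (whether the final spatial
    answer is correct). Weights are illustrative, not from the paper."""
    r_loc = box_iou(pred_box, gt_box)
    r_logic = 1.0 if pred_answer == gt_answer else 0.0
    return w_loc * r_loc + w_logic * r_logic

# Example usage when scoring a single rollout:
print(spatial_reward((10, 10, 50, 60), (12, 8, 52, 58), "left", "left"))  # ~0.92
```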
AI Lab's Latest InternSpatial: A VLM Spatial Reasoning Dataset That Significantly Boosts Model Capability
具身智能之心· 2025-06-24 14:09
Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning tasks, highlighting the need for improved datasets and methodologies to enhance performance across scenarios [3][12].

Dataset Limitations
- Existing spatial reasoning datasets, which InternSpatial is designed to address, have three main limitations:
  1. Limited scene diversity, focusing primarily on indoor and outdoor environments while lacking contexts such as driving and embodied navigation [3].
  2. Restricted instruction formats, supporting only natural language or region masks, which do not cover the variety of queries found in real-world applications [3].
  3. Lack of multi-view supervision, with over 90% of data focused on single-image reasoning, failing to model spatiotemporal relationships across views [3].

Evaluation Benchmark
- The InternSpatial-Bench evaluation benchmark includes 6,008 QA pairs across five tasks: position comparison, size comparison, rotation estimation, object counting, and existence estimation [7].
- The benchmark also introduces 1,000 additional QA pairs for multi-view rotation angle prediction [7].

Data Engine Design
- The data engine employs a three-stage automated pipeline (a template-based QA sketch follows this summary):
  1. Annotation generation, using existing annotations or SAM2 for mask generation [9].
  2. View alignment, constructing a standard 3D coordinate system [9].
  3. Template-based QA generation from predefined task templates [9].

Experimental Results
- Spatial reasoning performance has improved, with InternVL-Spatial-8B showing a 1.8% increase in position comparison accuracy and a 17% increase in object counting accuracy compared to its predecessor [10].
- The model's performance across tasks shows significant gains, particularly on multi-view tasks [10].

Instruction Format Robustness
- Current models exhibit a 23% accuracy drop when using the <box> format, while training with InternSpatial reduces the gap between formats to within 5% [12].
- However, automated QA generation struggles to replicate the complexity of natural language, indicating a need for further refinement [12].
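The third stage of the data engine is template-based QA generation from predefined task templates. The snippet below sketches how position-comparison and counting questions could be filled in from view-aligned object annotations; the template strings, field names, and left/right convention are hypothetical, not the actual InternSpatial pipeline.

```python
import random

# Hypothetical templates for two of the InternSpatial-Bench task types;
# the real pipeline uses its own predefined templates and annotations.
TEMPLATES = {
    "position_comparison": "Is the {obj_a} to the left of the {obj_b} from the camera's viewpoint?",
    "object_counting": "How many {category} instances are visible in the image?",
}

def generate_qa(objects):
    """Generate simple QA pairs from objects with 3D centroids:
    objects = [{"name": "chair", "category": "chair", "xyz": (x, y, z)}, ...].
    Assumes a camera-aligned coordinate system where smaller x means further left."""
    qa_pairs = []
    if len(objects) >= 2:
        a, b = random.sample(objects, 2)
        q = TEMPLATES["position_comparison"].format(obj_a=a["name"], obj_b=b["name"])
        qa_pairs.append((q, "yes" if a["xyz"][0] < b["xyz"][0] else "no"))
    for cat in {o["category"] for o in objects}:
        q = TEMPLATES["object_counting"].format(category=cat)
        qa_pairs.append((q, str(sum(o["category"] == cat for o in objects))))
    return qa_pairs

print(generate_qa([
    {"name": "red chair", "category": "chair", "xyz": (-1.0, 0.0, 3.0)},
    {"name": "table", "category": "table", "xyz": (0.5, 0.0, 2.5)},
]))
```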
Multimodal Models Take On the Beijing and Hangzhou Metro Maps! o3 Scores Notably Well, but Still Falls Short of Humans
量子位· 2025-06-07 05:02
Contributed by the ReasonMap team to 量子位 | QbitAI.

In recent years, large language models (LLMs) and multimodal large models (MLLMs) have made breakthrough progress on a wide range of scene understanding and complex reasoning tasks. Yet a key question is still worth asking: can MLLMs really "read" a diagram? In particular, when facing structurally complex, detail-dense images, do they possess fine-grained visual understanding and spatial reasoning ability, say, when challenged with a high-resolution metro map?

To find out, a team from Westlake University, the National University of Singapore, Zhejiang University, and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. The Beijing and Hangzhou metro maps clearly stumped a large share of models. ReasonMap is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly metro maps), designed to assess how well large models understand fine-grained, structured spatial information in images.

The results show that mainstream open-source multimodal models hit a clear performance bottleneck on ReasonMap, with visual confusion or missed stations especially common in cross-line route planning (a toy route-planning sketch follows this summary). Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models along several dimensions, but still fall clearly short of human level. Across metro maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...
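Cross-line route planning is where models stumble most on ReasonMap. For reference, the ground-truth answer to such a question is itself deterministic: a shortest-path search over the station graph. The toy sketch below uses breadth-first search on a made-up two-line network; the station names and graph format are illustrative and unrelated to ReasonMap's data.

```python
from collections import deque

def shortest_route(edges, start, goal):
    """Breadth-first search over an undirected station graph.
    `edges` maps a station to the stations directly reachable from it
    (possibly on different lines, which is what makes cross-line planning hard)."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy two-line network with an interchange at "C" (stations are hypothetical).
metro = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D", "X"],
    "D": ["C"], "X": ["C", "Y"], "Y": ["X"],
}
print(shortest_route(metro, "A", "Y"))  # -> ['A', 'B', 'C', 'X', 'Y']
```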