空间推理 - filings, earnings calls, financial reports, news

空间推理

Search documents

NeurIPS 2025 | SURDS 数据集与 GRPO 全面强化自驾空间推理

自动驾驶之心· 2025-09-27 23:33

以下文章来源于深蓝AI ，作者深蓝学院深蓝AI . 专注于人工智能、机器人与自动驾驶的学习平台。作者 | 深蓝学院来源 | 深蓝AI 点击下方卡片，关注" 自动驾驶之心 "公众号戳我-> 领取自动驾驶近30个方向学习路线 >>自动驾驶前沿信息获取 → 自动驾驶之心知识星球本文只做学术分享，如有侵权，联系删文摘要在大模型飞速发展的当下，让多模态大语言模型（VLM）在自动驾驶场景图像中做出准确的空间推理，依然是人工智能领域的一大挑战。学术界一直缺乏针对自动驾驶场推理的大规模基准，现有方法往往依赖外部专家模型，难以全面衡量模型能力。与此形成鲜明对比的是，人类可以凭借已有知识轻松判断图像中物体的朝向，或推理多个物体的相对位置。而VLM同样具备丰富的知识，却仍在此类任务上表现不足。为此，武汉大学联合中科院自动化所，北京智源人工智能研究院 (BAAI)等多家单位推出首个面向驾驶场景的VLM空间推理大规模基准 SURDS ，系统评测了包括 GPT 系列在内的通用模型及 SpatialRGPT 等空间推理模型，全面揭示了当前VLM在空间理解方面的短板。研究团队通过设计"感知准确性"和" ...

具身智能之心· 2025-06-24 14:09

Core Insights - The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning tasks, highlighting the need for improved datasets and methodologies to enhance performance in various scenarios [3][12]. Dataset Limitations - The existing InternSpatial dataset has three main limitations: 1. Limited scene diversity, focusing primarily on indoor and outdoor environments, lacking diverse contexts like driving and embodied navigation [3]. 2. Restricted instruction formats, only supporting natural language or region masks, which do not encompass the variety of queries found in real-world applications [3]. 3. Lack of multi-view supervision, with over 90% of data focusing on single-image reasoning, failing to model spatiotemporal relationships across views [3]. Evaluation Benchmark - The InternSpatial-Bench evaluation benchmark includes 6,008 QA pairs across five tasks, assessing position comparison, size comparison, rotation estimation, object counting, and existence estimation [7]. - The benchmark also introduces 1,000 additional QA pairs for multi-view rotation angle prediction [7]. Data Engine Design - The data engine employs a three-stage automated pipeline: 1. Annotation generation using existing annotations or SAM2 for mask generation [9]. 2. View alignment to construct a standard 3D coordinate system [9]. 3. Template-based QA generation with predefined task templates [9]. Experimental Results - Spatial reasoning performance has improved, with InternVL-Spatial-8B showing a 1.8% increase in position comparison accuracy and a 17% increase in object counting accuracy compared to its predecessor [10]. - The model's performance across various tasks demonstrates significant enhancements, particularly in multi-view tasks [10]. Instruction Format Robustness - Current models exhibit a 23% accuracy drop when using the <box> format, while training with InternSpatial reduces the gap between different formats to within 5% [12]. - However, the automated QA generation struggles to replicate the complexity of natural language, indicating a need for further refinement [12].

InternSpatial-Bench评估基准

InternSpatial-Bench评估基准

多模态模型挑战北京杭州地铁图！o3成绩显著，但跟人类有差距

量子位· 2025-06-07 05:02

ReasonMap团队投稿量子位 | 公众号 QbitAI 近年来，大语言模型（LLMs）以及多模态大模型（MLLMs）在多种场景理解和复杂推理任务中取得突破性进展。然而，一个关键问题仍然值得追问：多模态大模型（MLLMs），真的能"看懂图"了吗？特别是在面对结构复杂、细节密集的图像时，它们是否具备细粒度视觉理解与空间推理能力，比如挑战一下高清地铁图这种。为此，来自西湖大学、新加坡国立大学、浙江大学、华中科技大学的团队提出了一个全新的评测基准 ReasonMap 。看得出来北京、杭州的地铁图难倒了一大片模型。这是首个聚焦于高分辨率交通图（主要为地铁图）的多模态推理评测基准，专为评估大模型在理解图像中细粒度的结构化空间信息方面的能力而设计。结果发现，当前主流开源的多模态模型在ReasonMap上面临明显性能瓶颈，尤其在跨线路路径规划上常出现视觉混淆或站点遗漏。而经强化学习后训练的闭源推理模型（如 GPT-o3）在多个维度上显著优于现有开源模型，但与人类水平相比仍存在明显差距。在面对不同国家地区的地铁图中，四个代表性 MLLM（Qwen2.5-VL-72B-I（蓝色）、 I ...