Spatial Reasoning
CVPR 2026 Workshop Call for Papers | From Perception to Reasoning: ViSCALE 2.0 Invites You to Reshape Computer Vision's System 2
机器之心· 2026-02-13 04:19
Core Insights
- The article discusses the evolution of computer vision towards a new paradigm, emphasizing the transition from basic pixel perception to complex spatial reasoning and world modeling, facilitated by Test-time Scaling (TTS) [2][5]
- The upcoming ViSCALE 2026 conference aims to gather leading scholars to explore breakthroughs in visual models through computational expansion, focusing on deep reasoning rather than mere static outputs [4][5]

Group 1: Conference Highlights
- ViSCALE 2026 will feature discussions on spatial intelligence and world models, with contributions from top scholars including Sergey Levine, Manling Li, and Ziwei Liu [5]
- The conference encourages innovative research submissions that challenge existing visual model limitations, providing a platform for both theoretical and application-focused studies [7]

Group 2: Key Topics of Discussion
- The conference will cover various topics, including:
  - Enhancing video generation's physical consistency and long-term causal reasoning through TTS [6]
  - Breaking 2D limitations to enable models to navigate and operate in 3D spaces like humans [6]
  - Developing visual reasoning chains that allow models to self-correct and engage in multi-step reasoning [6]
  - Exploring scaling laws that relate computational load during testing to visual reasoning performance [6]

Group 3: Submission Details
- The conference invites submissions in two tracks: Full Papers (8 pages) and Extended Abstracts (up to 4 pages), with specific formatting requirements [9]
- Important deadlines include submission by March 10, 2026, and notification of acceptance by March 18, 2026 [9]
With Geometric Constraints, VLMs Cross the Cognitive Gap in Spatial Reasoning
机器之心· 2026-01-12 06:35
Core Insights
- The article discusses the "Semantic-to-Geometric Gap" in existing Visual Language Models (VLMs), which struggle with precise spatial reasoning tasks, leading to incorrect answers in spatial queries [2][6]

Group 1: Problem Identification
- The "Semantic-to-Geometric Gap" arises because VLMs compress rich pixel information into abstract semantic features, losing the high-fidelity geometric details necessary for accurate spatial reasoning [7]
- VLMs lack the ability to form precise geometric imaginations, which hampers their performance in complex spatial reasoning scenarios [7]

Group 2: Proposed Solution
- A research team from Beihang University and Shanghai AI Lab introduced the Geometrically-Constrained Agent (GCA), which employs a new paradigm of "formalizing constraints before deterministic computation" to enhance spatial reasoning capabilities [4]
- GCA does not rely on massive data fine-tuning but instead uses formal task constraints to shift VLMs from "fuzzy intuition" to "precise solving," creating a verifiable geometric bridge for spatial reasoning [4]

Group 3: Performance Improvement
- GCA improved model performance by nearly 50% on the challenging MMSI-Bench test, establishing a new state-of-the-art (SOTA) in spatial reasoning [4][14]
- GCA's average accuracy of 65.1% surpasses existing training-based and tool-integrated methods, particularly on complex spatial reasoning tasks [15]

Group 4: Generalizability and Versatility
- GCA is a training-free universal reasoning paradigm that can empower various foundational models, achieving an average relative performance improvement of about 37% on MMSI-Bench [16]
- The GCA framework demonstrated exceptional performance: after integration, the Gemini-2.5-Pro model's accuracy rose from 36.9% to 55.0% [16]
Group 5: Methodology
- GCA's approach involves two stages: formalizing tasks from "fuzzy instructions" into "precise rules," then performing deterministic geometric calculations within the established constraints [9][12]
- The framework includes intelligent tool scheduling and binding, ensuring seamless integration of perception and computation tools to achieve reliable spatial reasoning [20]

Group 6: Conclusion and Implications
- GCA represents a new paradigm of "language-defined constraints and geometric execution," effectively transforming vague spatial queries into constrained mathematical problems, enhancing reasoning accuracy and moving machines closer to possessing "geometric intuition" [24]
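The two-stage "formalize, then compute" idea can be illustrated with a minimal sketch. This is a hypothetical toy, not GCA's actual implementation: the `Obj`, `formalize`, and `solve` names and the constraint templates are invented here to show how a fuzzy spatial query can be mapped to a precise geometric predicate and then evaluated deterministically.

```python
# Hypothetical sketch of "formalizing constraints before deterministic
# computation"; names and templates are illustrative, not the paper's API.
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    x: float  # horizontal coordinate
    z: float  # depth from the camera

def formalize(query: str, a: Obj, b: Obj):
    """Stage 1: map a fuzzy instruction to a precise geometric predicate."""
    predicates = {
        "left of": lambda: a.x < b.x,
        "closer":  lambda: a.z < b.z,
    }
    for key, pred in predicates.items():
        if key in query:
            return pred
    raise ValueError(f"no constraint template for: {query}")

def solve(query: str, a: Obj, b: Obj) -> bool:
    """Stage 2: deterministic evaluation inside the formalized constraints."""
    return formalize(query, a, b)()

chair = Obj("chair", x=0.2, z=3.0)
table = Obj("table", x=0.7, z=1.5)
print(solve("is the chair left of the table?", chair, table))   # True
print(solve("is the chair closer than the table?", chair, table))  # False
```

Once a query is bound to a formal predicate, the answer follows from arithmetic rather than from the model's "fuzzy intuition," which is the verifiable bridge the article describes.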
New SOTA in Complex Spatial Reasoning with 55% Performance Gain: Sun Yat-sen University's SpatialDreamer
36Ke· 2025-12-22 10:12
Core Insights
- SpatialDreamer, developed by institutions including Sun Yat-sen University, significantly enhances performance on complex spatial tasks through active mental imagery and spatial reasoning [1][4]

Group 1: Model Development
- SpatialDreamer addresses the limitations of existing models on perspective-transformation tasks by simulating human-like active exploration and reasoning [1][4]
- The model transitions from passive observation to active goal-directed imagination, allowing it to autonomously decide what to observe and how to reason in a 3D environment [4]

Group 2: Methodology
- The closed-loop reasoning process of SpatialDreamer consists of three steps: exploration, imagination, and reasoning [4]
- GeoPO, a policy optimization method, combines tree sampling and geometric consistency constraints to enhance model performance and accelerate training convergence [4]

Group 3: Dataset and Learning
- The SpatialDreamer-SFT dataset includes single-pass reasoning and reflective reasoning data, promoting a "think-imagine-answer" learning pattern [6]

Group 4: Experimental Results
- SpatialDreamer achieved state-of-the-art (SOTA) accuracy of 93.9% and 92.5% on real and synthetic images in the SAT benchmark [7]
- It improved overall accuracy to 84.9% on the MindCube-Tiny benchmark, surpassing the baseline Qwen2.5-VL-7B by over 55% [7]
- On VSI-Bench, it led in tasks such as object counting and path planning with an average accuracy of 62.2% [7]
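The explore-imagine-reason loop described above can be sketched as a simple closed-loop controller. This is a toy illustration under stated assumptions, not SpatialDreamer's model: the three step functions are stand-ins (round-robin view selection, a string placeholder for generative imagination, and a fixed stopping rule) invented to show the control flow only.

```python
# Minimal, hypothetical sketch of a closed-loop explore-imagine-reason
# cycle; all three step functions are illustrative stand-ins.

def explore(state):
    """Pick the next viewpoint to examine (here: round-robin over views)."""
    return state["views"][state["step"] % len(state["views"])]

def imagine(view):
    """Stand-in for the generative 'mental imagery' step: predict what
    the scene would look like from the chosen viewpoint."""
    return f"imagined({view})"

def reason(history):
    """Decide whether enough imagined evidence has accumulated to answer."""
    return len(history) >= 2  # toy stopping rule

def closed_loop(views, max_steps=4):
    state = {"views": views, "step": 0}
    history = []
    for _ in range(max_steps):
        view = explore(state)          # 1. exploration
        history.append(imagine(view))  # 2. imagination
        state["step"] += 1
        if reason(history):            # 3. reasoning
            return history
    return history

print(closed_loop(["front", "left", "top"]))
# ['imagined(front)', 'imagined(left)']
```

The point of the loop structure is that the model itself decides what to observe next, rather than passively consuming a fixed sequence of inputs.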
Large Models Diagnosed with "Visual Illiteracy"! A Multi-University Team Proposes MILO to Give Them Spatial Imagination
量子位· 2025-12-04 09:55
Core Insights
- The article discusses the limitations of multi-modal large language models (MLLMs) in spatial reasoning, highlighting their inability to effectively understand and visualize spatial concepts, a phenomenon termed "visual illiteracy" [2][3]

Group 1: Challenges in Spatial Reasoning
- Spatial reasoning is identified as a core cognitive ability for humans to understand three-dimensional structures, and it poses a significant challenge for MLLMs in practical applications [2]
- Current methods rely primarily on "language description tuning," which fails to give models a true visual understanding of spatial concepts [2][3]

Group 2: Introduction of MILO
- A research team has proposed MILO (Implicit Spatial World Modeling) to address the spatial reasoning challenges faced by MLLMs by integrating visual generative feedback with symbolic reasoning [4]
- MILO employs a two-phase training process: first, visual generative tuning, in which the model learns spatial transformations through visual outputs; second, language tuning using spatial instruction data [5]

Group 3: Enhancements in Geometric Perception
- To further enhance geometric perception, the team introduced RePE (Relative Positional Encoding), which captures relative transformations between adjacent frames instead of relying on a global coordinate system, improving generalization and adaptability across datasets [8][9]

Group 4: GeoGen Dataset
- The team constructed the GeoGen dataset, comprising approximately 2,241 videos and 267,000 "observation-action-result" triplets, aimed at enhancing geometric perception generation [10]
- The dataset draws on diverse sources such as scanned 3D scenes and internet videos, ensuring a wide range of realistic scenarios [11]
Group 5: Validation of MILO
- The effectiveness of MILO was validated across multiple baseline models and five categories of spatial understanding tasks, achieving optimal performance on 3D scene understanding and spatial reasoning tasks [12][16]
- Notably, MILO improved accuracy by 3.2% on the ScanRefer task and achieved an average accuracy of 61.7% on the VSI-Bench spatial reasoning task, surpassing the baseline VG-LLM by 2.2% [16]
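The core idea behind RePE, encoding each frame's pose relative to the previous frame rather than in a global coordinate system, can be sketched in a few lines. This is a simplified 2D illustration under assumptions of my own (poses as `(x, y, heading)` tuples, a `relative_pose` helper); the actual encoding in MILO operates on learned features, not raw pose tuples.

```python
# Hedged sketch of relative (frame-to-frame) pose encoding, the idea
# behind RePE. 2D poses (x, y, heading) are used purely for illustration.
import math

def relative_pose(prev, curr):
    """Express curr's pose in prev's local frame, not world coordinates."""
    px, py, ph = prev
    cx, cy, ch = curr
    dx, dy = cx - px, cy - py
    # Rotate the world-frame delta into prev's heading frame.
    cos_h, sin_h = math.cos(-ph), math.sin(-ph)
    local_dx = dx * cos_h - dy * sin_h
    local_dy = dx * sin_h + dy * cos_h
    return (local_dx, local_dy, ch - ph)

def encode_sequence(poses):
    """Pairwise relative encodings between adjacent frames."""
    return [relative_pose(a, b) for a, b in zip(poses, poses[1:])]

poses = [(0, 0, 0), (1, 0, 0), (1, 1, math.pi / 2)]
for rel in encode_sequence(poses):
    print(tuple(round(v, 3) for v in rel))
```

Because each encoding depends only on the two adjacent frames, the representation is invariant to the choice of global origin, which is plausibly why the article credits RePE with better generalization across datasets.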
NeurIPS 2025 | SURDS Dataset and GRPO Comprehensively Strengthen Spatial Reasoning for Autonomous Driving
自动驾驶之心· 2025-09-27 23:33
Core Insights
- The article discusses the challenges of achieving accurate spatial reasoning in autonomous driving scenarios with Vision Language Models (VLMs), highlighting the lack of large-scale benchmarks in this area [2][20]
- A new benchmark, SURDS, has been introduced to systematically evaluate the spatial reasoning capabilities of VLMs, revealing significant shortcomings in current models [4][20]

Benchmark Overview
- SURDS is a large-scale benchmark built on the nuScenes dataset, consisting of 41,080 visual-question training instances and 9,250 evaluation samples, covering six spatial categories: direction recognition, pixel-level localization, depth estimation, distance comparison, left-right ordering, and front-back relationships [4][20]
- The dataset includes diverse multimodal information collected from urban environments in Boston and Singapore, ensuring realistic test scenarios [6][20]

Model Training and Evaluation
- The research emphasizes the importance of data generation and introduces a novel automated process for producing high-quality reasoning chains, which strengthens the model's spatial reasoning capabilities [8][10]
- A reinforcement learning framework combining spatial localization rewards with logical consistency objectives was designed, leading to significant performance improvements across tasks [11][20]

Experimental Results
- Evaluation shows notable differences among models on spatial reasoning tasks, with the proposed model achieving nearly 60% higher depth estimation accuracy than the second-best model [14][20]
- Most existing models struggle with single-object tasks, often performing close to random, indicating a need for better learning of absolute pose and metric information [16][20]
Training Strategy Insights
- Ablation studies indicate that combining localization and logical rewards significantly enhances model performance, underscoring the foundational role of localization ability in spatial reasoning [16][18]
- The research also finds that model parameter scale does not directly correlate with spatial understanding capability, suggesting that simply increasing model size is insufficient [16][20]
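The combination of a localization reward with a logical-consistency objective can be sketched as a weighted reward function. This is an assumption-laden illustration: the IoU-based localization term, the binary consistency term, and the `w_loc` weighting are invented here, not the values or reward shapes the paper uses.

```python
# Illustrative sketch of combining a spatial-localization reward with a
# logical-consistency reward; reward shapes and weights are assumptions.

def localization_reward(pred_box, gt_box):
    """IoU between predicted and ground-truth boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union else 0.0

def consistency_reward(answer, chain_conclusion):
    """1 if the final answer agrees with the reasoning chain's conclusion."""
    return 1.0 if answer == chain_conclusion else 0.0

def total_reward(pred_box, gt_box, answer, chain_conclusion, w_loc=0.5):
    return (w_loc * localization_reward(pred_box, gt_box)
            + (1 - w_loc) * consistency_reward(answer, chain_conclusion))

r = total_reward((0, 0, 2, 2), (1, 1, 3, 3), "left", "left")
print(round(r, 3))  # 0.571
```

Tying part of the reward to localization quality matches the ablation finding above: a policy that grounds objects correctly gets rewarded even before its final answers are consistent.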
AI Lab's Latest InternSpatial: A VLM Spatial Reasoning Dataset That Significantly Boosts Model Capability
具身智能之心· 2025-06-24 14:09
Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) on spatial reasoning tasks, highlighting the need for improved datasets and methodologies to boost performance across scenarios [3][12]

Dataset Limitations
- The existing InternSpatial dataset has three main limitations:
  1. Limited scene diversity, focusing primarily on indoor and outdoor environments and lacking contexts such as driving and embodied navigation [3]
  2. Restricted instruction formats, supporting only natural language or region masks, which do not cover the variety of queries found in real-world applications [3]
  3. Lack of multi-view supervision, with over 90% of the data focused on single-image reasoning, failing to model spatiotemporal relationships across views [3]

Evaluation Benchmark
- The InternSpatial-Bench evaluation benchmark includes 6,008 QA pairs across five tasks: position comparison, size comparison, rotation estimation, object counting, and existence estimation [7]
- The benchmark also introduces 1,000 additional QA pairs for multi-view rotation angle prediction [7]

Data Engine Design
- The data engine employs a three-stage automated pipeline:
  1. Annotation generation using existing annotations or SAM2 for mask generation [9]
  2. View alignment to construct a standard 3D coordinate system [9]
  3. Template-based QA generation with predefined task templates [9]

Experimental Results
- Spatial reasoning performance has improved, with InternVL-Spatial-8B showing a 1.8% increase in position comparison accuracy and a 17% increase in object counting accuracy over its predecessor [10]
- The model's performance across tasks shows significant gains, particularly on multi-view tasks [10]

Instruction Format Robustness
- Current models exhibit a 23% accuracy drop when using the <box> format, while training with InternSpatial reduces the gap between formats to within 5% [12]
- However, the automated QA generation struggles to replicate the complexity of natural language, indicating a need for further refinement [12].
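The third pipeline stage, template-based QA generation, can be sketched as follows. The templates, field names, and answer-derivation rules here are invented for illustration; the real data engine works over 3D-aligned scene annotations rather than the toy dictionaries shown.

```python
# Minimal, hypothetical sketch of template-based QA generation over scene
# annotations; templates and fields are illustrative, not the pipeline's.

TEMPLATES = {
    "position": "Is the {a} to the left of the {b}?",
    "count":    "How many {a}s are in the scene?",
}

def generate_qa(task, annotations):
    """Fill a predefined task template and derive the answer from annotations."""
    if task == "position":
        a, b = annotations["objects"][:2]
        question = TEMPLATES[task].format(a=a["name"], b=b["name"])
        return question, "yes" if a["x"] < b["x"] else "no"
    if task == "count":
        name = annotations["objects"][0]["name"]
        question = TEMPLATES[task].format(a=name)
        n = sum(o["name"] == name for o in annotations["objects"])
        return question, str(n)
    raise ValueError(f"no template for task: {task}")

scene = {"objects": [{"name": "chair", "x": 0.2},
                     {"name": "table", "x": 0.7},
                     {"name": "chair", "x": 0.9}]}
print(generate_qa("position", scene))  # ('Is the chair to the left of the table?', 'yes')
print(generate_qa("count", scene))     # ('How many chairs are in the scene?', '2')
```

A template-driven generator like this scales cheaply, but it also explains the limitation noted above: every question inherits the rigid phrasing of its template, so the output lacks the variety of natural language.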
Multimodal Models Take On the Beijing and Hangzhou Subway Maps! o3 Performs Notably but Still Trails Humans
量子位· 2025-06-07 05:02
Contributed by the ReasonMap team to 量子位 | WeChat official account QbitAI

In recent years, large language models (LLMs) and multimodal large models (MLLMs) have made breakthrough progress on a wide range of scene-understanding and complex reasoning tasks. Yet a key question remains worth asking: can MLLMs really "understand" an image? In particular, when facing structurally complex, detail-dense images, do they possess fine-grained visual understanding and spatial reasoning abilities, for example when challenged with a high-resolution subway map?

To answer this, a team from Westlake University, the National University of Singapore, Zhejiang University, and Huazhong University of Science and Technology proposed a new evaluation benchmark, ReasonMap. As it turns out, the Beijing and Hangzhou subway maps stumped a large share of models.

ReasonMap is the first multimodal reasoning benchmark focused on high-resolution transit maps (mainly subway maps), designed specifically to evaluate large models' ability to understand fine-grained, structured spatial information in images.

The results show that current mainstream open-source multimodal models hit a clear performance bottleneck on ReasonMap, especially on cross-line route planning, where they often exhibit visual confusion or miss stations. Closed-source reasoning models post-trained with reinforcement learning (such as GPT-o3) significantly outperform existing open-source models on multiple dimensions, but still fall clearly short of human performance.

Across subway maps from different countries and regions, four representative MLLMs (Qwen2.5-VL-72B-I (blue), I ...