SpatialDreamer
New SOTA in complex spatial reasoning with a 55% performance gain: SpatialDreamer, new work from Sun Yat-sen University
36Kr · 2025-12-22 10:12
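[Overview] Sun Yat-sen University and partner institutions have introduced SpatialDreamer, which significantly improves performance on complex spatial tasks through active mental imagery and spatial reasoning. By simulating the human process of active exploration, imagination, and reasoning, it addresses the limitations of existing models on tasks such as viewpoint change and opens a new path for spatial intelligence in AI.

Paper link: https://arxiv.org/pdf/2512.07733

Although multimodal large language models (MLLMs) have made significant progress in scene understanding, their performance remains limited on complex spatial reasoning tasks that require mental simulation. Existing methods mostly rely on passive observation of spatial data and lack the ability, characteristic of human spatial cognition, to actively imagine and dynamically update internal representations. For example, in tasks that require a change of viewpoint to locate an occluded object, existing models often fail because they reason from a single viewpoint.

To address this, a research team from MBZUAI and Sun Yat-sen University proposed SpatialDreamer, a reinforcement-learning-based framework that aims to give MLLMs human-like spatial mental simulation through a closed loop of active exploration, visual imagination, and evidence fusion.

SpatialDreamer simulates the human spatial cognition process and builds a closed-loop reasoning pipeline with the following three steps:
1) Explore: the model infers the optimal egocentric action from the current scene (e.g., "move forward 0.75 m" or "turn left 45 degrees");
2) Imagine: it invokes a world model (e.g., S ...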
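To make the explore, imagine, reason loop concrete, here is a minimal Python sketch. It is not the authors' implementation: `spatial_loop`, `Action`, `View`, and the three `toy_*` callables are hypothetical names introduced for illustration, images are replaced by strings, and the stopping rule (the policy returns `None`, or a step budget runs out) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Action:
    """An egocentric move, e.g. Action("forward", 0.75) or Action("turn_left", 45.0)."""
    kind: str
    amount: float  # metres for translations, degrees for rotations


View = str  # stand-in for an image; a real system would pass pixel data


def spatial_loop(
    question: str,
    first_view: View,
    explore: Callable[[str, List[View]], Optional[Action]],
    imagine: Callable[[View, Action], View],
    reason: Callable[[str, List[View]], str],
    max_steps: int = 4,
) -> str:
    """Alternate explore and imagine until the policy stops, then reason over all views."""
    evidence: List[View] = [first_view]
    for _ in range(max_steps):
        action = explore(question, evidence)  # 1) choose an egocentric action
        if action is None:  # assumed stop signal: the policy has seen enough
            break
        evidence.append(imagine(evidence[-1], action))  # 2) world model renders a new view
    return reason(question, evidence)  # 3) fuse the accumulated evidence


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_explore(question: str, evidence: List[View]) -> Optional[Action]:
    plan = [Action("forward", 0.75), Action("turn_left", 45.0)]
    taken = len(evidence) - 1
    return plan[taken] if taken < len(plan) else None


def toy_imagine(view: View, action: Action) -> View:
    return f"{view}>{action.kind}:{action.amount}"


def toy_reason(question: str, evidence: List[View]) -> str:
    return f"answered from {len(evidence)} views"


result = spatial_loop("Where is the chair?", "view0", toy_explore, toy_imagine, toy_reason)
print(result)
```

Passing the three stages in as callables keeps the loop independent of any particular MLLM or world model; in a real system `explore` and `reason` would presumably both be calls to the same MLLM.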
New SOTA in complex spatial reasoning with a 55% performance gain! SpatialDreamer, new work from Sun Yat-sen University
具身智能之心· 2025-12-22 01:22
Core Insights
- The article introduces SpatialDreamer, a framework developed by researchers from Sun Yat-sen University and MBZUAI that improves performance on complex spatial tasks through active mental imagery and spatial reasoning [1][4].

Group 1: Limitations of Current Models
- Despite significant advances in multimodal large language models (MLLMs) for scene understanding, their performance remains limited on complex spatial reasoning tasks that require mental simulation [2].
- Existing methods rely primarily on passive observation of spatial data and lack the human ability to actively imagine and dynamically update internal representations [3].

Group 2: SpatialDreamer Framework
- SpatialDreamer simulates human spatial cognition through a closed-loop reasoning process with three steps: exploration, imagination, and reasoning [6].
- Exploration: the model infers the optimal egocentric action from the current scene, such as "move forward 0.75 meters" or "turn left 45 degrees" [6].
- Imagination: a world model generates the image seen from the new viewpoint after the action is executed [6].
- Reasoning: the model integrates all accumulated visual evidence to produce a final answer [6].

Group 3: GeoPO Policy Optimization
- To address sparse rewards in long-sequence reasoning tasks, the research team introduced GeoPO, a policy optimization method that combines tree-structured sampling with geometric consistency constraints [8].
- Tree-structured sampling allows multiple action branches at each step, supporting backtracking and multi-path exploration [8].
- A multi-level reward design merges task-level and step-level rewards to provide fine-grained feedback [8].
- A geometric penalty mechanism penalizes redundant or conflicting actions, encouraging efficient trajectories [8].
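The summary gives no formulas for GeoPO, so the following is only a sketch of how a task-level reward, step-level rewards, and a geometric penalty could be combined into one trajectory score. Everything in it is an assumption made for illustration: the names `trajectory_reward` and `count_conflicts`, the weights `step_weight` and `penalty`, and the definition of a "conflicting" pair as two consecutive actions that exactly undo each other. Tree-structured sampling is not shown.

```python
from typing import List, Tuple

Action = Tuple[str, float]  # (kind, amount), e.g. ("forward", 0.75)

# Assumed notion of conflict: a move immediately undone by its opposite.
OPPOSITE = {
    "turn_left": "turn_right",
    "turn_right": "turn_left",
    "forward": "backward",
    "backward": "forward",
}


def count_conflicts(actions: List[Action]) -> int:
    """Count consecutive pairs that cancel out, e.g. left 45 followed by right 45."""
    conflicts = 0
    for (kind_a, amount_a), (kind_b, amount_b) in zip(actions, actions[1:]):
        if OPPOSITE.get(kind_a) == kind_b and amount_a == amount_b:
            conflicts += 1
    return conflicts


def trajectory_reward(
    task_correct: bool,
    step_rewards: List[float],
    actions: List[Action],
    step_weight: float = 0.5,  # hypothetical weight on the step-level term
    penalty: float = 0.2,      # hypothetical cost per conflicting pair
) -> float:
    """Task-level reward plus mean step-level reward minus a geometric penalty."""
    task = 1.0 if task_correct else 0.0
    step = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return task + step_weight * step - penalty * count_conflicts(actions)


efficient = [("forward", 0.75), ("turn_left", 45.0)]
wasteful = [("turn_left", 45.0), ("turn_right", 45.0), ("forward", 0.75)]

r_efficient = trajectory_reward(True, [1.0, 1.0], efficient)
r_wasteful = trajectory_reward(True, [1.0, 0.0, 1.0], wasteful)
print(r_efficient, r_wasteful)
```

Both toy trajectories answer correctly, but the one that turns left and immediately back right scores lower, which is the kind of pressure toward efficient trajectories that the penalty is described as providing.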

Group 4: Performance Validation
- SpatialDreamer was validated on multiple spatial reasoning benchmarks and achieved state-of-the-art (SOTA) results, with average accuracies of 93.9% and 92.5% on real and synthetic images, respectively, on the SAT benchmark [13].
- On the MindCube-Tiny benchmark, it achieved an overall accuracy of 84.9%, surpassing the baseline Qwen2.5-VL-7B by over 55% [13].
- On VSI-Bench, it led on tasks such as object counting, relative direction, and path planning, with an average accuracy of 62.2% [13].

Group 5: Significance of SpatialDreamer
- The significance of SpatialDreamer lies not only in improving spatial reasoning accuracy but also in demonstrating that MLLMs can strengthen their reasoning through "imagination," a significant step toward human-like spatial intelligence [14].