Visual Spatial Reasoning
Tencent, Shanghai Jiao Tong University, and Other Institutions Jointly Release a Survey on Visual Spatial Reasoning
具身智能之心· 2025-10-15 11:03
Core Insights
- The article discusses the current state of visual spatial reasoning tasks and the importance of Vision Language Models (VLMs) in applications such as autonomous driving and embodied intelligence [2][3].
- It highlights the need for a comprehensive evaluation of VLMs' spatial reasoning capabilities through improved methodologies and task settings [3].

Group 1: Current State of Visual Spatial Reasoning
- The spatial reasoning capabilities of VLMs have gained significant attention, with research focusing on model structure improvements, training process optimization, and reasoning strategies [2].
- Existing benchmarks often fail to provide a comprehensive assessment of spatial reasoning tasks, necessitating a systematic review of methods and task settings [3].

Group 2: Contributions of the Article
- The article categorizes existing improvements in visual spatial reasoning into four areas: input modalities, model structure, training strategies, and reasoning methods [6].
- It introduces a new benchmarking tool, SIBench, which consolidates 18 open-source benchmarks and covers three levels of tasks and various input forms [22][23].

Group 3: Task Classification
- Tasks are classified into three levels: Basic Perception, Spatial Understanding, and Planning, each with its own characteristics and requirements [12][15].
- Basic Perception involves attributes of single targets, while Spatial Understanding deals with relationships between multiple targets and their environments [18][20].
- Planning requires understanding spatial constraints and task demands to produce satisfactory solutions [21].

Group 4: Findings from SIBench
- Evaluating mainstream VLMs with SIBench revealed significant deficiencies in four areas, most notably in basic perception, which underpins all subsequent reasoning [27].
- Quantitative reasoning abilities lag behind qualitative ones, indicating a need for improvement in tasks such as counting and distance estimation [27].
- The models showed weak performance in processing dynamic information, especially with multi-view or video inputs [27].
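The three-level task taxonomy described under Group 3 can be sketched as a simple data structure. This is a hypothetical illustration only: the example task names and the `tasks_at` helper are my own, not labels taken from the survey or from SIBench.

```python
from enum import Enum

class TaskLevel(Enum):
    BASIC_PERCEPTION = 1       # attributes of a single target (e.g., size, color)
    SPATIAL_UNDERSTANDING = 2  # relations among multiple targets and the environment
    PLANNING = 3               # solutions under spatial constraints and task demands

# Hypothetical examples of how benchmark tasks might map onto the three levels.
EXAMPLE_TASKS = {
    "object counting": TaskLevel.BASIC_PERCEPTION,
    "relative distance estimation": TaskLevel.SPATIAL_UNDERSTANDING,
    "navigation route selection": TaskLevel.PLANNING,
}

def tasks_at(level: TaskLevel) -> list[str]:
    """Return the example tasks assigned to a given level."""
    return [name for name, lvl in EXAMPLE_TASKS.items() if lvl == level]
```

A mapping like this makes it easy to report per-level scores separately, which matches the article's point that basic perception must be measured on its own because later levels build on it.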