Core Insights
- The article discusses the limitations of current Vision-Language Models (VLMs) in spatial reasoning tasks, highlighting the need for improved datasets and methodologies to raise performance across diverse scenarios [3][12].

Dataset Limitations
- Existing spatial reasoning datasets suffer from three main limitations, which InternSpatial is designed to address [3]:
1. Limited scene diversity: coverage is confined mostly to indoor and outdoor environments, lacking contexts such as driving and embodied navigation.
2. Restricted instruction formats: only natural language or region masks are supported, which does not cover the variety of query formats found in real-world applications.
3. Lack of multi-view supervision: over 90% of the data targets single-image reasoning and fails to model spatiotemporal relationships across views.

Evaluation Benchmark
- The InternSpatial-Bench evaluation benchmark includes 6,008 QA pairs across five tasks: position comparison, size comparison, rotation estimation, object counting, and existence estimation [7]. (A minimal scoring sketch appears after this summary.)
- The benchmark also introduces 1,000 additional QA pairs for multi-view rotation angle prediction [7].

Data Engine Design
- The data engine employs a three-stage automated pipeline (see the per-stage sketches below) [9]:
1. Annotation generation, reusing existing annotations or generating masks with SAM2.
2. View alignment, constructing a standard 3D coordinate system shared across views.
3. Template-based QA generation from predefined task templates.

Experimental Results
- Spatial reasoning performance has improved: InternVL-Spatial-8B shows a 1.8% increase in position comparison accuracy and a 17% increase in object counting accuracy compared to its predecessor [10].
- The model's gains hold across tasks and are especially pronounced on multi-view tasks [10].

Instruction Format Robustness
- Current models exhibit a 23% accuracy drop when queries are posed in formats other than plain natural language, which motivates training on format-diverse instructions.
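To make the benchmark protocol concrete, below is a minimal sketch of a per-task accuracy evaluation over QA pairs. The JSON field names (`task`, `image`, `question`, `answer`) and the `model.answer` callable are hypothetical; the article does not describe InternSpatial-Bench's actual schema or scoring rules.

```python
import json
from collections import defaultdict

def evaluate(model, qa_path: str) -> dict[str, float]:
    """Score a VLM on benchmark QA pairs, reporting per-task accuracy.

    Assumes one JSON record per line, e.g.:
      {"task": "object_counting", "image": "...", "question": "...", "answer": "..."}
    (hypothetical schema -- the real InternSpatial-Bench format may differ).
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(qa_path) as f:
        for line in f:
            item = json.loads(line)
            pred = model.answer(item["image"], item["question"])  # hypothetical API
            total[item["task"]] += 1
            # Exact-match scoring after light normalization; real benchmarks
            # often use task-specific matching (e.g., numeric tolerance).
            if pred.strip().lower() == item["answer"].strip().lower():
                correct[item["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```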
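Stage 1 of the pipeline falls back to SAM2 when source data lacks masks. Below is a minimal sketch of prompting SAM2 for an object mask from an existing bounding-box annotation, written against the `sam2` package's image-predictor interface as documented in its repository; the checkpoint and config paths are placeholders, and the exact setup InternSpatial uses is not specified in the article.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths; use the config/checkpoint pair shipped with the sam2 repo.
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
)

image = np.array(Image.open("scene.jpg").convert("RGB"))
box = np.array([40, 60, 320, 400])  # xyxy box from an existing detection annotation

with torch.inference_mode():
    predictor.set_image(image)
    # SAM2 returns candidate masks with predicted quality scores; keep the best.
    masks, scores, _ = predictor.predict(box=box, multimask_output=True)
best_mask = masks[int(np.argmax(scores))]
```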
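Stage 2's view alignment can be pictured as mapping per-view camera-frame points into one shared world frame via each view's extrinsics, so that spatial relations stay comparable across views. A numpy sketch under the standard convention that extrinsics map world to camera coordinates as x_cam = R·x_world + t (the article does not specify InternSpatial's exact convention):

```python
import numpy as np

def camera_to_world(points_cam: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map Nx3 camera-frame points into the shared world frame.

    Assumed convention: x_cam = R @ x_world + t, so the inverse is
    x_world = R^T @ (x_cam - t); written row-wise below.
    """
    return (points_cam - t) @ R  # row-wise equivalent of R.T @ (x - t)

# Example: the same world point seen from two views lands at identical
# world coordinates once both views are aligned.
R1, t1 = np.eye(3), np.zeros(3)
theta = np.pi / 2
R2 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
t2 = np.array([0.5, -0.2, 0.0])

x_world = np.array([[1.0, 2.0, 3.0]])
x_cam1 = x_world @ R1.T + t1
x_cam2 = x_world @ R2.T + t2
assert np.allclose(camera_to_world(x_cam1, R1, t1),
                   camera_to_world(x_cam2, R2, t2))
```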
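Stage 3 instantiates predefined task templates with the aligned annotations. A toy sketch of template-based QA generation follows; the template strings and annotation fields are invented for illustration and are not InternSpatial's actual templates.

```python
# Hypothetical templates keyed by task; the real predefined templates are
# not published in the article.
TEMPLATES = {
    "position_comparison": "Is the {a} to the left of the {b}?",
    "size_comparison": "Which is larger, the {a} or the {b}?",
    "object_counting": "How many {a}s are visible in the image?",
}

def generate_qa(task: str, annotations: dict) -> dict:
    """Fill a task template with object annotations to produce one QA pair."""
    question = TEMPLATES[task].format(**annotations)
    return {"task": task, "question": question, "answer": annotations["answer"]}

qa = generate_qa(
    "position_comparison",
    {"a": "red chair", "b": "wooden table", "answer": "yes"},
)
print(qa["question"])  # Is the red chair to the left of the wooden table?
```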
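The robustness finding suggests probing a model with the same question rendered in several instruction formats. A toy sketch of such format variation; both formats and the placeholder region tokens are invented for illustration, since the article only notes that natural language and region masks alone are too narrow.

```python
def render_query(obj_a: str, obj_b: str, fmt: str) -> str:
    """Render one spatial question in different instruction formats."""
    if fmt == "natural_language":
        return f"Is the {obj_a} closer to the camera than the {obj_b}?"
    if fmt == "region_reference":
        # Placeholder tokens standing in for mask/region-based references.
        return "Is the object in <region1> closer to the camera than the one in <region2>?"
    raise ValueError(f"unknown format: {fmt}")

for fmt in ("natural_language", "region_reference"):
    print(render_query("red chair", "wooden table", fmt))
```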
AI Lab's latest InternSpatial: a spatial reasoning dataset for VLMs that significantly improves model capability
具身智能之心 · 2025-06-24 14:09