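Large models acquire human-like spatial thinking! A three-stage training framework teaches them to "think while drawing," with an average improvement of 18.4% across 5 benchmarks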

Core Insights
- The article discusses the development of ViLaSR-7B, a model that strengthens spatial reasoning in large vision-language models (LVLMs) through a novel "Drawing to Reason in Space" paradigm and achieves significant improvements across a range of spatial reasoning tasks [1][17][33].

Group 1: Model Performance
- ViLaSR-7B achieved an average improvement of 18.4% across five major spatial reasoning benchmarks, covering tasks such as maze navigation and video spatial reasoning [3][25].
- The model reached 45.4% accuracy on VSI-Bench, outperforming Qwen2.5-VL-7B by 12.7% [26].

Group 2: Training Framework
- The model employs a three-stage training framework:
  1. Cold-start training establishes basic visual operation capabilities [22].
  2. Reflective rejection sampling strengthens self-correction and reflection abilities [23].
  3. Reinforcement learning optimizes overall reasoning capability and the efficiency of drawing operations [24].

Group 3: Reasoning Paradigms
- The article highlights a shift from the traditional "visual-to-text" reasoning paradigm to the "Thinking with Images" paradigm, which allows models to actively manipulate images during reasoning [10][15].
- The new paradigm addresses limitations of the traditional approach, such as the loss of critical details and temporal information during visual encoding [11][16].

Group 4: Human-like Reasoning Strategies
- ViLaSR-7B demonstrates human-like spatial reasoning strategies, such as reference-based measurement reasoning and systematic cross-frame object tracking [30][32].
- The model's ability to identify and use reference objects for accurate measurement reflects a mature reasoning process similar to human problem-solving [31].