V-Thinker: Letting Models "Think While Drawing" Like Humans
机器之心 · 2025-12-25 01:20

Core Insights
- The article introduces V-Thinker, a multimodal reasoning framework that enhances visual interactive reasoning by enabling models to generate code and interact with images during the reasoning process [3][19][40].

Group 1: Framework and Methodology
- V-Thinker combines cold-start supervised fine-tuning with reinforcement learning so that models autonomously generate code and interact with images, realizing a "think while drawing" visual reasoning paradigm (a minimal sketch of this loop appears after the summary) [3][21].
- The framework includes a data evolution mechanism, the Data Evolution Flywheel, which synthesizes and validates visual interactive reasoning data along the dimensions of diversity, quality, and difficulty (see the second sketch after the summary) [3][12].
- A progressive training paradigm first strengthens visual perception using the V-Perception-40K dataset, then applies a two-stage regimen that integrates supervised fine-tuning and reinforcement learning [15][18].

Group 2: Data and Evaluation
- The V-Interaction-400K dataset is constructed to support visual interactive reasoning and image-to-code conversion tasks, providing a foundational resource for the framework [3][13].
- VTBench is developed as an evaluation benchmark dedicated to visual interactive reasoning, focusing on tasks that require interacting with images, such as adding auxiliary lines or marking key regions [19][20].
- The evaluation design includes three task types covering the full pipeline from basic perception to interactive reasoning, ensuring the assessment reflects a model's visual interactive reasoning capability [23].

Group 3: Experimental Results
- V-Thinker shows significant improvements on interactive reasoning tasks, outperforming baseline models by more than 12% in average accuracy and excelling in instruction-guided interaction scenarios, where gains exceed 22% [24].
- The model demonstrates enhanced visual interaction capability and generalization in common reasoning scenarios, achieving a 6% gain on complex multi-step reasoning tasks [25][26].
- The model's ability to generate diverse interaction paths during the reinforcement learning phase indicates stronger strategy diversity and improved interpretability of the interactive reasoning process [29][31].

Group 4: Future Directions
- The article emphasizes V-Thinker's potential to advance the "Thinking with Images" direction, showcasing the model's ability to autonomously generate and execute code while interacting with images [40].
- It suggests that as model capabilities continue to improve, new reasoning paradigms and application scenarios may emerge, including the possibility of models creating knowledge [40].
- The authors acknowledge that there is still room for improvement in perception and interaction capabilities, indicating that future work may incorporate perturbations at different resolutions [40].
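
To make the "think while drawing" loop from Group 1 concrete, below is a minimal, hypothetical Python sketch of one interaction episode: the model emits a snippet of drawing code (for example, adding an auxiliary line), a sandbox executes it against the current image, and the edited image becomes the visual context for the next step. Everything here (`propose_step`, `MAX_TURNS`, the `ANSWER:` convention) is an illustrative assumption based on the article's description, not V-Thinker's actual interface.

```python
# Hypothetical sketch of a "think while drawing" episode, assuming a
# model-emits-code / sandbox-executes / image-feeds-back loop. The stand-in
# policy below hard-codes one drawing action so the example is runnable.
from PIL import Image, ImageDraw

MAX_TURNS = 4  # cap on interaction rounds (assumption)

def propose_step(image: Image.Image, question: str, turn: int) -> str:
    """Stand-in for the model: returns either a snippet of drawing code
    or a final answer. A real model would generate this text itself."""
    if turn == 0:
        # e.g. the model decides to add an auxiliary line before answering
        return "draw.line([(0, 0), (width, height)], fill='red', width=3)"
    return "ANSWER: the diagonal crosses the marked region"

def run_episode(image: Image.Image, question: str) -> str:
    for turn in range(MAX_TURNS):
        step = propose_step(image, question, turn)
        if step.startswith("ANSWER:"):
            return step  # the model stops interacting and answers
        # Execute the model-emitted drawing code on the current image; the
        # edited image then serves as the visual context for the next turn.
        draw = ImageDraw.Draw(image)
        width, height = image.size
        exec(step, {"draw": draw, "width": width, "height": height})
    return "ANSWER: (no answer within the turn budget)"

canvas = Image.new("RGB", (256, 256), "white")
print(run_episode(canvas, "Does the diagonal cross the marked region?"))
```

In the actual framework, the snippet would come from the model's own generation and the executed result would be re-encoded as visual input; the stub here keeps the loop self-contained.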
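
The Data Evolution Flywheel is described only at the level of its three axes (diversity, quality, difficulty), so the following is a speculative sketch of what one synthesize-validate-filter round might look like. The functions `synthesize` and `validate`, the `difficulty` field, and the thresholds are all invented for illustration; diversity scoring is omitted for brevity.

```python
# Speculative sketch of a flywheel-style data round: synthesize candidates
# from the current pool, gate them on quality and difficulty, and feed
# survivors back into the pool for the next round.
import random

def synthesize(pool: list[dict]) -> dict:
    """Mutate an existing item into a new candidate (illustrative only)."""
    base = random.choice(pool)
    return {
        "question": base["question"] + " (harder variant)",
        "program": base["program"],
        "difficulty": base["difficulty"] + random.choice([0, 1]),
    }

def validate(item: dict) -> bool:
    """Quality gate: check that the paired image program still executes.
    A real pipeline would also verify answers and visual consistency."""
    try:
        exec(item["program"], {})
        return True
    except Exception:
        return False

def flywheel(seed_pool, rounds=3, per_round=10, min_difficulty=1):
    pool = list(seed_pool)
    for _ in range(rounds):
        candidates = [synthesize(pool) for _ in range(per_round)]
        # Keep candidates that pass the quality gate and meet the difficulty
        # floor; survivors seed the next round (the "flywheel" effect).
        pool += [c for c in candidates
                 if validate(c) and c["difficulty"] >= min_difficulty]
    return pool

seeds = [{"question": "How many line segments intersect?",
          "program": "pass", "difficulty": 1}]
print(f"pool grew to {len(flywheel(seeds))} items")
```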