Vision-Zero: Zero-Data VLM Self-Evolution! Yiran Chen's Team Proposes a New Zero-Supervision Training Paradigm

Core Insights
- The article discusses Vision-Zero, a self-play framework for vision-language models (VLMs) that aims to overcome the limitations of traditional training methods, which rely heavily on human-annotated data and reinforcement learning rewards [6][7][26].

Background
- VLMs perform impressively on multimodal tasks, but they face two challenges: data scarcity caused by high annotation costs, and a knowledge ceiling that limits model capabilities [6].
- Vision-Zero introduces a self-play strategy that lets VLMs generate complex reasoning data autonomously, eliminating the need for manual annotation [6].

Framework Characteristics
- Vision-Zero builds its self-play on social reasoning games, so agents generate high-complexity reasoning data as they play [6].
- It accepts any form of image as input, which improves the model's ability to generalize across domains [6].
- It incorporates an iterative self-play policy optimization algorithm that addresses the performance bottlenecks common in traditional self-play methods [7].

Game Design
- Inspired by social reasoning games, Vision-Zero defines a set of rules under which agents must deduce hidden roles from subtle differences between images, which fosters complex reasoning chains [12][15].
- A game requires only two images with slight differences, making data construction simple and cost-effective [17].

Training Methodology
- The framework uses a dual-phase alternating training approach to avoid local equilibrium and knowledge saturation, improving the model's ability to explore new reasoning paths [20].
- This approach has been shown to significantly outperform single-phase training across a range of tasks [20].

Experimental Results
- Vision-Zero shows strong task generalization, outperforming state-of-the-art methods that require annotated data on multiple benchmark datasets [22].
- Models trained with Vision-Zero effectively mitigate the negative transfer commonly seen in VLMs, maintaining performance across different tasks [24].

Implications
- Vision-Zero illustrates the feasibility and potential of self-play in moving from single-task to general-task applications, free of the constraints of manual annotation and knowledge limitations [26].
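
The game setup described under Game Design can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the names `Player`, `GameRound`, `build_round`, and `score_votes`, the default of four players, and the rule that exactly one player receives the altered image are all assumptions made for the example. The article itself states only that a round needs two images with slight differences and that agents must deduce hidden roles.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Player:
    player_id: int
    image: str         # identifier of the image this player is shown
    is_odd_one: bool   # hidden role (assumed): holds the altered image


@dataclass
class GameRound:
    players: List[Player]

    def odd_player_id(self) -> int:
        """Return the id of the single player holding the altered image."""
        return next(p.player_id for p in self.players if p.is_odd_one)


def build_round(original: str, altered: str, num_players: int = 4,
                rng: Optional[random.Random] = None) -> GameRound:
    """Build one round from an image pair; no human labels are involved.

    Assumption: exactly one randomly chosen player sees `altered`,
    and every other player sees `original`.
    """
    if num_players < 2:
        raise ValueError("a round needs at least two players")
    rng = rng or random.Random()
    odd_index = rng.randrange(num_players)
    players = [
        Player(i, altered if i == odd_index else original, i == odd_index)
        for i in range(num_players)
    ]
    return GameRound(players)


def score_votes(game: GameRound, votes: List[int]) -> float:
    """Hypothetical scoring helper: fraction of votes naming the odd player."""
    if not votes:
        return 0.0
    target = game.odd_player_id()
    return sum(1 for v in votes if v == target) / len(votes)


if __name__ == "__main__":
    game = build_round("scene.png", "scene_edited.png", rng=random.Random(0))
    votes = [game.odd_player_id()] * 3 + [(game.odd_player_id() + 1) % 4]
    print(score_votes(game, votes))
```

Because the hidden role is fixed when the round is built, whether a vote is correct can be checked automatically. Under these assumptions, that is one way an image pair alone could supply a training signal without manual annotation.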