Core Insights

- The article discusses the challenges of enhancing the reasoning capabilities of Vision-Language Models (VLMs), whose training typically relies on expensive labeled data or heuristic rewards, making it difficult to scale [2][7].
- A new framework called VisPlay is introduced, which allows VLMs to evolve and improve their capabilities using vast amounts of unlabeled image data through a self-evolving reinforcement learning approach [3][9].

Summary by Sections

Vision-Language Model Challenges

- VLMs have made significant progress on perception tasks but struggle with complex visual reasoning due to their dependence on high-quality labeled data [7].
- Traditional methods such as supervised fine-tuning and reinforcement learning face bottlenecks, as the cost and speed of manual labeling cannot keep up with growing model demands [7].

VisPlay Framework

- VisPlay is a self-evolving framework that decomposes a base VLM into two interacting roles, the Questioner and the Reasoner, enabling self-improvement through iterative evolution [3][10].
- The Questioner generates challenging yet answerable visual questions, guided by a reward mechanism that balances question difficulty against answer quality [11][12].
- The Reasoner produces "Silver Responses" to the images and questions, using answer accuracy as its training signal [13].

Experimental Results

- VisPlay has been applied to mainstream VLMs such as Qwen2.5-VL and MiMo-VL, demonstrating consistent performance improvements across benchmarks covering general visual understanding and cross-modal reasoning [5][16].
- The results show significant accuracy gains, with VisPlay scoring higher than the base models in multiple categories, indicating its effectiveness and generalizability [17].
- VisPlay improves the model's robustness on unseen, compositionally complex reasoning tasks and effectively reduces "hallucinations," a common issue in VLMs [18].
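The Questioner/Reasoner loop described above can be sketched as a short self-play step. This is a minimal toy sketch, not the paper's implementation: the majority-vote "silver label," the agreement-based reward shape, and all function names (`questioner_reward`, `self_play_step`, the stub models) are illustrative assumptions standing in for the actual learned reward and RL updates.

```python
import random
from collections import Counter

def questioner_reward(responses):
    """Toy proxy for 'challenging yet answerable': reward is highest when the
    Reasoner's sampled answers only partially agree. Full agreement (too easy)
    or no majority (likely unanswerable) both score low."""
    counts = Counter(responses)
    p = counts.most_common(1)[0][1] / len(responses)  # majority fraction
    return 1.0 - abs(2.0 * p - 1.0)                   # peaks at p = 0.5

def silver_label(responses):
    """Majority-vote pseudo-label playing the role of a 'Silver Response'."""
    return Counter(responses).most_common(1)[0][0]

def reasoner_reward(answer, responses):
    """Accuracy against the silver label serves as the Reasoner's signal."""
    return float(answer == silver_label(responses))

def self_play_step(image, questioner, reasoner, n_samples=8):
    """One iteration: Questioner asks about an unlabeled image, the Reasoner
    samples candidate answers, and both roles receive their rewards (which a
    real system would feed into RL policy updates)."""
    question = questioner(image)
    responses = [reasoner(image, question) for _ in range(n_samples)]
    q_reward = questioner_reward(responses)
    r_rewards = [reasoner_reward(a, responses) for a in responses]
    return question, responses, q_reward, r_rewards

# Toy usage with stub "models" in place of the VLM roles:
ask = lambda img: f"What object dominates {img}?"
answer = lambda img, q: random.choice(["cat", "dog"])
print(self_play_step("image_001.jpg", ask, answer))
```

The shape of `questioner_reward` encodes the article's stated balance: questions that are trivially answered or entirely unanswerable yield no learning signal, so the Questioner is pushed toward the frontier of the Reasoner's current ability.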
Conclusion

- The success of VisPlay illustrates the feasibility of improving VLM reasoning capabilities solely through vast amounts of unstructured images, paving the way for more intelligent and autonomous multimodal systems [19].
No annotated images needed: VLMs can "self-evolve"! The self-evolving RL framework VisPlay tackles hard visual reasoning problems
机器之心·2025-12-01 04:06