VLMs Can "Self-Evolve" Too! The RL Self-Evolution Framework VisPlay Tackles Hard Visual Reasoning Problems
具身智能之心 · 2025-12-02 09:30
Core Insights

- The article introduces VisPlay, a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that enables self-improvement from vast amounts of unlabeled image data [2][3][18]

Group 1: Challenges in VLMs

- VLMs have made significant progress on perception tasks but struggle with complex visual reasoning because they depend on high-quality labeled data [5]
- Traditional methods such as supervised fine-tuning and reinforcement learning hit a bottleneck: the cost and pace of manual labeling cannot keep up with the demands of evolving models [5][4]

Group 2: VisPlay Framework

- VisPlay addresses these challenges with a self-evolution mechanism that lets models learn autonomously from unlabeled images [7][8]
- The framework splits the VLM into two roles: a "Questioner," which generates challenging visual questions, and a "Reasoner," which answers them based on the images and questions (see the role-split sketch after this summary) [10][12]

Group 3: Reward Mechanism

- VisPlay employs a composite reward, combining a Difficulty Reward and a Diversity Reward, to raise the quality of the generated questions and answers (a plausible instantiation is sketched below) [10][11]
- This design mitigates failure modes common in self-evolving models, such as low answer quality and high question redundancy, and yields significant capability gains [11]

Group 4: Experimental Results

- VisPlay has been tested on mainstream VLMs such as Qwen2.5-VL and MiMo-VL across eight benchmark datasets, showing consistent and significant accuracy gains [15][17]
- The framework generalizes well, particularly to unseen combinations of complex reasoning skills, and noticeably reduces "hallucinations" in VLMs [17][18]
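To make the Questioner/Reasoner split concrete, here is a minimal sketch of one self-play step. The prompts, the `generate` stub, and every name in it are hypothetical placeholders standing in for real VLM inference (e.g., a Qwen2.5-VL call); the article does not specify VisPlay's actual API.

```python
# Minimal sketch of VisPlay's Questioner/Reasoner role split: a single VLM
# checkpoint plays both roles via different prompts. All names, prompts, and
# the `generate` stub are assumptions, not the paper's API.

QUESTIONER_PROMPT = (
    "Look at this image and pose one challenging visual reasoning question "
    "that can be answered from the image alone."
)
REASONER_PROMPT = "Answer this question about the image: {question}"

def generate(model, image, prompt: str) -> str:
    """Stand-in for real VLM inference; returns a canned string so the
    sketch runs end to end."""
    return f"<model output for: {prompt[:40]}...>"

def self_play_step(model, image):
    # Role 1: the Questioner invents a hard question from an unlabeled
    # image -- no human annotation enters the loop.
    question = generate(model, image, QUESTIONER_PROMPT)
    # Role 2: the Reasoner answers, conditioned on the same image plus the
    # generated question; its traces become the RL training signal.
    answer = generate(model, image, REASONER_PROMPT.format(question=question))
    return question, answer

print(self_play_step(model=None, image=None))
```

Because both roles share one set of weights, improvements to the Reasoner immediately raise the bar for what counts as a "challenging" question, which is what drives the self-evolution loop the article describes.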
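The article names a Difficulty Reward and a Diversity Reward but gives no formulas, so the following is one plausible instantiation under stated assumptions: difficulty is shaped around the Reasoner's empirical pass rate (peaking near 50%, a common choice in self-play work), and diversity penalizes embedding similarity to previously generated questions. The `alpha` weighting and all function names are assumptions.

```python
import numpy as np

def difficulty_reward(pass_rate: float) -> float:
    """Hypothetical difficulty shaping (not the paper's exact formula):
    highest when the Reasoner answers correctly about half the time, so
    questions are neither trivial (pass_rate ~ 1) nor hopeless (~ 0).
    On unlabeled data, pass_rate could be estimated by sampling several
    Reasoner answers and measuring their self-consistency."""
    return 1.0 - abs(2.0 * pass_rate - 1.0)

def diversity_reward(q_emb: np.ndarray, history: list) -> float:
    """Hypothetical redundancy penalty: reward shrinks as the new question's
    embedding approaches its nearest neighbor among past questions."""
    if not history:
        return 1.0
    sims = [
        float(q_emb @ h) / (np.linalg.norm(q_emb) * np.linalg.norm(h))
        for h in history
    ]
    return 1.0 - max(sims)

def questioner_reward(pass_rate, q_emb, history, alpha=0.5):
    """Mix the two signals; alpha is a free weighting knob (an assumption).
    In an RL loop this scalar would drive a policy-gradient update on the
    Questioner role."""
    return alpha * difficulty_reward(pass_rate) + \
        (1.0 - alpha) * diversity_reward(q_emb, history)

# Tiny usage example with random embeddings standing in for a text encoder.
rng = np.random.default_rng(0)
history = [rng.normal(size=64) for _ in range(3)]
new_q = rng.normal(size=64)
print(questioner_reward(pass_rate=0.4, q_emb=new_q, history=history))
```

Shaping difficulty around an intermediate pass rate keeps the Questioner on the frontier of the Reasoner's current ability, while the diversity term counters question redundancy, matching the two failure modes the article says VisPlay's rewards are designed to mitigate.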