VLM Self-Iteration
New Adobe research: no need to "feed" training data anymore; VLMs get smarter by playing games with themselves
Founder Park · 2025-10-13 10:57
Core Insights
- The article discusses the limitations of Vision Language Models (VLMs) stemming from their reliance on human-annotated data and introduces Vision-Zero, a new framework that lets VLMs train themselves without human supervision, similar to AlphaGo's self-play method [3][9][24]

Group 1: Vision-Zero Framework
- Vision-Zero provides a general framework for zero-supervision training of VLMs, enabling them to learn through self-play in a game-like environment [3][9]; a minimal sketch of such a loop appears after this summary
- The framework accepts any form of image input, improving the model's ability to generalize across domains [9][17]
- The iterative self-play optimization algorithm (Iterative-SPO) proposed in Vision-Zero addresses performance bottlenecks common in traditional self-play methods [9][18]; see the alternating-phase sketch at the end

Group 2: Experimental Results
- Vision-Zero outperformed other state-of-the-art (SOTA) methods that rely on labeled data on reasoning, chart question answering, and vision-centric understanding tasks [3][19]
- The VisionZero-Qwen-7B model improved by roughly 3% on CLEVR and real-world tasks and by 2.8% on chart tasks over baseline methods [19]
- The framework showed strong task generalization, transferring learned skills to broader reasoning and mathematical tasks without explicit training on those tasks [19][24]

Group 3: Addressing Challenges
- Vision-Zero tackles negative transfer, where a model trained on one task performs worse on others, through a multi-capability training strategy [22][24]
- By alternating between different training phases, the framework sustains continuous performance improvement and avoids the local-equilibrium problems common in pure self-play training [18][24]
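The article describes the self-play setup only at a high level. As an illustration of the core idea, learning from game outcomes instead of human labels, here is a runnable toy in Python: a single tabular policy plays both the clue-giver and the guesser in a guessing game over stand-in "images" (integers), and updates itself from the win/lose signal alone. Everything here (the game, the clue symbols, the update rule) is a hypothetical stand-in, not Vision-Zero's actual environment or training code.

```python
import random
from collections import defaultdict

# Toy illustration of label-free self-play (a hypothetical stand-in,
# not Vision-Zero's environment or training code). "Images" are the
# integers 0-9; the same tabular policy plays both the clue-giver and
# the guesser, and the only training signal is whether the guess wins.

CLUES = ["A", "B", "C", "D"]  # four arbitrary clue symbols the giver can send

class SelfPlayPolicy:
    def __init__(self):
        # Positive preference weights; actions are sampled in proportion
        # to them (a cheap substitute for a softmax policy).
        self.give = defaultdict(lambda: {c: 1.0 for c in CLUES})      # image -> clue prefs
        self.pick = defaultdict(lambda: {i: 1.0 for i in range(10)})  # clue -> image prefs

    @staticmethod
    def sample(prefs):
        r, acc = random.uniform(0, sum(prefs.values())), 0.0
        for action, weight in prefs.items():
            acc += weight
            if r <= acc:
                return action
        return action  # float-rounding fallback

    def play_round(self, target, lr=0.2):
        clue = self.sample(self.give[target])      # role 1: describe the image
        guess = self.sample(self.pick[clue])       # role 2: infer from the clue
        reward = 1.0 if guess == target else -0.1  # the game outcome is the only label
        # Reinforce (or suppress) the sampled actions in both roles.
        self.give[target][clue] = max(0.05, self.give[target][clue] + lr * reward)
        self.pick[clue][guess] = max(0.05, self.pick[clue][guess] + lr * reward)
        return reward

policy = SelfPlayPolicy()
for _ in range(20_000):
    policy.play_round(random.randrange(10))        # self-play: no human labels anywhere

wins = sum(policy.play_round(random.randrange(10), lr=0.0) > 0 for _ in range(1_000))
print(f"win rate after self-play: {wins / 1000:.2f}")  # well above the 0.10 chance level
```

The point of the sketch is that the reward comes entirely from the game itself, so the same loop would work over any image source, which is what the "any form of image input" claim above refers to.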
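The article attributes Vision-Zero's escape from self-play plateaus to Iterative-SPO's alternation between training phases, but gives no algorithmic detail. The sketch below illustrates only that control flow: run a self-play phase until the reward plateaus (a local equilibrium between the two roles), then run a second optimization phase that perturbs the equilibrium so self-play can keep improving. The phase contents and the plateau test are assumptions for illustration, not the paper's method.

```python
import random

# Control-flow sketch of alternating-phase training in the spirit of
# Iterative-SPO as summarized above. The phase bodies are stand-ins.

def plateaued(history, window=3, eps=1e-3):
    """Heuristic stop signal: the last `window` rewards stopped improving."""
    return len(history) >= window and max(history[-window:]) - min(history[-window:]) < eps

def iterative_training(model, self_play_step, other_step, rounds=5):
    for _ in range(rounds):
        rewards = []
        # Phase A: pure self-play until the two roles settle into a
        # local equilibrium and the game reward flattens out.
        while not plateaued(rewards):
            rewards.append(self_play_step(model))
        # Phase B: a different objective nudges the policy off that
        # equilibrium so the next self-play phase can improve again.
        other_step(model)

# Stand-in phases to make the sketch runnable:
state = {"reward": 0.0}

def self_play_step(m):
    m["reward"] = min(1.0, m["reward"] + random.uniform(0.0, 0.05))
    return m["reward"]

def other_step(m):
    m["reward"] *= 0.9  # simulated perturbation away from the plateau

iterative_training(state, self_play_step, other_step)
print(f"reward after alternating phases: {state['reward']:.2f}")
```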