Teaching VLMs to "Keep the World in Mind": VAGEN Uses Multi-Turn RL to Turn Visual Intelligence into a "World Model" Reasoning Machine
机器之心 (Jiqizhixin) · 2025-10-25 03:20
Core Insights
- The article discusses the limitations of Vision-Language Models (VLMs) in complex visual tasks: because their perception of the world is partial and noisy, they tend to act impulsively rather than deliberately [2][6].
- The VAGEN framework aims to improve VLMs by teaching them to construct an internal world model before acting, which promotes a more structured thinking process [3][12].

Group 1: VAGEN Framework
- VAGEN enforces a structured "thinking template" for VLMs with two core steps: State Estimation (describing the current state) and Transition Modeling (predicting what happens next) [7][11].
- The framework uses reinforcement learning (RL) to reward this structured thinking process, and the "World Modeling" strategy significantly outperforms both the "No Think" and "Free Think" approaches [12][32].

Group 2: Internal Monologue and Reward Mechanism
- The research explores the best format for the agent's internal monologue and finds that the optimal representation depends on the nature of the task [13][14].
- VAGEN introduces two key components in its reward mechanism: a World Modeling Reward, which provides immediate feedback after each thought process, and Bi-Level GAE (Generalized Advantage Estimation) for efficient reward distribution [18][20].

Group 3: Performance Results
- The VAGEN-Full model, based on a 3B VLM, achieved an overall score of 0.82 across five diverse tasks, outperforming several other models, including GPT-5 [27][30].
- VAGEN-Full surpasses untrained models and also exceeds the performance of several proprietary models, which indicates that the approach is effective at strengthening VLM capabilities [30][32].
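
To make the thinking template and the World Modeling Reward from the summary more concrete, here is a minimal Python sketch. It is an illustration only, not VAGEN's implementation: the tag names (`<state>`, `<prediction>`, `<action>`), the `Turn` class, the function names, and the exact-match scoring are all assumptions introduced here, since the summary does not specify the template syntax or how the reward is computed.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag names: the summary does not give the template's syntax.
_TEMPLATE = re.compile(
    r"<state>(?P<state>.*?)</state>\s*"
    r"<prediction>(?P<prediction>.*?)</prediction>\s*"
    r"<action>(?P<action>.*?)</action>",
    re.DOTALL,
)


@dataclass
class Turn:
    state: str       # State Estimation: what the agent believes it sees now
    prediction: str  # Transition Modeling: what it expects after acting
    action: str      # the action actually sent to the environment


def parse_turn(response: str) -> Optional[Turn]:
    """Return the structured fields, or None if the template is not followed."""
    match = _TEMPLATE.search(response)
    if match is None:
        return None
    return Turn(*(match.group(k).strip() for k in ("state", "prediction", "action")))


def world_modeling_reward(
    turn: Optional[Turn],
    true_state: str,
    true_next_state: str,
    weight: float = 0.5,
) -> float:
    """Toy per-turn reward, given immediately after each thought process.

    Assumption: exact string match stands in for whatever scoring the
    real system uses to compare the agent's beliefs with ground truth.
    """
    if turn is None:
        return 0.0  # malformed output earns no world-modeling reward
    state_ok = float(turn.state == true_state)
    prediction_ok = float(turn.prediction == true_next_state)
    return weight * (state_ok + prediction_ok)
```

In this sketch, a response that follows the template and gets both the current state and the predicted next state right earns the full reward, one that gets only one of them right earns half, and a free-form response that skips the template earns nothing. That is one simple way to reward the structured thinking process at every turn instead of only at the end of an episode.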