Core Viewpoint
- The article discusses a new paradigm called Visual Planning via Reinforcement Learning (VPRL), which allows models to reason purely over images, without relying on a language intermediary, and to outperform text-based reasoning methods [1][4][34].

Group 1: VPRL Framework
- VPRL uses Group Relative Policy Optimization (GRPO) to post-train large vision models, significantly outperforming text-based reasoning across several visual navigation tasks [3][4].
- VPRL reaches 80% accuracy, exceeding text-based reasoning by at least 40%, validating that visual planning is significantly better than text planning [4][34].
- The framework has two phases: policy initialization via random trajectories, followed by reinforcement learning optimization that guides the model toward effective planning [10][11][13].

Group 2: Experimental Setup
- The research team selected three representative tasks that can be expressed and executed entirely visually: FrozenLake, Maze, and MiniBehavior [19][20][21].
- The models were trained exclusively on visual data, ensuring no exposure to text during pre-training [23].
- Evaluation metrics included Exact Match rate (EM) and Progress Rate (PR), measuring whether the model generates optimal planning trajectories [25].

Group 3: Experimental Results
- Visual planning (VPFT and VPRL) outperformed text planning across all tasks, with VPRL achieving an average EM of 80.6%, far surpassing the text baseline (Gemini 2.5 Pro, average EM of 43.7%) [27].
- VPRL improved on the supervised baseline VPFT by over 20%, excelling in particular on the complex MiniBehavior task with an EM of 75.8% [28].
- VPRL was also more robust: its performance degraded gradually as grid size increased, while text-based models dropped off sharply [31].
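The group-relative idea at the core of GRPO can be sketched briefly: several candidate trajectories are sampled for the same input, each is scored by a reward, and each trajectory's advantage is its reward normalized against its own group's mean and standard deviation. The sketch below is an illustrative reimplementation of that normalization step only, not the paper's code, and the reward values are invented:

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each sampled trajectory's
    reward against the mean and std of its sampling group, so that
    above-average trajectories are reinforced and below-average ones
    are suppressed, without a learned value function."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: 4 candidate visual plans sampled for the same
# state, scored by a made-up progress reward in [0, 1].
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because advantages are centered within each group, they sum to (approximately) zero: the best plan gets a positive advantage, the worst a negative one, and average plans contribute little gradient.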
Group 4: Conclusion - The findings validate the feasibility of pure visual reasoning and highlight the potential of the VPRL framework to surpass text models in visual navigation tasks, promoting the development of multimodal reasoning towards a more intuitive visual direction [34].
Relying purely on "imagined" images, large-model reasoning accuracy soars to 80% | New research from Cambridge and Google
量子位 (QbitAI) · 2025-05-21 04:01