Perception error rate reduced by 30.5%: an implicit perception loss makes the model actively "open its eyes" | UIUC & Alibaba Tongyi
QbitAI · 2025-07-11 04:00

Core Viewpoint
The article introduces PAPO (Perception-Aware Policy Optimization), a new reinforcement learning algorithm developed by the University of Illinois Urbana-Champaign and Alibaba's Tongyi Laboratory that enhances multimodal reasoning by integrating perception into the learning process [1][3].

Group 1: Introduction of PAPO
- PAPO aims to address the limitations of existing reinforcement learning algorithms such as GRPO, which excel at text reasoning but struggle in multimodal scenarios because they underuse visual information [2][3].
- The algorithm introduces an innovative implicit perception loss that relies on internal supervisory signals, allowing multimodal models to learn perception alongside reasoning [3][6].

Group 2: Error Analysis and Findings
- A systematic error analysis revealed that the primary bottleneck in multimodal reasoning is the accuracy of visual perception, not logical reasoning capability [6][7].
- An analysis of 200 error cases from a Qwen2.5-VL-3B model trained with GRPO showed that 67% of errors were due to perception inaccuracies, while only 18% were due to reasoning errors [9][14].

Group 3: Technical Innovations of PAPO
- PAPO's core innovations are a perception information gain ratio and a maximized KL-divergence term that encourages the model to produce different output distributions for the original image and a corrupted (masked) version of it [19][20].
- The complete PAPO objective function is a simple extension of GRPO that integrates this KL-divergence term [21].

Group 4: Experimental Validation
- Comprehensive evaluation on eight multimodal reasoning benchmarks showed that PAPO consistently outperformed GRPO, achieving an overall average improvement of 4.4% and a 30.5% reduction in perception errors [26][28].
- PAPO exhibited faster convergence and more stable training dynamics than GRPO, showing improvements after as few as 25 training steps [29][30].

Group 5: Visual Dependency Analysis
- An analysis of visual dependency in mainstream multimodal reasoning benchmarks indicated that many tasks can be answered correctly without the visual input, i.e., their actual visual dependency is limited [50][51].
- PAPO showed its largest gains on high-visual-dependency tasks, with nearly an 8% improvement, while maintaining consistent improvements on medium- and low-dependency tasks [53][54].

Group 6: Practical Applications
- Several practical cases illustrate PAPO's effectiveness on complex geometric problems, such as accurately computing relationships in right triangles and distinguishing between different objects [55][63][64].
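The implicit perception loss described in Group 3 can be sketched as follows. This is a minimal single-distribution illustration, not the actual PAPO implementation: the function names, the `gamma` weight, and the use of one token-level distribution (rather than full response sequences under the policy) are all illustrative assumptions. The key idea from the article is that PAPO *maximizes* the KL divergence between the model's outputs conditioned on the original image versus a corrupted one, so the model is rewarded for actually using visual evidence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.
    Terms with p_i == 0 contribute nothing, by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def papo_objective(grpo_objective, probs_original, probs_masked, gamma=0.01):
    """Hypothetical sketch of the PAPO objective as a GRPO extension:
    the GRPO term plus gamma times the KL divergence between the policy's
    output distribution given the original image (probs_original) and given
    a corrupted/masked image (probs_masked). Maximizing this KL term pushes
    the two distributions apart, i.e., makes the output depend on the image."""
    perception_gain = kl_divergence(probs_original, probs_masked)
    return grpo_objective + gamma * perception_gain
```

If the model ignores the image, the two distributions coincide, the KL term is zero, and the perception bonus vanishes; only an image-dependent policy earns the extra reward.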