ICCV 2025 | UV-CoT: A New Breakthrough in Unsupervised Visual Reasoning, with Preference Optimization Reshaping Image-Level Chain-of-Thought
机器之心 · 2025-07-28 04:24
Core Viewpoint
- The article introduces UV-CoT, a novel unsupervised visual reasoning framework that improves model reasoning capability and interpretability on visual understanding tasks through an image-level chain-of-thought (CoT) approach [2][3][25].

Group 1: Background and Challenges
- Existing models rely on supervised fine-tuning (SFT) strategies that require extensive labeled data, leading to high annotation costs and limited scalability [6][7].
- SFT methods suffer from the high labor cost of annotating key image regions and reasoning paths, and from limited generalization due to reliance on a single type of training signal [7].

Group 2: UV-CoT Framework
- UV-CoT is designed to mimic the human "key regions → reasoning process" pattern of visual understanding, employing an unsupervised data generation and preference optimization mechanism [3][4].
- The framework combines an automated preference-data generation and evaluation pipeline with an improved preference optimization algorithm, Score-DPO (sDPO), to achieve unsupervised image-level chain-of-thought learning [8][11].

Group 3: Methodology
- UV-CoT uses a target model to generate diverse intermediate reasoning responses for each image-question pair, and an evaluation model to score the selected regions and their impact on subsequent answers [13].
- The preference dataset is constructed by randomly sampling preference pairs from the generated responses; the highest-scoring response chains are retained to continue the reasoning [14].

Group 4: Performance and Results
- UV-CoT delivers significant gains over existing supervised chain-of-thought models, outperforming Visual-CoT-7B and LLaVA-1.5-7B across six benchmarks [20][22].
- UV-CoT's self-evaluation capability yields high-quality bounding boxes, surpassing LLaVA-1.5-7B by 4.8% and closely approaching the 12B-parameter OmniLMM-12B [23].

Group 5: Conclusion
- UV-CoT presents an innovative approach to unsupervised visual reasoning: it eliminates the dependency on manual annotations and enables automatic identification of key image regions and optimization of the reasoning over them, laying a solid foundation for future research in unsupervised visual understanding [25].
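The pipeline described in Groups 2–3 can be sketched in miniature. The sketch below is an illustrative assumption, not the paper's exact formulation: `build_preference_pairs`, the `min_gap` threshold, and the way the evaluator score gap enters the loss as an extra margin term are all hypothetical names and choices, layered on the standard DPO loss (a sigmoid over the difference of policy-vs-reference log-probability ratios).

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def build_preference_pairs(responses, min_gap=1.0):
    """responses: list of (reasoning_chain, evaluator_score) tuples.
    Form (preferred, dispreferred) pairs whenever the evaluator score gap
    exceeds min_gap -- a hypothetical selection rule standing in for the
    random pair sampling described in the article."""
    pairs = []
    for winner in responses:
        for loser in responses:
            if winner[1] - loser[1] >= min_gap:
                pairs.append((winner, loser))
    return pairs

def sdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              score_w, score_l, beta=0.1):
    """Toy Score-DPO-style loss (assumed form): the usual DPO margin between
    policy and reference log-probs for the preferred (w) and dispreferred (l)
    responses, with the evaluator score gap folded in as an extra margin."""
    dpo_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    score_gap = score_w - score_l  # larger gap -> stronger preference signal
    return -math.log(sigmoid(dpo_margin + beta * score_gap))
```

Under this reading, a policy that assigns higher probability to the higher-scored reasoning chain drives the loss down, which is the behavior the preference optimization step needs.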