Visual Self-Supervised Learning
AI Plays Jigsaw Puzzles and Sees a Surge in Visual Understanding: An Annotation-Free Post-Training Paradigm for Multimodal Large Models That Moves Beyond Text-Centric Training
36Kr · 2025-10-15 12:27
Core Insights
- The article discusses the significance of a new post-training paradigm for multimodal large language models (MLLMs) that emphasizes visual self-supervised learning, particularly through a method called Visual Jigsaw [1][12].

Group 1: Visual Jigsaw Methodology
- Visual Jigsaw is designed as a self-supervised task that reconstructs visual information by predicting the correct order of shuffled visual elements, and it applies to images, videos, and 3D data [5][12].
- Training uses a reinforcement learning algorithm called GRPO, incorporating a tiered reward mechanism based on the accuracy of the model's predicted ordering [5][6]; a sketch of one such task and reward appears after this list.

Group 2: Experimental Results
- Image Jigsaw training led to consistent improvements across three vision-centric benchmarks, enhancing fine-grained perception, spatial understanding from monocular images, and compositional visual reasoning [7][8].
- Video Jigsaw training demonstrated stable gains on video understanding benchmarks, particularly on tasks requiring temporal reasoning [9][10].
- 3D Jigsaw training yielded significant improvements across a range of 3D benchmark tasks, especially depth estimation, indicating enhanced overall spatial perception and reasoning capabilities [11][12].

Group 3: Implications and Future Directions
- Visual Jigsaw provides a lightweight, verifiable, and annotation-free self-supervised post-training paradigm, revitalizing visual perception in MLLMs [12].
- The research aims to inspire further self- and weakly-supervised tasks that focus on visual information, enabling better perception and understanding of diverse visual data [12].
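To make the Group 1 mechanism concrete, here is a minimal Python sketch of how an image-jigsaw sample and a tiered, verifiable reward could be built. The grid size, the helper names (`make_image_jigsaw_sample`, `jigsaw_reward`), and the exact partial-credit tiers are illustrative assumptions; the article only states that shuffled visual elements must be reordered and that GRPO training uses a tiered, accuracy-based reward.

```python
# Minimal sketch of an image-jigsaw post-training sample plus a tiered
# reward, assuming only numpy. Grid size and the partial-credit scheme
# are assumptions for illustration, not the paper's exact specification.
import numpy as np

def make_image_jigsaw_sample(image: np.ndarray, grid: int = 2, seed: int = 0):
    """Cut an HxWxC image into grid*grid patches, shuffle them, and
    return (shuffled_patches, target_permutation).

    target_permutation[i] is the original index of the patch now shown
    at position i; the model must output this ordering to solve the puzzle.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(patches))
    shuffled = [patches[i] for i in perm]
    return shuffled, perm.tolist()

def jigsaw_reward(predicted: list[int], target: list[int]) -> float:
    """Tiered, verifiable reward: full credit for an exactly correct
    ordering, partial credit scaled by correctly placed positions,
    zero for outputs that are not a valid permutation."""
    if sorted(predicted) != sorted(range(len(target))):
        return 0.0  # malformed output: not a permutation of all indices
    if predicted == target:
        return 1.0  # exact reconstruction
    correct = sum(p == t for p, t in zip(predicted, target))
    return 0.5 * correct / len(target)  # partial-credit tier

# Usage: build one sample and score a (hypothetical) model prediction.
img = np.zeros((224, 224, 3), dtype=np.uint8)
shuffled, target = make_image_jigsaw_sample(img, grid=2, seed=42)
print(jigsaw_reward([0, 1, 2, 3], target))
```

Because the target permutation is generated when the sample is constructed, the reward can be checked mechanically against it, which is what makes this post-training signal verifiable and annotation-free.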