Visual Jigsaw
AI Plays Jigsaw Puzzles and Sharply Boosts Visual Understanding: Moving Beyond Text-Centric Training with an Annotation-Free Post-Training Paradigm for Multimodal Large Models
36Ke · 2025-10-15 12:27
Core Insights
- The article discusses the significance of a new post-training paradigm for multimodal large language models (MLLMs) that emphasizes visual self-supervised learning, particularly through a method called Visual Jigsaw [1][12].

Group 1: Visual Jigsaw Methodology
- Visual Jigsaw is designed as a self-supervised task that focuses on reconstructing visual information by predicting the correct order of shuffled visual elements, applicable to images, videos, and 3D data [5][12].
- The training process uses a reinforcement learning algorithm called GRPO, incorporating a tiered reward mechanism based on the accuracy of the model's predictions [5][6].

Group 2: Experimental Results
- Image Jigsaw training led to consistent improvements across three vision-centric benchmarks, enhancing fine-grained perception, spatial understanding from monocular images, and compositional visual reasoning [7][8].
- Video Jigsaw training demonstrated stable gains on video understanding benchmarks, particularly in tasks requiring temporal reasoning [9][10].
- 3D Jigsaw training produced significant improvements on various 3D benchmark tasks, especially depth estimation, indicating enhanced overall spatial perception and reasoning capabilities [11][12].

Group 3: Implications and Future Directions
- Visual Jigsaw provides a lightweight, verifiable, and annotation-free self-supervised post-training paradigm, revitalizing visual perception in MLLMs [12].
- The research aims to inspire further development of self- or weakly supervised tasks that focus on visual information, enabling better perception and understanding of diverse visual data [12].
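The tiered reward described above can be sketched as a simple verifiable scoring function. The paper's exact reward shape is not given here, so this is an assumption: full reward for a perfect permutation, partial credit scaled by the number of correctly placed elements, and zero for an output that is not a valid permutation.

```python
def jigsaw_reward(predicted, target, partial_weight=0.5):
    """Score a predicted ordering of shuffled visual elements.

    predicted/target: lists of element indices, e.g. [2, 0, 3, 1].
    Returns 1.0 for an exact match, a down-weighted partial reward for
    a valid but imperfect permutation, and 0.0 for malformed output.
    (The tier structure here is an illustrative assumption, not the
    paper's exact formula.)
    """
    if sorted(predicted) != sorted(target):
        return 0.0  # not a valid permutation of the expected indices
    if predicted == target:
        return 1.0  # perfect reconstruction gets the full reward
    # partial credit: fraction of positions already correct, down-weighted
    correct = sum(p == t for p, t in zip(predicted, target))
    return partial_weight * correct / len(target)
```

A reward of this form is directly verifiable from the data itself (the unshuffled order is the ground truth), which is what makes the task annotation-free and suitable for GRPO-style reinforcement learning.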
量子位 (QbitAI) · 2025-10-15 10:20
Contributed by the VisualJigsaw team · 量子位 | WeChat account QbitAI

In the wave of post-training for multimodal large models, reinforcement-learning-driven paradigms have become a key direction for improving reasoning and general capability. However, most existing methods remain text-centric, with the visual input serving only as a passive auxiliary signal. By contrast, we argue that revisiting the potential of visual self-supervised learning at the post-training stage, and designing vision-centric post-training, is equally crucial for strengthening a multimodal large model's fine-grained, in-depth understanding of visual information itself.

To this end, the latest paper from MMLab@Nanyang Technological University, "Visual Jigsaw Post-Training Improves MLLMs", proposes a brand-new post-training task for multimodal large models: Visual Jigsaw.

It recasts the classic self-supervised jigsaw task as a core objective of the post-training stage, letting the model explicitly strengthen its own visual perception and understanding without relying on extra annotations or a visual generation module. Its effectiveness is validated across three visual modalities: images, video, and 3D.

Visual Jigsaw Method Overview

For each visual modality, the Visual Jigsaw task is designed as follows:

Image Jigsaw: the image is divided in 2D into equally sized sub-images; after shuffling, the model must recover the correct spatial ...
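The Image Jigsaw setup described above can be sketched as a small data-construction routine. This is a minimal illustration assuming a 2x2 grid and an image represented as nested lists; the paper's actual grid size, patch representation, and prompt format may differ.

```python
import random

def make_image_jigsaw(image, grid=2, rng=random):
    """Split `image` (H x W nested lists) into grid*grid equal patches,
    shuffle them, and return (shuffled_patches, permutation).

    permutation[i] is the original index of shuffled patch i, i.e. the
    ordering the model must predict to restore the image. No labels are
    needed: the target comes from the shuffle itself.
    """
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid  # patch height and width
    patches = []
    for r in range(grid):
        for c in range(grid):
            patch = [row[c * pw:(c + 1) * pw]
                     for row in image[r * ph:(r + 1) * ph]]
            patches.append(patch)
    order = list(range(grid * grid))
    rng.shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order
```

Because the ground-truth permutation is generated on the fly, every image yields a verifiable training signal for free, which is the sense in which the paradigm is annotation-free.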