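AI Plays Jigsaw Puzzles to Boost Visual Understanding: An Annotation-Free Post-Training Paradigm for Multimodal Large Models That Moves Beyond Text-Centric Training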

Core Insights
- The article argues for a vision-centric approach to post-training multimodal large models, highlighting the potential of visual self-supervised learning to improve how these models understand visual information [1]
- It introduces Visual Jigsaw, a new post-training task built on reordering shuffled visual data, which requires neither additional annotations nor visual generation modules [1]

Visual Jigsaw Method Overview
- Visual Jigsaw is a general-purpose ordering task: visual data (images, videos, 3D) is divided into segments, the segments are shuffled, and the model must predict the correct order [5]
- Training uses the reinforcement learning algorithm GRPO to optimize the model [5]

Reward Mechanism
- A tiered reward scores each prediction: a fully correct ordering receives a reward of 1, a partially correct ordering receives a reward proportional to its correctness and scaled by a discount factor, and an invalid output receives no reward [6]

Task Design for Different Visual Modalities
- Image Jigsaw: an image is divided into equal-sized sub-images in 2D space, and the model must restore their correct spatial order [7]
- Video Jigsaw: a video is segmented into equal-length clips, and the model must reconstruct the original temporal order [8]
- 3D Jigsaw: depth points are sampled from an RGB-D image, marked at their positions, and labeled with shuffled indices; the model must order them from nearest to farthest [9]

Experimental Results
- Visual Jigsaw was validated on image, video, and 3D modalities, with significant improvements in fine-grained perception and understanding, spatial understanding from monocular images, and compositional visual reasoning [10][11]
- Image Jigsaw: models improved consistently across multiple vision-centric benchmarks, strengthening fine-grained perception and understanding [10][11]
- Video Jigsaw: the method significantly improved the model's understanding of temporal relationships and overall video comprehension [14]
- 3D Jigsaw: notable gains were observed in depth estimation tasks and in overall 3D spatial reasoning [15]

Conclusion
- Visual Jigsaw is a lightweight, verifiable, annotation-free self-supervised post-training paradigm that strengthens visual perception in multimodal large models and encourages further exploration of vision-focused self-supervised and weakly supervised tasks [16]
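
To make the shuffle-and-reorder task and the tiered reward more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The function names `make_jigsaw` and `jigsaw_reward`, the 0-based index convention, and the default discount `gamma=0.2` are all hypothetical. It also assumes that "partial correctness" means the fraction of positions predicted correctly, and that an "invalid" output is one that is not a permutation of the expected indices.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_jigsaw(pieces: Sequence, rng: random.Random) -> Tuple[list, List[int]]:
    """Shuffle the pieces and return (shuffled, answer).

    answer[k] is the index in `shuffled` of the piece that belongs at
    original position k, so [shuffled[i] for i in answer] restores the input.
    The pieces could be image patches, video clips, or depth-point labels.
    """
    order = list(range(len(pieces)))
    rng.shuffle(order)                      # shuffled[i] = pieces[order[i]]
    shuffled = [pieces[i] for i in order]
    answer = [order.index(k) for k in range(len(pieces))]
    return shuffled, answer


def jigsaw_reward(
    pred: Optional[Sequence[int]],
    target: Sequence[int],
    gamma: float = 0.2,                     # hypothetical discount factor
) -> float:
    """Tiered reward: 1 for exact, discounted partial credit, 0 if invalid."""
    n = len(target)
    # Invalid output: missing, or not a permutation of 0..n-1.
    if pred is None or sorted(pred) != list(range(n)):
        return 0.0
    # Fully correct ordering.
    if list(pred) == list(target):
        return 1.0
    # Partial credit: fraction of correct positions, scaled by the discount.
    correct = sum(p == t for p, t in zip(pred, target))
    return gamma * correct / n
```

In a training loop, each sampled model response would be parsed into a list of indices and scored with `jigsaw_reward`; the resulting scalar is the kind of verifiable signal a policy-optimization method such as GRPO can consume. Scaling partial credit by a discount below 1 keeps a mostly-correct guess well short of the reward for an exact solution.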