OpenVision 2: A Generative Pre-trained Visual Encoder Built on Simplicity

Core Insights
- The article discusses OpenVision 2, a generative visual pre-training model that simplifies the training process without sacrificing performance while significantly improving training efficiency [2][21].

Group 1: OpenVision 2 Overview
- OpenVision 2 is a new direction in generative visual pre-training, proposed by researchers from UCSC, Apple, and UCB. It improves training efficiency while scaling to 1 billion parameters [2][21].
- The model simplifies the training pipeline of its predecessor, OpenVision, by removing the text encoder and contrastive learning and keeping a single "image → description" generation objective [9][21].

Group 2: Performance and Efficiency
- Experimental results show that OpenVision 2 performs comparably to or better than OpenAI's CLIP and Google's SigLIP on various multimodal benchmarks, and it does particularly well on OCR and text-related tasks [14][21].
- Training is 1.5 to 2 times faster and memory usage is cut by nearly half, which allows larger batch sizes and more efficient training [14][16].

Group 3: Key Innovations
- OpenVision 2 randomly drops about 2/3 of the visual tokens during pre-training, which reduces the computational burden on the text decoder and improves training efficiency [10][22].
- The model relies on high-quality synthetic descriptions as its sole supervision signal. This signal aligns closely with downstream tasks and reduces the objective mismatch between pre-training and application [22][21].

Group 4: Community Impact
- The research challenges the long-standing dominance of contrastive learning by demonstrating that powerful visual encoders can be trained within a purely generative framework, paving the way for future multimodal foundation models [21][22].
- Over 25 different models of varying scales and configurations have been open-sourced, providing a reproducible and scalable resource base for both academia and industry [21].
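
The token-dropping idea under Key Innovations can be illustrated with a short sketch. The article states only that about 2/3 of the visual tokens are randomly dropped during pre-training, so everything else here is an assumption for illustration: the function name `random_drop_tokens`, the uniform per-image sampling, the choice to keep surviving tokens in their original order, and the 576-token input size are all hypothetical and are not taken from the OpenVision 2 code.

```python
import numpy as np


def random_drop_tokens(tokens, keep_ratio=1.0 / 3.0, rng=None):
    """Randomly keep a fraction of visual tokens for each image.

    Hypothetical helper, not the OpenVision 2 implementation.
    tokens: array of shape (batch, num_tokens, dim), e.g. encoder outputs.
    Returns (kept_tokens, kept_indices).
    """
    if rng is None:
        rng = np.random.default_rng()
    batch, num_tokens, _ = tokens.shape
    num_keep = max(1, round(num_tokens * keep_ratio))

    # Draw an independent random ordering per image via argsort of noise,
    # then keep the first num_keep positions of that ordering.
    noise = rng.random((batch, num_tokens))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]

    # Assumption of this sketch: surviving tokens stay in original order.
    keep_idx = np.sort(keep_idx, axis=1)

    # Gather the kept tokens along the token axis.
    kept = np.take_along_axis(tokens, keep_idx[:, :, None], axis=1)
    return kept, keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.standard_normal((2, 576, 8))  # assumed sizes
    kept, idx = random_drop_tokens(visual_tokens, rng=rng)
    print(kept.shape)  # (2, 192, 8)
```

With these assumed sizes, the text decoder would attend to 192 visual tokens per image instead of 576, which is the kind of reduction in decoder workload the article attributes to the technique.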