Core Insights
- The article traces the evolution of the Visual Tokenizer and its role in understanding the world, arguing that its next step is to comprehend high-level semantics rather than focus solely on pixel-level reconstruction [5][6][9].

Group 1: Visual Tokenizer Research
- MiniMax and researchers from Huazhong University of Science and Technology have released a new study on Visual Tokenizer Pre-training (VTP), which has drawn significant attention in the industry [6].
- Traditional visual generation models typically follow a two-step process: compress images with a tokenizer (such as a VAE) and then train a generative model in the resulting latent space [6]; a minimal sketch of this two-stage setup appears after this summary.
- The study indicates that generative-model performance can be improved not only by scaling the main model but also by strengthening the tokenizer [6][8].
- The research finds that focusing solely on pixel-level reconstruction can degrade downstream generative quality, because traditional tokenizers favor low-level pixel information over high-level semantic representation [7][8].
- VTP proposes that introducing semantic understanding into tokenizer pre-training makes latent representations more sensitive to high-level semantics without over-memorizing pixel details [8][9].

Group 2: VTP Framework and Findings
- The VTP framework integrates image-text contrastive learning (as in CLIP), self-supervised learning (as in DINOv2), and a traditional reconstruction loss to optimize the latent space of visual tokenizers [9][10].
- The framework retains a lightweight reconstruction loss for visual fidelity while adding two semantics-oriented objectives: a self-supervised loss based on DINOv2 and a contrastive loss based on CLIP [9][10]; see the loss-composition sketch after this summary.
- Experiments show a strong positive correlation between the semantic quality of the latent space (measured by zero-shot classification accuracy, illustrated by the probe sketch after this summary) and generative performance (measured by FID) [11].
- The largest VTP model (approximately 700 million parameters) reached 78.2% zero-shot classification accuracy on ImageNet with a reconstruction fidelity (rFID) of 0.36, comparable to specialized representation-learning models [11][12].
- Replacing the tokenizer in standard diffusion-model training with the VTP tokenizer reduced FID by 65.8% relative to the baseline and accelerated convergence roughly fourfold [12][13].
- This indicates that investing more compute in tokenizer pre-training can substantially improve downstream generative quality without increasing the complexity of the generative model itself [13].
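
To make the two-stage setup described in Group 1 concrete, below is a minimal PyTorch-style sketch of one training step for a latent diffusion model on top of a frozen tokenizer. The `tokenizer.encode` and `denoiser` interfaces, the linear noise schedule, and the epsilon-prediction objective are illustrative assumptions, not the specific configuration used in the study.

```python
# Minimal sketch of the standard two-stage pipeline: stage 1 freezes a pretrained
# tokenizer, stage 2 trains a diffusion model in its latent space.
# Module names (tokenizer, denoiser) are hypothetical placeholders.
import torch

def diffusion_training_step(tokenizer, denoiser, optimizer, images):
    with torch.no_grad():                      # the tokenizer stays frozen in stage 2
        latents = tokenizer.encode(images)     # compress pixels into the latent space

    # Standard epsilon-prediction diffusion objective on the latents.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)   # timestep in [0, 1)
    alpha = (1.0 - t).view(-1, 1, 1, 1)        # toy linear noise schedule (illustrative only)
    noisy_latents = alpha.sqrt() * latents + (1.0 - alpha).sqrt() * noise

    pred_noise = denoiser(noisy_latents, t)
    loss = torch.nn.functional.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the tokenizer is frozen here, everything the generator can learn is bounded by what its latent space encodes, which is why the study's focus on tokenizer pre-training matters for downstream quality.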
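
The summary describes VTP as combining a lightweight reconstruction loss with a DINOv2-style self-supervised loss and a CLIP-style image-text contrastive loss. The sketch below shows one plausible way to compose such an objective; the pooling, projection head, loss weights, and temperature are assumptions for illustration, not values taken from the paper.

```python
# Illustrative composition of the three objectives attributed to VTP:
# (1) lightweight pixel reconstruction, (2) DINOv2-style feature distillation,
# (3) CLIP-style image-text contrastive learning. All helpers are hypothetical.
import torch
import torch.nn.functional as F

def vtp_style_loss(encoder, decoder, proj, dino_teacher, text_encoder,
                   images, captions, w_rec=1.0, w_dino=1.0, w_clip=1.0):
    latents = encoder(images)                      # tokenizer latents, shape [B, C, H, W]
    pooled = proj(latents.mean(dim=(2, 3)))        # global pooling + projection (assumption)

    # (1) Lightweight pixel reconstruction keeps visual fidelity.
    loss_rec = F.mse_loss(decoder(latents), images)

    # (2) DINOv2-style semantic distillation: align pooled latents with frozen teacher features.
    with torch.no_grad():
        teacher = dino_teacher(images)             # frozen features, shape [B, D]
    loss_dino = 1.0 - F.cosine_similarity(pooled, teacher, dim=-1).mean()

    # (3) CLIP-style image-text contrastive loss over the batch.
    img_emb = F.normalize(pooled, dim=-1)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)
    logits = img_emb @ txt_emb.t() / 0.07          # 0.07 is a common CLIP temperature default
    targets = torch.arange(images.shape[0], device=images.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    return w_rec * loss_rec + w_dino * loss_dino + w_clip * loss_clip
```

The separate weights make the trade-off explicit: pushing the reconstruction term too hard recovers the pixel-memorizing behavior the study warns about, while the two semantic terms pull the latent space toward the high-level structure the downstream generator benefits from.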
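
Group 2 quantifies the semantic quality of the latent space with zero-shot ImageNet classification. A common way to run such a probe is to match pooled latent embeddings against text embeddings of the class names, as in the sketch below; the module and variable names are illustrative assumptions rather than the paper's evaluation code.

```python
# Sketch of a zero-shot classification probe over the tokenizer's latent space:
# classify each image by nearest text embedding among the class-name prompts.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(encoder, proj, text_encoder, loader, class_prompts):
    class_emb = F.normalize(text_encoder(class_prompts), dim=-1)   # [num_classes, D]
    correct, total = 0, 0
    for images, labels in loader:
        feats = proj(encoder(images).mean(dim=(2, 3)))             # pooled latent -> embedding
        feats = F.normalize(feats, dim=-1)                         # [B, D]
        preds = (feats @ class_emb.t()).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

The accuracy returned by this kind of probe is the quantity the study reports as correlating positively with downstream generative quality (lower FID).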
Beyond compression, should the Visual Tokenizer also understand the world?
机器之心·2025-12-28 01:30