Visual Tokenizer
Beyond Compression: Should the Visual Tokenizer Also Understand the World?
机器之心· 2025-12-28 01:30
Core Insights
- The article discusses the evolution of the visual tokenizer and argues that the next step in its development is comprehension of high-level semantics rather than a sole focus on pixel-level reconstruction [5][6][9].

Group 1: Visual Tokenizer Research
- MiniMax and researchers from Huazhong University of Science and Technology have released a new study on Visual Tokenizer Pre-training (VTP), which has drawn significant interest in the industry [6].
- Traditional visual generation models typically follow a two-step process: compressing images with a tokenizer (such as a VAE) and then training a generative model in the resulting latent space [6].
- The study indicates that generative performance can be improved not only by scaling the main model but also by strengthening the tokenizer [6][8].
- The research finds that optimizing solely for pixel-level reconstruction can degrade downstream generative quality, because traditional tokenizers favor low-level pixel information over high-level semantic representation [7][8].
- VTP proposes introducing semantic understanding into tokenizer pre-training so that latent representations become more sensitive to high-level semantics without over-memorizing pixel details [8][9].

Group 2: VTP Framework and Findings
- The VTP framework combines image-text contrastive learning (as in CLIP), self-supervised learning (as in DINOv2), and a traditional reconstruction loss to optimize the latent space of visual tokenizers [9][10].
- The framework retains a lightweight reconstruction loss for visual fidelity while adding two semantics-oriented objectives: a DINOv2-style self-supervised loss and a CLIP-style contrastive loss (see the loss sketch after this list) [9][10].
- Experiments show a strong positive correlation between the semantic quality of the latent space (measured by zero-shot classification accuracy) and generative performance (measured by FID) [11].
- The largest VTP model (roughly 700 million parameters) reaches 78.2% zero-shot classification accuracy on ImageNet with a reconstruction fidelity (rFID) of 0.36, comparable to specialized representation-learning models [11][12].
- Swapping the VTP tokenizer into a standard diffusion-model training run reduced FID by 65.8% relative to the baseline and accelerated convergence roughly fourfold [12][13].
- This suggests that investing more compute in tokenizer pre-training can substantially improve downstream generative quality without increasing the complexity of the generative model itself [13].
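To make the combined objective concrete, below is a minimal PyTorch-style sketch of how a reconstruction term, a DINOv2-style self-supervised distillation term, and a CLIP-style image-text contrastive term might be summed in a single tokenizer training step. The module names (encoder, decoder, dino_teacher, clip_text_encoder, projection heads), loss weights, and shapes are illustrative assumptions, not the VTP paper's released implementation.

```python
# Minimal sketch of a multi-objective tokenizer training step (assumptions, not the authors' code).
import torch
import torch.nn.functional as F

def vtp_style_step(encoder, decoder, dino_teacher, clip_text_encoder, proj_dino, proj_clip,
                   images, text_tokens, w_rec=1.0, w_ssl=1.0, w_con=1.0, temperature=0.07):
    """One step combining reconstruction, self-supervised, and contrastive losses."""
    z = encoder(images)                          # latent features, assumed shape (B, N, D)

    # 1) Lightweight pixel reconstruction loss keeps visual fidelity.
    recon = decoder(z)
    loss_rec = F.l1_loss(recon, images)

    # 2) Self-supervised distillation: align latents with a frozen DINOv2-style teacher.
    with torch.no_grad():
        target = dino_teacher(images)            # assumed (B, N, D_t), frozen
    loss_ssl = 1.0 - F.cosine_similarity(proj_dino(z), target, dim=-1).mean()

    # 3) CLIP-style image-text contrastive loss injects high-level semantics.
    img_emb = F.normalize(proj_clip(z.mean(dim=1)), dim=-1)   # pooled image embedding
    txt_emb = F.normalize(clip_text_encoder(text_tokens), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(images.size(0), device=images.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    return w_rec * loss_rec + w_ssl * loss_ssl + w_con * loss_con
```

The weighting between the three terms is the knob the article's framing points at: a heavier reconstruction weight pushes the latent space toward pixel memorization, while the semantic terms pull it toward representations that downstream generators can exploit.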
NeurIPS 2025 | VFMTok: The Era of Tokenizers Driven by Visual Foundation Models Arrives
机器之心· 2025-10-28 09:37
Core Insights
- The article discusses using frozen Visual Foundation Models (VFMs) as effective visual tokenizers for autoregressive image generation, highlighting their ability to improve both image reconstruction and generation [3][11][31].

Group 1: Traditional Visual Tokenizers
- Traditional visual tokenizers such as VQGAN must be trained from scratch, producing a latent space that lacks high-level semantic information and contains heavy redundancy [4][7].
- The latent space of such models is poorly organized, leading to longer training times and a reliance on extra techniques such as Classifier-Free Guidance (CFG) for high-fidelity image generation [7][12].

Group 2: Visual Foundation Models (VFMs)
- Pre-trained VFMs such as CLIP, DINOv2, and SigLIP2 excel at extracting rich, semantically meaningful, and generalizable visual features, and have mainly been used for image-understanding tasks [4][11].
- The research team's hypothesis is that the latent features from these VFMs can also serve image reconstruction and generation tasks [4][10].

Group 3: VFMTok Architecture
- VFMTok builds a high-quality visual tokenizer on top of frozen VFMs, using multi-level feature extraction to capture both low-level details and high-level semantics [14][17].
- The architecture adds a region-adaptive quantization mechanism that improves token efficiency by focusing on consistent patterns within the image (see the sketch after this list) [18][19].

Group 4: Experimental Findings
- VFMTok outperforms traditional tokenizers on image reconstruction and autoregressive generation, achieving better reconstruction quality with fewer tokens (256) [23][28].
- Autoregressive models trained on VFMTok tokens converge significantly faster than classic baselines such as VQGAN [24][26].

Group 5: CFG-Free Performance
- VFMTok performs consistently with or without CFG, indicating strong semantic consistency in its latent space and enabling high-fidelity class-to-image generation without additional guidance [33].
- The reduced token count yields roughly four times faster inference during generation [33].

Group 6: Future Outlook
- The findings suggest that exploiting the prior knowledge in VFMs is key to building high-quality latent spaces and developing the next generation of tokenizers [32].
- A unified tokenizer that is both semantically rich and efficient across different generative models is highlighted as a future research direction [32].
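As a rough illustration of the core idea, the sketch below wires a frozen VFM backbone into a vector-quantized tokenizer: the backbone's patch features are projected, snapped to a learned codebook to produce discrete tokens for an autoregressive model, and decoded back to pixels. The class name FrozenVFMTokenizer, the plain nearest-neighbour quantizer, and the toy patch decoder are assumptions made for illustration; VFMTok's multi-level feature fusion and region-adaptive quantization are not reproduced here.

```python
# Minimal sketch of a VFM-backed tokenizer (illustrative assumptions, not the paper's architecture).
import torch
import torch.nn as nn

class FrozenVFMTokenizer(nn.Module):
    def __init__(self, vfm_backbone, feat_dim=768, codebook_size=16384, code_dim=256):
        super().__init__()
        self.vfm = vfm_backbone                      # e.g. a DINOv2/CLIP vision encoder (assumed
        for p in self.vfm.parameters():              # to return (B, N, feat_dim) patch features)
            p.requires_grad = False                  # keep the foundation model frozen
        self.proj = nn.Linear(feat_dim, code_dim)    # trainable projection into code space
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoder = nn.Sequential(                # toy pixel decoder: one 16x16 RGB patch per token
            nn.Linear(code_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, 3 * 16 * 16),
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.vfm(images)                 # frozen semantic features, (B, N, feat_dim)
        z = self.proj(feats)                         # (B, N, code_dim)

        # Nearest-neighbour vector quantization against the learned codebook.
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)          # (B*N, codebook_size)
        token_ids = dists.argmin(dim=-1).view(z.size(0), z.size(1))
        z_q = self.codebook(token_ids)
        z_q = z + (z_q - z).detach()                 # straight-through gradient estimator

        patches = self.decoder(z_q)                  # reconstructed patches for a pixel loss
        return token_ids, patches                    # token_ids feed the autoregressive generator
```

Because the encoder is frozen, only the projection, codebook, and decoder are trained, which is the source of the semantic, well-organized latent space the article credits for faster convergence and CFG-free generation.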