NeurIPS 2025 | VFMTok: The Era of Tokenizers Driven by Visual Foundation Models Has Arrived
机器之心 · 2025-10-28 09:37
Core Insights
- The article presents frozen Visual Foundation Models (VFMs) as effective visual tokenizers for autoregressive image generation, showing that they improve both image reconstruction and generation [3][11][31].

Group 1: Traditional Visual Tokenizers
- Traditional visual tokenizers such as VQGAN must be trained from scratch, yielding a latent space that lacks high-level semantic information and is highly redundant [4][7]. (A minimal sketch of the vector-quantization step underlying such tokenizers follows this summary.)
- Because that latent space is poorly organized, autoregressive models trained on it converge slowly and require extra techniques such as Classifier-Free Guidance (CFG) to reach high-fidelity image generation [7][12].

Group 2: Visual Foundation Models (VFMs)
- Pre-trained VFMs such as CLIP, DINOv2, and SigLIP2 excel at extracting rich, semantically meaningful, and generalizable visual features, but have mainly been used for image-understanding tasks [4][11].
- The research team's hypothesis is that the latent features of these frozen models can also serve image reconstruction and generation [4][10].

Group 3: VFMTok Architecture
- VFMTok builds a high-quality visual tokenizer on top of a frozen VFM, using multi-level feature extraction so that tokens capture both low-level detail and high-level semantics [14][17]. (A hedged feature-extraction sketch appears after this summary.)
- A region-adaptive quantization mechanism improves token efficiency by allocating tokens to regions with consistent patterns rather than to a uniform grid [18][19].

Group 4: Experimental Findings
- VFMTok outperforms traditional tokenizers on both image reconstruction and autoregressive generation, reaching better reconstruction quality with fewer tokens (256) [23][28].
- Autoregressive models trained on VFMTok tokens converge significantly faster than classic baselines such as VQGAN [24][26].

Group 5: CFG-Free Performance
- VFMTok performs consistently with or without CFG, indicating strong semantic organization of its latent space and enabling high-fidelity class-to-image generation without extra guidance [33]. (The CFG sketch at the end of this piece illustrates what that extra guidance pass costs.)
- The reduced token count also yields roughly 4x faster inference during generation [33].

Group 6: Future Outlook
- The findings suggest that the prior knowledge embedded in VFMs is key to constructing high-quality latent spaces and to developing the next generation of tokenizers [32].
- A unified tokenizer that is semantically rich and efficient across different generative models is highlighted as a future research direction [32].
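To make Group 1 concrete, here is a minimal sketch of the vector-quantization step at the heart of VQGAN-style tokenizers: each continuous encoder feature is snapped to its nearest entry in a learned codebook, and the entry's index becomes the discrete token. This is generic VQ, not VFMTok's code; the codebook size and feature dimension below are illustrative assumptions.

```python
import torch

def vq_lookup(features: torch.Tensor, codebook: torch.Tensor):
    """features: (N, D) encoder outputs; codebook: (K, D) learned entries.
    Returns quantized features and the discrete token indices."""
    # L2 distance from every feature to every codebook entry.
    dists = torch.cdist(features, codebook)   # (N, K)
    indices = dists.argmin(dim=1)             # (N,) discrete token ids
    quantized = codebook[indices]             # (N, D) snapped features
    # Straight-through estimator so gradients still reach the encoder.
    quantized = features + (quantized - features).detach()
    return quantized, indices

# Example: 256 tokens of dimension 16 against a 1024-entry codebook.
feats = torch.randn(256, 16)
book = torch.randn(1024, 16)
q, idx = vq_lookup(feats, book)
```

When such a codebook is trained from scratch on a pixel reconstruction loss alone, nothing forces nearby codes to share semantics, which is the "chaotic latent space" problem the article attributes to traditional tokenizers.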
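For Groups 2 and 3, the following is a hedged sketch of pulling multi-level features from a frozen VFM, here DINOv2 loaded via torch.hub. Which layers VFMTok actually taps, and how it fuses them, is the paper's design; the model variant and the number of layers below are illustrative assumptions.

```python
import torch

# Load a frozen foundation model; no fine-tuning of its weights.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

image = torch.randn(1, 3, 224, 224)  # placeholder input batch
with torch.no_grad():
    # get_intermediate_layers returns patch tokens from the last n blocks;
    # mixing shallower and deeper blocks is the multi-level idea: earlier
    # blocks keep low-level detail, later blocks carry high-level semantics.
    feats = model.get_intermediate_layers(image, n=4)

for i, f in enumerate(feats):
    print(i, f.shape)  # each: (1, num_patches, embed_dim)
```

Because the VFM stays frozen, the tokenizer only has to learn the quantizer and decoder on top of features that are already semantically organized, which is consistent with the faster convergence the article reports.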
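Finally, to make the CFG-free claim in Group 5 concrete, here is a minimal sketch of classifier-free guidance at sampling time for a class-conditional autoregressive model: every generation step runs two forward passes, one conditional and one with a null label, blended by a guidance scale. VFMTok's reported result is that its latent space makes this second pass unnecessary. The function and variable names are illustrative, not the paper's API.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               scale: float = 3.0) -> torch.Tensor:
    """Standard CFG blend: push conditional logits away from unconditional."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# With CFG: two model passes per generated token (roughly doubles compute).
cond = torch.randn(1, 1024)    # logits given the class label
uncond = torch.randn(1, 1024)  # logits given a dropped / null label
guided = cfg_logits(cond, uncond)

# CFG-free (what the article reports for VFMTok): one pass per token, which,
# combined with the shorter 256-token sequences, underlies the ~4x speedup.
```

Dropping the unconditional pass halves per-step compute, and since autoregressive decoding cost also grows with sequence length, the shorter token sequence compounds the saving.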