Core Insights
- The article discusses RAE (Diffusion Transformers with Representation Autoencoders) and VFM-VAE, work from Xi'an Jiaotong University and Microsoft Research Asia that uses frozen pre-trained visual representations to improve the image-generation performance of diffusion models [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling of a VAE with the RAE idea, systematically studying how compressed pre-trained visual representations affect the structure and performance of latent diffusion model (LDM) systems [2][6].
- Using a frozen vision foundation model (VFM) as the Tokenizer markedly accelerates convergence and improves generation quality, marking a shift from pixel compression to semantic representation [2][6]. (A minimal architecture sketch of this pattern follows this summary.)

Group 2: Performance Analysis
- Experiments indicate that distillation-based Tokenizers lose semantic alignment under perturbations, whereas keeping the latent space highly consistent with the VFM's features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE delivers better quality and training efficiency: a gFID of 3.80 on ImageNet 256×256 versus 5.14 for the distillation route, and a gFID of 2.22 with explicit alignment in only 80 epochs, roughly a 10× improvement in training efficiency [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify how consistent the latent space is with the VFM's features, which is essential for assessing the impact on downstream generation performance [7][19]. (A simplified proxy for this metric is sketched below.)
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based Tokenizers, indicating a more stable alignment between the latent space and the VFM features [19][21].

Group 4: Future Directions
- The article concludes by pointing to further exploration of the latent space for multimodal generation and complex visual understanding, continuing the transition from pixel compression to semantic representation [29].
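To make the Group 1 design concrete, here is a minimal PyTorch sketch of the general pattern described there: a frozen vision foundation model (e.g., a DINOv2-style ViT, an assumption of this sketch) serves as the Tokenizer's encoder, a lightweight head maps its features to a Gaussian latent (the VAE part), and a small decoder reconstructs pixels. Module names, dimensions, and loss weighting are illustrative assumptions, not the released VFM-VAE architecture.

```python
import torch
import torch.nn as nn

class VFMVAESketch(nn.Module):
    """Illustrative sketch: a frozen vision foundation model (VFM) acts as the
    tokenizer encoder, followed by a VAE-style probabilistic latent and a toy
    pixel decoder. Dimensions and modules are assumptions, not the paper's."""

    def __init__(self, frozen_vfm: nn.Module, feat_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.vfm = frozen_vfm.eval()
        for p in self.vfm.parameters():           # freeze the pre-trained encoder
            p.requires_grad_(False)
        # lightweight head: compress VFM features into a Gaussian latent (mu, logvar)
        self.to_latent = nn.Linear(feat_dim, 2 * latent_dim)
        # toy decoder: one latent token -> one 16x16 RGB patch (a real decoder is far larger)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.GELU(),
            nn.Linear(512, 16 * 16 * 3),
        )

    def forward(self, images: torch.Tensor):
        with torch.no_grad():                      # semantic features from the frozen VFM
            feats = self.vfm(images)               # assumed layout: (B, N_tokens, feat_dim)
        mu, logvar = self.to_latent(feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon_patches = self.decoder(z)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_patches, z, kl
```

Under this sketch, only the head and decoder are trained (reconstruction loss plus a small KL weight); the diffusion model is then trained in the latent space `z`, which is what allows the frozen semantic features to speed up convergence.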
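Group 3's SE-CKNNA metric measures how consistently the Tokenizer's latent space preserves the neighborhood structure of the frozen VFM's features. The article does not spell out the exact formula, so the sketch below computes a simpler mutual k-nearest-neighbor overlap between the two feature spaces, which captures the same intuition; the function names and the choice k=10 are assumptions, not the SE-CKNNA definition.

```python
import torch
import torch.nn.functional as F

def knn_indices(x: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbors by cosine similarity, excluding each point itself."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-float("inf"))              # a point is not its own neighbor
    return sim.topk(k, dim=-1).indices             # (N, k)

def knn_alignment(latents: torch.Tensor, vfm_feats: torch.Tensor, k: int = 10) -> float:
    """Average overlap of k-NN sets computed in latent space vs. VFM feature space.
    1.0 = identical neighborhoods, 0.0 = no overlap. A simplified proxy for the
    CKNNA-style consistency tracked in the article, not the exact SE-CKNNA formula."""
    nn_lat = knn_indices(latents.flatten(1), k)
    nn_vfm = knn_indices(vfm_feats.flatten(1), k)
    overlap = 0.0
    for a, b in zip(nn_lat, nn_vfm):
        overlap += len(set(a.tolist()) & set(b.tolist())) / k
    return overlap / len(nn_lat)

# usage: higher scores mean the latent space stays aligned with the frozen VFM features
# score = knn_alignment(tokenizer_latents, frozen_vfm_features, k=10)
```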
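Group 2 credits "explicit alignment" for the gFID of 2.22 reached in 80 epochs. The article's exact objective is not reproduced in this summary; a common way to impose such alignment is a cosine-similarity regularizer that pulls a projection of the latent back toward the frozen VFM features during training, as sketched below. The projection head and loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVFMAlignment(nn.Module):
    """Illustrative explicit-alignment regularizer: project latent tokens and
    maximize cosine similarity with the frozen VFM features of the same image."""

    def __init__(self, latent_dim: int = 32, feat_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(latent_dim, feat_dim)        # small trainable projection head

    def forward(self, z: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        pred = F.normalize(self.proj(z), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)   # frozen features receive no gradient
        return 1.0 - (pred * target).sum(dim=-1).mean()    # 0 when perfectly aligned

# usage (hypothetical weights): total = recon_loss + beta * kl + lambda_align * align(z, vfm_feats)
```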
RAE+VAE? Pre-trained representations empower the diffusion-model Tokenizer, accelerating the move from pixel compression to semantic extraction
机器之心·2025-11-13 10:03