Core Insights
- The article discusses the limitations of traditional Variational Autoencoders (VAEs) for training diffusion models, highlighting low representation quality and efficiency [2][4][8]
- A new framework, SVG (Self-supervised representation for Visual Generation), is proposed, integrating a pre-trained visual feature encoder to improve representation quality and efficiency [3][12]

Limitations of Traditional VAEs
- The VAE latent space suffers from semantic entanglement, leading to inefficiency in both training and inference [4][6]
- Entangled features force the diffusion model to take more training steps to learn the data distribution, slowing convergence [6][8]

SVG Framework
- SVG combines a frozen DINOv3 encoder, a lightweight residual encoder, and a decoder to build a unified feature space with strong semantic structure and detail recovery [12][13]
- The framework supports training diffusion models directly in the high-dimensional SVG feature space, which proves stable and efficient [16][22]

Performance Metrics
- SVG-XL surpasses traditional models in generation quality and efficiency, reaching a gFID of 6.57 in just 80 epochs versus SiT-XL's 1400 epochs [18][22]
- The model also shows superior few-step inference, with a gFID of 12.26 at 5 sampling steps [22]

Multi-task Generalization
- The SVG latent space inherits DINOv3's beneficial properties, making it suitable for tasks such as classification and segmentation without additional fine-tuning [23][24]
- The unified feature space improves adaptability across multiple visual tasks [24]

Qualitative Analysis
- SVG exhibits smooth interpolation and editability, producing better intermediate results than a traditional VAE during linear interpolation [26][30]

Conclusion
- SVG's core value lies in combining self-supervised features with residual details, demonstrating that a single latent space can be shared across generation, understanding, and perception [28]
- This approach addresses the efficiency and generalization issues of traditional LDMs and offers new directions for future visual model development [28]
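The SVG design described above can be illustrated with a minimal sketch. This is not the authors' code: all function names are hypothetical, and the frozen DINOv3 backbone and lightweight residual branch are stood in for by fixed and simple NumPy transforms. It only shows the shape of the idea, i.e. a frozen semantic encoder fused with a small residual encoder into one unified latent.

```python
import numpy as np

def frozen_semantic_encoder(image):
    # Stand-in for a frozen DINOv3 backbone: a fixed random projection
    # to semantic features (the fixed seed mimics frozen weights).
    rng = np.random.default_rng(0)
    W = rng.standard_normal((image.size, 64))
    return image.reshape(-1) @ W

def residual_encoder(image):
    # Stand-in for the lightweight trainable branch that recovers
    # fine detail; reduced here to a mean-removed slice of pixels.
    x = image.reshape(-1)
    return (x - x.mean())[:16]

def svg_encode(image):
    # Unified feature space: concatenate semantic and residual parts,
    # giving the diffusion model one latent with both structure and detail.
    sem = frozen_semantic_encoder(image)
    res = residual_encoder(image)
    return np.concatenate([sem, res])

image = np.ones((8, 8))
latent = svg_encode(image)
print(latent.shape)  # → (80,)
```

A diffusion model would then be trained directly on latents like this one; because the semantic part is frozen, the same latent can also be reused for downstream tasks such as classification or segmentation.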
Diffusion models without VAE! Tsinghua & Kling team "collides" with Saining Xie's team's "RAE"
机器之心 (Machine Heart) · 2025-10-23 05:09