Core Insights
- The article discusses the rapid development of image-generation technology based on diffusion models, highlighting the limitations of the Variational Autoencoder (VAE) and introducing the EPG framework as a solution [1][19].

Training Efficiency and Generation Quality
- EPG demonstrates significant improvements in training efficiency and generation quality, achieving FID scores of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with only 75 model forward computations [3][19].
- Compared with mainstream VAE-based models such as DiT and SiT, EPG requires far less training time: 57 hours of pre-training and 139 hours of fine-tuning, versus 160 and 506 hours for DiT [7].

Consistency Model Training
- EPG successfully trains a consistency model directly in pixel space, without relying on a VAE or pre-trained diffusion-model weights, achieving an FID of 8.82 on ImageNet-256 [5][19].

Training Complexity and Costs
- The VAE is difficult to train because it must balance compression rate against reconstruction quality [6].
- Fine-tuning costs are high when adapting to new domains: if the pre-trained VAE performs poorly, the entire model must be retrained, increasing development time and cost [6].

Two-Stage Training Method
- EPG employs a two-stage training method: self-supervised pre-training (SSL pre-training) followed by end-to-end fine-tuning, decoupling representation learning from pixel reconstruction [8][19].
- The first stage extracts high-quality visual features from noisy images using a contrastive loss and a representation-consistency loss [9][19].
- The second stage fine-tunes the pre-trained encoder together with a randomly initialized decoder, simplifying the training process [13][19].
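The first-stage contrastive objective above can be illustrated with a toy InfoNCE loss on features from two views of the same images. This is a minimal numpy sketch of the general technique, not EPG's actual loss; the function name, temperature, and toy feature data are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Toy InfoNCE contrastive loss between two batches of features
    (e.g., features extracted from two noisy views of the same images).
    Rows of z1 and z2 at the same index are treated as positive pairs."""
    # L2-normalize features so the similarity matrix holds cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: each z1[i] should match z2[i]
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
noise = 0.01 * rng.normal(size=(8, 16))
aligned = info_nce_loss(feats, feats + noise)              # matching views
rolled = info_nce_loss(feats, np.roll(feats + noise, 1, axis=0))  # misaligned
print(aligned < rolled)  # → True: correctly paired views score a lower loss
```

The loss falls when features of the same image under different corruptions agree and features of different images stay apart, which is the sense in which stage one learns representations without any pixel reconstruction.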
Performance and Scalability
- EPG's training pipeline resembles a classic image-classification setup, significantly lowering the barrier to developing and deploying downstream generation tasks [14][19].
- Inference with EPG-trained diffusion models is efficient, requiring only 75 forward computations to reach its best results, demonstrating strong scalability [18].

Conclusion
- The EPG framework provides a new, efficient, VAE-independent approach to training pixel-space generative models, achieving superior training efficiency and generation quality [19].
- EPG's "de-VAE" paradigm is expected to drive further exploration and application in generative AI, lowering development barriers and fostering innovation [19].
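The "75 forward computations" figure refers to the number of network function evaluations (NFE) at sampling time. As a minimal sketch, an Euler-style ODE sampler spends exactly one forward pass per step, so 75 steps means 75 evaluations; the `toy_model` velocity field and the specific integration scheme below are illustrative assumptions, not EPG's actual sampler.

```python
import numpy as np

def euler_sample(model, x_init, num_steps=75):
    """Generic Euler ODE sampler: integrates dx/dt = model(x, t) over t in [0, 1].
    Each step costs one model forward pass, so num_steps equals the NFE."""
    x = x_init
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * model(x, t)   # one forward computation per step
    return x

calls = 0
def toy_model(x, t):
    """Stand-in velocity field (pulls x toward zero); counts its invocations."""
    global calls
    calls += 1
    return -x

x_final = euler_sample(toy_model, np.ones(4), num_steps=75)
print(calls)  # → 75
```

Counting evaluations rather than wall-clock steps is the standard way such sampling budgets are reported, since the network forward pass dominates inference cost.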
The world has long suffered under VAE: Alibaba's Amap proposes a pixel-space generative model training paradigm that ends VAE dependence entirely
量子位 (QbitAI) · 2025-10-29 02:39
