Ditching the VAE: Can Pretrained Semantic Encoders Take Diffusion Further?
机器之心 · 2025-11-02 01:30

Group 1
- The article examines the limitations of the Variational Autoencoder (VAE) in the diffusion-model paradigm and asks whether pretrained semantic encoders can take diffusion further [1][7][8]
- Replacing the VAE with pretrained semantic encoders such as DINO and MAE targets three recurring problems: semantic entanglement in the latent space, the VAE's computational cost, and the disconnect between generative and perceptual tasks [9][10][11]
- RAE and SVG are two approaches that prioritize semantic representation over compression, leveraging the strong priors of pretrained visual models to improve both efficiency and generative quality [10][11]; a sketch of this encoder-swap recipe appears after these lists

Group 2
- Generation is moving from static images toward more complex multimodal content, and the article argues that the traditional VAE + diffusion framework is becoming a bottleneck for next-generation generative models [8][9]
- The VAE's computational burden is substantial: in Stable Diffusion 2.1, the VAE encoder alone requires 135.59 GFLOPs, exceeding the 86.37 GFLOPs of a pass through the core diffusion U-Net [8][9]; a profiling sketch appears below
- The article also discusses the "lazy and rich" business principle in the AI era, suggesting that value is shifting from knowledge storage to the "anti-consensus" thinking of human experts [3]
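To make the Group 1 idea concrete, here is a minimal PyTorch sketch of the encoder-swap recipe: a frozen pretrained semantic encoder (DINOv2 loaded via torch.hub) stands in for the VAE encoder, and a small denoiser is trained to predict noise directly on its patch features. This is an illustrative sketch of the general idea, not the actual RAE or SVG implementation; the `LatentDenoiser` architecture, the linear noise schedule, and the pixel decoder (omitted here) are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen pretrained semantic encoder in place of the VAE encoder.
# DINOv2 via torch.hub; any frozen ViT-style encoder (e.g. MAE) slots in the same way.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval().requires_grad_(False)

class LatentDenoiser(nn.Module):
    """Toy denoiser over encoder tokens (the actual papers use full DiT backbones)."""
    def __init__(self, dim: int = 384):  # 384 = ViT-S/14 feature width
        super().__init__()
        self.time_embed = nn.Linear(1, dim)
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True),
            num_layers=4,
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # t: (B,) timesteps in [0, 1], embedded and broadcast onto every token.
        h = z_t + self.time_embed(t[:, None, None])
        return self.out(self.net(h))

denoiser = LatentDenoiser()
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def training_step(images: torch.Tensor) -> torch.Tensor:
    """One noise-prediction step in the frozen semantic latent space.

    `images`: (B, 3, H, W), ImageNet-normalized, H and W multiples of 14.
    The linear noise schedule below is a placeholder assumption.
    """
    with torch.no_grad():
        z = encoder.forward_features(images)["x_norm_patchtokens"]  # (B, N, 384)
    t = torch.rand(z.shape[0])
    noise = torch.randn_like(z)
    alpha = (1.0 - t).view(-1, 1, 1)
    z_t = alpha.sqrt() * z + (1.0 - alpha).sqrt() * noise
    loss = F.mse_loss(denoiser(z_t, t), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```

Sampling would run the reverse process in this feature space and then map back to pixels with a separately trained decoder; the appeal of RAE/SVG-style designs is that the frozen encoder already provides semantically organized latents, so the denoiser does not have to relearn perception from scratch.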
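The Group 2 cost comparison can be checked empirically. Below is a sketch using fvcore's FLOP counter on the SD 2.1 VAE loaded through diffusers; the exact figure depends on input resolution and on how multiply-adds are counted (fvcore counts one fused multiply-add as one FLOP), so this illustrates the measurement methodology rather than guaranteeing the article's 135.59 GFLOPs number.

```python
import torch
from diffusers import AutoencoderKL
from fvcore.nn import FlopCountAnalysis

# Load only the VAE of Stable Diffusion 2.1 (weights download on first use).
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="vae"
)
vae.eval()

# 768x768 is SD 2.1's native resolution; the reported GFLOPs depend on this choice.
x = torch.randn(1, 3, 768, 768)

with torch.no_grad():
    flops = FlopCountAnalysis(vae.encoder, x)
    print(f"VAE encoder: {flops.total() / 1e9:.2f} GFLOPs")
```

Running the same analysis on the denoising U-Net with a latent-resolution input gives the per-step cost of the diffusion core; the article's point is that for SD 2.1 a single encoder pass can exceed a single U-Net pass, which is precisely the overhead the encoder-swap designs aim to eliminate.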