VAE (Variational Autoencoder)
Is the VAE era over? Xie Saining's team unveils "RAE": representation autoencoders may become the new cornerstone of DiT training
机器之心 · 2025-10-14 08:24
Core Insights
- The article discusses the emergence of RAE (Representation Autoencoders) as a potential replacement for VAE (Variational Autoencoders) in generative models, highlighting work by the research team led by Assistant Professor Xie Saining at New York University [1][2].

Group 1: RAE Development
- RAE pairs frozen, pre-trained representation encoders (such as DINO, SigLIP, or MAE) with trained decoders to replace the traditional VAE, achieving high-quality reconstruction together with a semantically rich latent space [2][6].
- The new structure addresses the limitations of VAE, namely weak representation capability and the high computational cost of SD-VAE [4][13].

Group 2: Performance Metrics
- RAE delivers strong image generation results, reaching an FID of 1.51 at 256×256 resolution without guidance, and 1.13 with guidance at both 256×256 and 512×512 [5][6].
- RAE consistently outperforms SD-VAE in reconstruction quality, with lower (better) rFID scores across the encoder configurations tested [18][20].

Group 3: Training and Architecture
- The work introduces a new DiT (Diffusion Transformer) variant, DiT^DH, which adds a lightweight but wide head to increase model capacity without significantly raising compute cost (a structural sketch follows this summary) [3][34].
- The RAE decoder is trained with a frozen representation encoder and a ViT-based decoder, reaching reconstruction quality comparable to or better than SD-VAE (a training sketch follows this summary) [12][14].

Group 4: Scalability and Efficiency
- DiT^DH converges faster and is more compute-efficient than standard DiT, and keeps this advantage across RAE encoders of different scales [36][40].
- DiT^DH-XL sets a new state-of-the-art FID of 1.13 after 400 epochs, outperforming previous models while requiring significantly less compute [41][43].

Group 5: Noise Management Techniques
- The work proposes noise-enhanced decoding, which trains the decoder on perturbed latents so that it stays robust to the out-of-distribution latents produced at sampling time, improving overall generation quality (sketched below) [29][30].
- Shifting the noise schedule according to the effective dimensionality of the RAE latents significantly improves training, showing that high-dimensional latent spaces need tailored noise strategies (an illustrative shift rule is sketched below) [28].
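To make the training scheme in Group 3 concrete, here is a minimal sketch of RAE-style decoder training: a frozen pre-trained encoder produces patch tokens, and a trainable ViT-style decoder maps them back to pixels under a plain reconstruction loss. The timm model name, the `ViTDecoder` module, and the MSE-only objective are illustrative assumptions; the paper's actual decoder and losses (which may include perceptual or adversarial terms) can differ.

```python
# Minimal sketch of RAE-style decoder training: frozen DINO encoder (via timm),
# trainable ViT-style decoder, plain MSE reconstruction loss. All names and
# hyperparameters are placeholders, not the paper's exact recipe.
import torch
import torch.nn as nn
import timm


class ViTDecoder(nn.Module):
    """Hypothetical ViT-style decoder: maps patch tokens back to pixels."""
    def __init__(self, token_dim=768, patch=16, img_size=224, depth=4):
        super().__init__()
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(token_dim, nhead=12, batch_first=True)
            for _ in range(depth)
        ])
        self.to_pixels = nn.Linear(token_dim, patch * patch * 3)
        self.patch, self.img_size = patch, img_size

    def forward(self, tokens):
        x = self.blocks(tokens)                        # (B, N, D)
        x = self.to_pixels(x)                          # (B, N, p*p*3)
        B, N, _ = x.shape
        g = self.img_size // self.patch                # patch grid size
        x = x.view(B, g, g, self.patch, self.patch, 3)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, self.img_size, self.img_size)
        return x


# Frozen representation encoder (DINO ViT-B/16 from timm, as an example).
encoder = timm.create_model("vit_base_patch16_224.dino", pretrained=True)
encoder.eval().requires_grad_(False)
decoder = ViTDecoder()
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)


def train_step(images):
    """One decoder update; `images` are normalized 224x224 RGB batches."""
    with torch.no_grad():
        tokens = encoder.forward_features(images)[:, 1:, :]  # drop CLS token
    recon = decoder(tokens)
    loss = nn.functional.mse_loss(recon, images)  # paper may add perceptual/GAN terms
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```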
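Group 5's noise-enhanced decoding can be layered on top of the same decoder training loop: the frozen-encoder latents are perturbed with Gaussian noise during training while the reconstruction target stays the clean image, so the decoder tolerates the slightly off-distribution latents a diffusion sampler produces. The noise-scale range and the uniform sampling of sigma below are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of noise-enhanced decoding: decode from perturbed latents,
# reconstruct the clean image. Noise-scale range is an illustrative assumption.
import torch


def noisy_decoder_step(images, encoder, decoder, opt, max_sigma=0.3):
    with torch.no_grad():
        z = encoder.forward_features(images)[:, 1:, :]   # clean latents (B, N, D)
    sigma = torch.rand(z.size(0), 1, 1, device=z.device) * max_sigma
    z_noisy = z + sigma * torch.randn_like(z)            # perturbed latents
    recon = decoder(z_noisy)                             # decode from noisy z
    loss = torch.nn.functional.mse_loss(recon, images)   # target is the clean image
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```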
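The DiT^DH variant from Group 3 is described only at a high level in the article; the sketch below illustrates the general idea of attaching a shallow but wide transformer head to a deep, narrower DiT trunk. All widths and depths are placeholders, and DiT-specific conditioning (timestep embeddings, adaLN, class labels) is omitted, so this should be read as a shape-level illustration rather than the paper's architecture.

```python
# Shape-level sketch of a "deep trunk + shallow wide head" diffusion transformer.
# Plain TransformerEncoderLayer blocks stand in for DiT blocks; conditioning is
# omitted. Widths/depths are placeholders, not the DiT^DH configuration.
import torch
import torch.nn as nn


def blocks(depth, dim, heads):
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(dim, nhead=heads, dim_feedforward=4 * dim,
                                   batch_first=True)
        for _ in range(depth)
    ])


class DiTWithWideHead(nn.Module):
    def __init__(self, token_dim=768, trunk_dim=1024, head_dim=2048,
                 trunk_depth=12, head_depth=2):
        super().__init__()
        self.embed = nn.Linear(token_dim, trunk_dim)
        self.trunk = blocks(trunk_depth, trunk_dim, heads=16)   # deep, narrower
        self.widen = nn.Linear(trunk_dim, head_dim)
        self.head = blocks(head_depth, head_dim, heads=16)      # shallow, wide
        self.out = nn.Linear(head_dim, token_dim)                # predicts noise/velocity

    def forward(self, noisy_latents):
        x = self.embed(noisy_latents)   # (B, N, trunk_dim)
        x = self.trunk(x)
        x = self.widen(x)
        x = self.head(x)                # wide head adds capacity with few extra layers
        return self.out(x)
```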
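For the noise-schedule adjustment in Group 5, one plausible form is a timestep shift whose strength grows with the effective dimensionality of the latent space, analogous to the resolution-dependent shift used in SD3-style rectified-flow training. The shift factor alpha = sqrt(d_new / d_base) and the reference dimensionality below are assumptions for illustration; the paper's exact rule may differ.

```python
# Hedged sketch of a dimension-dependent timestep shift: higher-dimensional
# latents get pushed toward higher-noise timesteps during training.
import torch


def shift_timesteps(t, d_new, d_base=4 * 32 * 32):
    """Map uniformly sampled t in [0, 1] to a shifted schedule.

    d_base: reference latent dimensionality (e.g. an SD-VAE-style 32x32x4 latent).
    d_new:  effective dimensionality of the RAE latent (tokens x channels).
    Both the alpha rule and d_base are illustrative assumptions.
    """
    alpha = (d_new / d_base) ** 0.5
    return alpha * t / (1.0 + (alpha - 1.0) * t)


# Example: 256 tokens of width 768 vs. the reference 32x32x4 latent.
t = torch.rand(8)
t_shifted = shift_timesteps(t, d_new=256 * 768)
```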