RAE (Representation Autoencoders)
The end of the VAE era? Saining Xie's team unveils "RAE": representation autoencoders may become the new foundation for DiT training
机器之心· 2025-10-14 08:24
Core Insights
- The article discusses the emergence of RAE (Representation Autoencoders) as a potential replacement for VAEs (Variational Autoencoders) in generative modeling, highlighting work by the research team led by Assistant Professor Saining Xie of New York University [1][2].

Group 1: RAE Development
- RAE pairs frozen pre-trained representation encoders (such as DINO, SigLIP, or MAE) with trained decoders in place of a traditional VAE, achieving high-quality reconstruction together with a semantically rich latent space [2][6].
- The new structure addresses the limitations of VAEs, namely their weak representation capability and the high computational cost of SD-VAE [4][13].

Group 2: Performance Metrics
- RAE delivers strong image-generation results, reaching an FID of 1.51 at 256×256 resolution without guidance, and 1.13 with guidance at both 256×256 and 512×512 [5][6].
- RAE consistently outperforms SD-VAE in reconstruction quality, with rFID scores showing better results across a range of encoder configurations [18][20].

Group 3: Training and Architecture
- The work introduces a new DiT (Diffusion Transformer) variant, DiT^DH, which adds a lightweight wide head to improve efficiency without significantly increasing compute [3][34] (see the second sketch after this summary).
- The RAE decoder is trained by feeding a frozen representation encoder into a ViT-based decoder, reaching reconstruction quality comparable to or better than SD-VAE [12][14] (see the first sketch after this summary).

Group 4: Scalability and Efficiency
- DiT^DH converges faster and is more compute-efficient than standard DiT, and keeps this advantage across RAE scales [36][40].
- DiT^DH-XL sets a new state-of-the-art FID of 1.13 after 400 epochs, outperforming previous models while requiring significantly less compute [41][43].

Group 5: Noise Management Techniques
- Noise-enhanced decoding makes the decoder robust to the out-of-distribution latents a diffusion sampler produces, improving overall performance [29][30].
- Shifting the noise schedule according to RAE's effective data dimension markedly improves training, showing that high-dimensional latent spaces need tailored noise strategies [28].
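The recipe summarized in Groups 1 and 3 is a two-part setup: freeze a pre-trained representation encoder and train only a ViT-style decoder on a reconstruction objective. Below is a minimal sketch of that idea; the `PatchDecoder` module, its sizes, the plain MSE loss, and loading DINOv2 via torch.hub are illustrative choices, not the paper's exact configuration (the paper's decoder training presumably uses richer reconstruction losses).

```python
# Minimal sketch: frozen pre-trained encoder + trainable ViT-style decoder,
# trained with a plain reconstruction loss. Sizes and modules are illustrative.
import torch
import torch.nn as nn

class PatchDecoder(nn.Module):
    """ViT-style decoder: patch tokens -> RGB pixels (hypothetical module)."""
    def __init__(self, dim=768, depth=4, heads=12, patch=14, img=224):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)
        self.patch, self.img = patch, img

    def forward(self, tokens):                       # (B, N, dim)
        x = self.to_pixels(self.blocks(tokens))      # (B, N, 3*p*p)
        B, N, _ = x.shape
        g = self.img // self.patch                   # tokens per image side
        x = x.view(B, g, g, 3, self.patch, self.patch)
        return x.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, self.img, self.img)

# Frozen representation encoder; DINOv2 via torch.hub is one real option
# (requires network access on first load).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
encoder.eval().requires_grad_(False)

decoder = PatchDecoder()
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

images = torch.randn(2, 3, 224, 224)                 # stand-in batch
with torch.no_grad():
    z = encoder.forward_features(images)["x_norm_patchtokens"]  # (B, 256, 768)
loss = nn.functional.mse_loss(decoder(z), images)    # MSE stands in for the
loss.backward()                                      # paper's reconstruction losses
opt.step()
```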
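The DiT^DH design in Group 3 can be read as: keep a standard deep trunk, then widen only a few final blocks. The sketch below is one plausible rendering of that "lightweight wide head" idea; all dimensions and depths are made up, and the timestep/class conditioning a real DiT uses is omitted for brevity.

```python
# Illustrative sketch of a wide, shallow head on top of a standard-width
# DiT trunk. Not the paper's architecture; conditioning is omitted.
import torch
import torch.nn as nn

def blocks(dim, depth, heads):
    layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

class DiTDH(nn.Module):
    def __init__(self, latent_dim=768, trunk_dim=1152, trunk_depth=28,
                 head_dim=2304, head_depth=2):
        super().__init__()
        self.embed = nn.Linear(latent_dim, trunk_dim)
        self.trunk = blocks(trunk_dim, trunk_depth, 16)  # deep, standard width
        self.widen = nn.Linear(trunk_dim, head_dim)
        self.head = blocks(head_dim, head_depth, 16)     # shallow but wide
        self.out = nn.Linear(head_dim, latent_dim)       # predict velocity/noise

    def forward(self, z_t):                              # (B, N, latent_dim)
        h = self.trunk(self.embed(z_t))
        return self.out(self.head(self.widen(h)))

x = torch.randn(2, 256, 768)                             # RAE-style patch latents
print(DiTDH(trunk_depth=2)(x).shape)                     # torch.Size([2, 256, 768])
```

Because the head is shallow, widening it adds little compute relative to the trunk, which matches the summary's claim of efficiency gains without a significant cost increase.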
Saining Xie's new work: VAE retires, RAE takes its place
量子位· 2025-10-14 08:16
Core Viewpoint
- The era of Variational Autoencoders (VAEs) is coming to an end, with Representation Autoencoders (RAEs) set to take over in the field of diffusion models [1][3].

Summary by Sections

RAE Introduction
- RAE is a new kind of autoencoder designed for training diffusion Transformers (DiT): it pairs pre-trained representation encoders (such as DINO, SigLIP, or MAE) with lightweight decoders, replacing the traditional VAE [3][9].

Advantages of RAE
- RAE delivers high-quality reconstruction and a semantically rich latent space, supports scalable transformer-based architectures, and converges faster without any additional representation-alignment loss [4][10].

Performance Metrics
- At 256×256 resolution, the FID without guidance is 1.51; with guidance it is 1.13 at both 256×256 and 512×512 [6].

Limitations of VAE
- SD-VAE's backbone is outdated and overly complex, costing about 450 GFLOPs versus roughly 22 GFLOPs for a simple ViT-B encoder [7].
- SD-VAE's heavily compressed latent space (only 4 channels) severely limits how much information the latents can carry [7].
- Because a VAE is trained only for reconstruction, its features are weak as representations, which slows convergence and hurts generation quality [7].

RAE's Design and Training
- RAE combines a pre-trained representation encoder with a trained decoder, with no extra training or alignment phase and no auxiliary loss functions [9].
- Despite this simplicity, RAE outperforms SD-VAE in reconstruction quality [10].

Model Comparisons
- RAE variants built on DINOv2-B, SigLIP2-B, and MAE-B all show clear gains in rFID and Top-1 accuracy over SD-VAE [11].

Adjustments for Diffusion Models
- Making RAE work in its high-dimensional latent space requires only simple adjustments: a sufficiently wide DiT, a dimension-aware noise schedule, and noise injection during decoder training (see the sketches after this summary) [13][17].
- A DiT-XL trained on RAE latents surpasses REPA without any auxiliary losses or extra training phases, converging up to 16× faster than SD-VAE-based REPA [18][19].

Scalability and Efficiency
- The new architecture improves DiT's scalability in both training compute and model size, outperforming both standard DiT on RAE and traditional VAE-based methods [24].
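The "noise injection during decoder training" adjustment has a very small footprint in code: corrupt the frozen encoder's latents with Gaussian noise of random magnitude before they reach the decoder, so the decoder learns to tolerate the imperfect latents a sampler will later hand it. In this sketch, the uniform noise-level distribution and the `sigma_max` value are assumptions, not values from the paper.

```python
# Sketch of noise-enhanced decoder training: perturb latents with Gaussian
# noise of random strength. sigma_max is an assumed hyperparameter.
import torch

def noisy_latents(z, sigma_max=0.3):
    # Per-example noise level in [0, sigma_max], broadcast over (B, N, D) tokens.
    sigma = torch.rand(z.size(0), 1, 1, device=z.device) * sigma_max
    return z + sigma * torch.randn_like(z)

# Usage inside the decoder's training step (encoder stays frozen):
#   z = encoder(images)
#   recon = decoder(noisy_latents(z))
#   loss = reconstruction_loss(recon, images)
```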
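Both articles mention shifting the noise schedule to match RAE's much higher effective dimensionality. A known precedent is SD3's timestep shift t' = αt / (1 + (α − 1)t); the sketch below applies that form with α derived from a latent-dimension ratio. This is an assumed concretization: the paper's exact shift rule and the base dimension `d_base` may differ.

```python
# Sketch of a dimensionality-aware noise-schedule shift, borrowing SD3's
# timestep-shift form. The choice of alpha = sqrt(d_latent / d_base) and the
# d_base default are assumptions for illustration.
import math
import torch

def shift_timesteps(t, d_latent, d_base=16 * 16 * 4):
    """t in [0, 1]; d_base is, e.g., an SD-VAE-like per-image latent size."""
    a = math.sqrt(d_latent / d_base)
    return a * t / (1 + (a - 1) * t)   # maps [0,1] -> [0,1], biased toward noise

t = torch.rand(4)
print(shift_timesteps(t, d_latent=256 * 768))  # e.g., DINOv2-B: 256 tokens x 768 dims
```

Intuitively, a larger α pushes more of training toward high-noise timesteps, which is the stated motivation: high-dimensional latents retain recoverable signal at noise levels that would destroy a 4-channel VAE latent.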