RAE+VAE? Pre-trained Representations Power Diffusion-Model Tokenizers, Accelerating the Shift from Pixel Compression to Semantic Extraction
机器之心· 2025-11-13 10:03
Core Insights
- The article discusses RAE (Representation Autoencoders, from the paper "Diffusion Transformers with Representation Autoencoders") and VFM-VAE, introduced by Xi'an Jiaotong University and Microsoft Research Asia, which use frozen pre-trained visual representations to improve the image-generation performance of diffusion models [2][6][28].

Group 1: VFM-VAE Overview
- VFM-VAE combines the probabilistic modeling mechanism of VAE with RAE, systematically studying how compressed pre-trained visual representations affect the structure and performance of LDM systems [2][6].
- Integrating frozen visual foundation models as tokenizers in VFM-VAE significantly accelerates model convergence and improves generation quality, marking an evolution from pixel compression to semantic representation [2][6]; a minimal sketch of this frozen-encoder design follows this summary.

Group 2: Performance Analysis
- Experimental results indicate that distillation-based tokenizers struggle to preserve semantic alignment under perturbations, while keeping the latent space highly consistent with the foundation model's features is crucial for robustness and convergence efficiency [8][19].
- VFM-VAE demonstrates superior performance and training efficiency, achieving a gFID of 3.80 on ImageNet 256×256 versus 5.14 for the distillation route, and reaching a gFID of 2.22 with explicit alignment in just 80 epochs, an approximately 10× improvement in training efficiency [23][24].

Group 3: Semantic Representation and Alignment
- The research team introduced the SE-CKNNA metric to quantify the consistency between the latent space and the foundation model's features, which is essential for evaluating the impact on downstream generation performance [7][19].
- VFM-VAE maintains higher average and peak CKNNA scores than distillation-based tokenizers, indicating a more stable alignment between the latent space and the foundation model's features [19][21].

Group 4: Future Directions
- The article concludes with the potential for further exploration of latent spaces in multimodal generation and complex visual understanding, aiming to complete the transition from pixel compression to semantic representation [29].
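To make the frozen-encoder tokenizer idea concrete, here is a minimal PyTorch sketch, not the authors' released code: a frozen vision foundation model supplies features, and only a small VAE head (mean and log-variance projections) plus a pixel decoder are trained on top. The class name, feature dimensions, and the assumption that the backbone returns a (batch, tokens, channels) tensor are all illustrative.

```python
import torch
import torch.nn as nn

class VFMVAETokenizer(nn.Module):
    """Sketch: freeze a pre-trained vision foundation model (VFM) as the
    encoder; train only a small VAE head and a pixel decoder on top."""

    def __init__(self, vfm: nn.Module, feat_dim: int = 768, latent_dim: int = 32):
        super().__init__()
        self.vfm = vfm.eval()                        # frozen foundation model
        for p in self.vfm.parameters():
            p.requires_grad = False
        # VAE head: map frozen features to a diagonal-Gaussian latent
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # trainable decoder: one 16x16 RGB patch per latent token (illustrative)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 16 * 16 * 3),
        )

    def encode(self, images: torch.Tensor):
        with torch.no_grad():                        # encoder stays frozen
            feats = self.vfm(images)                 # assumed (B, N, feat_dim)
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)                       # (B, N, 16*16*3) patch pixels
```

Because the encoder never updates, the latent space inherits the backbone's semantics by construction, which is the property the SE-CKNNA analysis above is designed to measure.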
Ditching the VAE: Can Pre-trained Semantic Encoders Take Diffusion Further?
机器之心· 2025-11-02 01:30
Group 1
- The article discusses the limitations of Variational Autoencoders (VAE) in the diffusion-model paradigm and explores the potential of pre-trained semantic encoders to enhance the diffusion process [1][7][8].
- The shift from VAE to pre-trained semantic encoders like DINO and MAE aims to address semantic entanglement, computational inefficiency, and the disconnect between generative and perceptual tasks [9][10][11].
- RAE and SVG are two approaches that prioritize semantic representation over compression, leveraging the strong priors of pre-trained visual models to improve efficiency and generative quality [10][11].

Group 2
- The article highlights the trend from static image generation toward more complex multimodal content, arguing that the traditional VAE + diffusion framework is becoming a bottleneck for next-generation generative models [8][9].
- The computational burden of the VAE itself is significant: the VAE encoder in Stable Diffusion 2.1 requires 135.59 GFLOPs, exceeding the 86.37 GFLOPs of the core diffusion U-Net [8][9]; a sketch of how to reproduce this kind of FLOPs comparison follows this summary.
- The discussion also touches on the "lazy and rich" business principle in the AI era, suggesting a shift in value from knowledge storage to "anti-consensus" thinking among human experts [3].
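The GFLOPs comparison above is the kind of measurement one can approximate with fvcore's FlopCountAnalysis. The sketch below assumes the stabilityai/stable-diffusion-2-1 checkpoint layout in diffusers and the model's native 768×768 input; fvcore counts fused multiply-adds and skips unsupported ops, so the absolute numbers may not match the figures quoted in the article.

```python
import torch
from fvcore.nn import FlopCountAnalysis
from diffusers import AutoencoderKL, UNet2DConditionModel

MODEL_ID = "stabilityai/stable-diffusion-2-1"   # assumed checkpoint id
vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")

class UNetWrapper(torch.nn.Module):
    """Return a plain tensor so fvcore's tracer can handle the output."""
    def __init__(self, unet: UNet2DConditionModel):
        super().__init__()
        self.unet = unet

    def forward(self, x, t, ctx):
        return self.unet(x, t, encoder_hidden_states=ctx, return_dict=False)[0]

image = torch.randn(1, 3, 768, 768)     # SD 2.1 native resolution
latents = torch.randn(1, 4, 96, 96)     # 8x-downsampled latent grid
t = torch.tensor([10])                  # arbitrary diffusion timestep
ctx = torch.randn(1, 77, 1024)          # OpenCLIP text-embedding shape for SD 2.1

enc_flops = FlopCountAnalysis(vae.encoder, (image,)).total()
unet_flops = FlopCountAnalysis(UNetWrapper(unet), (latents, t, ctx)).total()
print(f"VAE encoder: {enc_flops / 1e9:.2f} GFLOPs")
print(f"U-Net:       {unet_flops / 1e9:.2f} GFLOPs")
```

The point the article draws from such numbers is that the "cheap" tokenizer is not cheap at all once measured per forward pass, which motivates replacing it with features that a pre-trained encoder has already paid for.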
VAE Takes Another Hit: Tsinghua and Kuaishou Unveil the SVG Diffusion Model, with a 6200% Gain in Training Efficiency and 3500% Faster Generation
36Kr· 2025-10-28 07:32
Core Insights
- The article discusses the transition from Variational Autoencoders (VAE) to a new model called SVG, developed by Tsinghua University and Kuaishou's Keling team, which shows significant improvements in training efficiency and generation speed [1][3].

Group 1: Model Comparison
- SVG achieves a 62-fold increase in training efficiency and a 35-fold increase in generation speed compared with traditional VAE-based methods [1].
- The main issue with VAE is semantic entanglement: features from different categories are mixed in the latent space, making both training and generation inefficient [3][5].
- The RAE model focuses solely on generation performance by reusing pre-trained encoders, while SVG targets both generation and multi-task applicability through a dual-branch feature space [5][6].

Group 2: Technical Innovations
- SVG uses the pre-trained DINOv3 model for semantic extraction, which effectively captures high-level semantic information and addresses the semantic-entanglement issue [8].
- A lightweight residual encoder is added alongside DINOv3 to recover the high-frequency details that the semantic branch tends to discard, yielding a more complete feature representation [8].
- A distribution-alignment mechanism matches the output of the residual encoder to the statistics of the DINOv3 semantic features, significantly improving image-generation quality [9]; a sketch of this dual-branch design follows this summary.

Group 3: Performance Metrics
- Ablation results show that removing the distribution-alignment mechanism causes a significant drop in image-generation quality as measured by FID [9].
- On training efficiency, the SVG-XL model reaches an FID of 6.57 after 80 epochs, outperforming the VAE-based SiT-XL, which scores 22.58 [11].
- The SVG feature space can be applied directly to tasks such as image classification and semantic segmentation without fine-tuning, achieving competitive accuracy [13].
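A hedged reading of the dual-branch design described above, as a PyTorch sketch rather than the released SVG code: a frozen DINOv3 backbone produces semantic tokens, a lightweight convolutional residual encoder produces detail tokens, and a simple distribution-alignment step (standardize the residual branch, then rescale it to the semantic branch's statistics) is applied before concatenation. The exact alignment mechanism, the dimensions, and the assumption that both branches yield the same token count are guesses for illustration.

```python
import torch
import torch.nn as nn

class DualBranchFeatures(nn.Module):
    """Sketch: frozen DINOv3 semantics + trainable residual details,
    with the residual branch aligned to the semantic distribution."""

    def __init__(self, dinov3: nn.Module, sem_dim: int = 768, res_dim: int = 64):
        super().__init__()
        self.backbone = dinov3.eval()               # frozen semantic branch
        for p in self.backbone.parameters():
            p.requires_grad = False
        # lightweight residual encoder: 16x16 patches -> detail tokens
        self.residual = nn.Sequential(
            nn.Conv2d(3, res_dim, kernel_size=16, stride=16),
            nn.GELU(),
            nn.Conv2d(res_dim, res_dim, kernel_size=1),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            sem = self.backbone(images)             # assumed (B, N, sem_dim)
        res = self.residual(images)                 # (B, res_dim, H/16, W/16)
        res = res.flatten(2).transpose(1, 2)        # (B, N, res_dim)
        # one simple form of distribution alignment: standardize the residual
        # tokens per channel, then rescale to the semantic branch's global
        # statistics so the combined space stays diffusion-friendly
        res = (res - res.mean(dim=(0, 1))) / (res.std(dim=(0, 1)) + 1e-6)
        res = res * sem.std(dim=(0, 1)).mean() + sem.mean(dim=(0, 1)).mean()
        return torch.cat([sem, res], dim=-1)        # (B, N, sem_dim + res_dim)
```

The ablation cited above (FID degrading sharply without alignment) is consistent with this picture: if the residual tokens live at a very different scale than the semantic tokens, the diffusion model must waste capacity reconciling the two branches.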
VAE Takes Another Hit! Tsinghua and Kuaishou Unveil the SVG Diffusion Model, with a 6200% Gain in Training Efficiency and 3500% Faster Generation
量子位· 2025-10-28 05:12
Core Viewpoint
- The article discusses the transition from Variational Autoencoders (VAE) to new models like SVG, developed by Tsinghua University and Kuaishou, highlighting significant improvements in training efficiency and generation speed and addressing VAE's semantic-entanglement limitation [1][4][10].

Group 1: VAE Limitations and New Approaches
- VAE is being abandoned because of its semantic-entanglement issue: adjusting one feature perturbs others, complicating the generation process [4][8].
- The SVG model achieves a 62-fold improvement in training efficiency and a 35-fold increase in generation speed over traditional methods [3][10].
- The RAE approach focuses solely on generation performance by reusing pre-trained encoders, whereas SVG aims for multi-task versatility by constructing a feature space that integrates semantics and details [11][12].

Group 2: SVG Model Details
- SVG uses the pre-trained DINOv3 model for semantic extraction, cleanly separating features of different categories (e.g., cats vs. dogs) and thereby resolving semantic entanglement [14].
- A lightweight residual encoder captures the high-frequency details that DINOv3 may overlook, ensuring a comprehensive feature representation [14].
- The distribution-alignment mechanism is crucial for preserving the semantic structure while integrating detail features; removing it causes a significant increase in FID [15][16].

Group 3: Performance Metrics
- In experiments, SVG outperformed traditional VAE-based models across metrics, reaching an FID of 6.57 on ImageNet after 80 epochs versus 22.58 for the VAE-based SiT-XL [18].
- With longer training, SVG's FID drops to 1.92 after 1400 epochs, approaching the performance of top-tier generative models [18].
- SVG's feature space is versatile enough to be applied directly to tasks like image classification and semantic segmentation without fine-tuning, achieving 81.8% Top-1 accuracy on ImageNet-1K [22]; a linear-probe sketch of this "no fine-tuning" reuse follows this summary.
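The "no fine-tuning" reuse claim amounts to a linear probe: freeze the feature space and train only a linear classification head on pooled tokens. Below is a minimal sketch of that recipe; the feature module is any frozen token extractor (for example the illustrative DualBranchFeatures sketched after the previous article's summary), and the pooling protocol, dimensions, and hyperparameters are assumptions, not the paper's reported setup.

```python
import torch
import torch.nn as nn

def linear_probe(features: nn.Module, num_classes: int = 1000, dim: int = 832):
    """Sketch of a linear probe on a frozen feature space.
    dim = 832 assumes the illustrative 768 + 64 dual-branch width above."""
    head = nn.Linear(dim, num_classes)          # the only trainable part
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def step(images: torch.Tensor, labels: torch.Tensor) -> float:
        with torch.no_grad():                   # feature space stays frozen
            tokens = features(images)           # (B, N, dim)
        logits = head(tokens.mean(dim=1))       # mean-pool tokens, classify
        loss = loss_fn(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    return step
```

That a linear head alone reaches competitive accuracy is the evidence for the article's claim that the tokenizer's latent space already carries usable semantics, rather than merely compressed pixels.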
"I've Had Enough of the Transformer": Co-creator Llion Jones Says the AI Field Has Ossified and Is Missing the Next Breakthrough
机器之心· 2025-10-25 03:20
Core Viewpoint
- The AI field is experiencing a paradox: increased resources and funding are producing less creativity and innovation, as researchers gravitate toward safe, publishable projects rather than high-risk, transformative ideas [3][11][29].

Group 1: Current State of AI Research
- Llion Jones, CTO of Sakana AI and co-author of the influential paper "Attention Is All You Need," expressed frustration with the field's fixation on the Transformer architecture, suggesting it may be hindering the search for the next major breakthrough [2][5][24].
- Despite unprecedented investment and talent flowing into AI, the field has become narrow-minded, with researchers feeling pressured to compete rather than explore new ideas [3][11][16].
- Jones noted that this environment produces rushed publications and little true scientific exploration, since researchers worry about being "scooped" by competitors [11][16].

Group 2: Historical Context and Comparison
- Jones recalled the organic, pressure-free environment that produced the Transformer, contrasting it with today's competitive atmosphere in which researchers feel compelled to deliver quick results [19][30].
- He emphasized that the freedom to explore ideas without pressure from management was crucial to the Transformer's development, a condition now largely absent [19][22].

Group 3: Proposed Solutions and Future Directions
- To foster innovation, Jones proposes turning up the "exploration dial" and encouraging researchers to share findings openly, even at some competitive cost [21][26].
- At Sakana AI, efforts are under way to recreate a research environment that prioritizes exploration over competition and reduces the pressure to publish [22][30].
- Jones warns that the next significant AI breakthrough may be overlooked if the current focus on incremental improvements continues, and urges a shift toward collaborative exploration [26][31].