Image Generation Technology
The World Has Long Suffered Under VAEs: Alibaba Amap Proposes a Pixel-Space Generative Model Training Paradigm, Completely Abandoning VAE Dependence
量子位 (QbitAI) · 2025-10-29 02:39
Core Insights
- The article reviews the rapid development of diffusion-based image generation, highlights the limitations of the Variational Autoencoder (VAE), and introduces the EPG framework as a solution [1][19].

Training Efficiency and Generation Quality
- EPG delivers significant improvements in training efficiency and generation quality, achieving FID scores of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with only 75 model forward computations [3][19].
- Compared with mainstream VAE-based models such as DiT and SiT, EPG requires far less training time: 57 hours of pre-training and 139 hours of fine-tuning, versus 160 and 506 hours for DiT [7].

Consistency Model Training
- EPG successfully trains a consistency model in pixel space without relying on a VAE or pre-trained diffusion model weights, achieving a FID of 8.82 on ImageNet-256 [5][19].

Training Complexity and Costs
- The VAE is difficult to train because it must balance compression rate against reconstruction quality [6].
- Fine-tuning costs are high when adapting to new domains: if the pre-trained VAE performs poorly, the entire model must be retrained, increasing development time and cost [6].

Two-Stage Training Method
- EPG uses a two-stage method, self-supervised pre-training (SSL pre-training) followed by end-to-end fine-tuning, decoupling representation learning from pixel reconstruction [8][19].
- The first stage extracts high-quality visual features from noisy images using a contrastive loss and a representation-consistency loss [9][19].
- The second stage directly fine-tunes the pre-trained encoder together with a randomly initialized decoder, simplifying the training process [13][19].
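The stage-one objective described above (a contrastive loss plus a representation-consistency loss over noisy views) can be sketched in miniature. This is a hypothetical NumPy illustration, not EPG's actual implementation: the `encode` projection, the noise level, the `temperature`, and the loss weighting are all assumptions, with an InfoNCE-style term standing in for whatever contrastive loss the paper uses.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss: matching rows of z_a and z_b
    are positive pairs; every other row in the batch is a negative."""
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

def consistency_loss(z_clean, z_noisy):
    """Representation-consistency term: features of the noisy view
    should stay close to features of the clean view."""
    return np.mean((l2_normalize(z_clean) - l2_normalize(z_noisy)) ** 2)

# Toy "encoder": a fixed random projection standing in for the real network.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))
encode = lambda x: x @ W

images = rng.normal(size=(8, 64))                      # batch of flattened images
noisy = images + 0.1 * rng.normal(size=images.shape)   # diffusion-style noise

z_clean, z_noisy = encode(images), encode(noisy)
loss = info_nce(z_clean, z_noisy) + consistency_loss(z_clean, z_noisy)
print(float(loss))
```

In a real training loop the encoder would be a deep network updated by gradient descent on this combined loss; the point here is only how the two terms fit together.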
Performance and Scalability
- EPG's training framework resembles a classic image-classification pipeline, significantly lowering the barrier to developing and applying downstream generation tasks [14][19].
- Inference with EPG-trained diffusion models is efficient, requiring only 75 forward computations to reach optimal results, and shows excellent scalability [18].

Conclusion
- The EPG framework provides a new, efficient, VAE-independent approach to training pixel-space generative models, with superior training efficiency and generation quality [19].
- EPG's "de-VAE" paradigm is expected to drive further exploration and application in generative AI, lowering development barriers and fostering innovation [19].
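To make the "75 forward computations" budget concrete, here is a toy fixed-budget sampler, a minimal sketch and not EPG's actual sampler: a 75-step Euler discretization of the probability-flow ODE for a variance-exploding diffusion whose data distribution is a standard Gaussian, so the analytic score replaces a learned network. The `sigma_max` value and schedule are assumptions.

```python
import numpy as np

def sample(n, steps=75, sigma_max=10.0, seed=0):
    """Fixed-budget Euler sampler for the probability-flow ODE of a toy
    variance-exploding diffusion with data distribution N(0, 1).
    Each of the `steps` iterations is one 'forward computation'."""
    rng = np.random.default_rng(seed)
    # Start from the wide noisy prior N(0, 1 + sigma_max^2).
    x = rng.normal(scale=np.sqrt(1.0 + sigma_max**2), size=n)
    sigmas = np.linspace(sigma_max, 0.0, steps + 1)
    for i in range(steps):
        sigma, d_sigma = sigmas[i], sigmas[i + 1] - sigmas[i]
        score = -x / (1.0 + sigma**2)        # analytic score of N(0, 1+sigma^2)
        x = x + (-sigma * score) * d_sigma   # Euler step of the flow ODE
    return x

samples = sample(50_000)
print(float(samples.std()))  # close to 1.0: the data distribution is recovered
```

With a trained model the analytic score would be a network evaluation, which is why the step count directly determines inference cost.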
Alibaba's Image Generation Model Tops HuggingFace: One Sentence "Ages" Musk
36Kr · 2025-08-20 08:34
Core Insights
- Alibaba has launched Qwen-Image, a foundational image generation model designed to tackle complex text rendering and precise image editing through systematic data engineering and advanced training paradigms [1][4].
- The model aims to improve understanding and alignment of complex, multi-dimensional text instructions in image generation tasks, a long-standing challenge in the AI field [3][5].

Data Processing and Model Architecture
- Qwen-Image is built on a comprehensive data-processing system that collects billions of high-quality text-image pairs, emphasizes quality over quantity, and applies a seven-stage filtering pipeline to improve data quality and alignment [5][6].
- The model uses a dual-encoding design, combining high-level semantic features with low-level reconstruction features to balance semantic coherence and visual fidelity during image editing [6][5].

Training and Performance
- Training is progressive, moving from low-resolution to high-resolution images, and incorporates reinforcement learning to optimize the quality of generated results and adherence to instructions [6][5].
- Benchmark tests and human evaluations indicate industry-leading performance in general image generation, complex text rendering, and instruction-based image editing [6].

Comparison with Traditional Tools
- Qwen-Image offers core editing capabilities similar to Photoshop but operates through natural-language instructions rather than manual tools: users describe the edit instead of executing it by hand [25][26].
- Its ability to understand and execute complex instructions, such as adjusting a subject's pose while maintaining visual and semantic consistency, surpasses traditional tools that require manual adjustment [26][27].

User Experience and Accessibility
- Qwen-Image lowers the technical barrier to image editing by letting users express visual intentions in clear language, in contrast to Photoshop's requirement to master complex tools and color theory [28][29].
- While not a direct replacement for Photoshop, Qwen-Image represents a new paradigm in image content creation and editing, serving different user needs and scenarios [29].
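The progressive low-to-high-resolution training mentioned above can be sketched as a simple curriculum. This is a hypothetical illustration only: the resolutions, step fractions, and the `resolution_schedule`/`downsample` helpers are all invented for the sketch, since the article does not disclose Qwen-Image's actual schedule.

```python
import numpy as np

def resolution_schedule(total_steps, resolutions=(256, 512, 1024),
                        fractions=(0.5, 0.3, 0.2)):
    """Hypothetical progressive-resolution curriculum: spend the early
    fraction of training at low resolution, then move up. Returns the
    resolution used at each training step."""
    assert len(fractions) == len(resolutions)
    assert abs(sum(fractions) - 1.0) < 1e-9
    schedule = []
    for res, frac in zip(resolutions, fractions):
        schedule.extend([res] * int(round(total_steps * frac)))
    return schedule[:total_steps]

def downsample(image, target):
    """Nearest-neighbor downsampling of a square image to target px,
    standing in for a real resize used to build low-resolution batches."""
    src = image.shape[0]
    idx = np.arange(target) * src // target
    return image[np.ix_(idx, idx)]

steps = resolution_schedule(10)
img = np.arange(1024 * 1024, dtype=np.float32).reshape(1024, 1024)
batch0 = downsample(img, steps[0])   # early steps train on 256x256 crops
print(steps, batch0.shape)
```

The design idea is that cheap low-resolution steps teach global layout first, so the expensive high-resolution steps only need to refine detail.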
Tencent Hunyuan Image 2 to Be Released on May 16
News flash · 2025-05-15 06:58
Core Viewpoint
- Tencent's new imaging product, Tencent Hunyuan Image 2, is set to launch on May 16 [1]

Group 1
- The launch date for Tencent Hunyuan Image 2 is confirmed as May 16 [1]