Farewell to Transformers! Peking University, BUPT, and Huawei open-source the pure-convolution DiC: SOTA performance from 3x3 convolutions, 5x faster than DiT!
机器之心 (Machine Heart) · 2025-07-11 08:27
Core Viewpoint
- The article introduces DiC (Diffusion CNN), a new convolution-based diffusion model developed by researchers from Peking University, Beijing University of Posts and Telecommunications, and Huawei, which outperforms the popular Diffusion Transformer (DiT) in both generation quality and inference speed [1][5][24]

Group 1: Introduction and Background
- The AI-generated content (AIGC) field has predominantly adopted transformer-based diffusion models, which, while powerful, carry significant computational costs and slow inference speeds [4]
- The researchers challenge the notion that transformer architectures are the only viable path for generative models by returning to the classic 3x3 convolution [5][9]

Group 2: Technical Innovations
- The choice of the 3x3 convolution is justified by its excellent hardware and software support (it is among the most heavily optimized operators in GPU libraries), making it a key operator for achieving high throughput [8]; see the first sketch after this summary
- DiC adopts a U-Net hourglass architecture, which the authors find more effective than the transformer-style stacking of identical blocks, since downsampling lets small 3x3 kernels cover a broader portion of the original image [13]; a second sketch below illustrates this
- A series of conditioning optimizations, including stage-specific embeddings, careful placement of the conditional-information injection points, and conditional gating mechanisms, improves how the model exploits conditional information [14][15]; a gating sketch follows the summary

Group 3: Experimental Results
- DiC achieves markedly better quality metrics than DiT, with an FID of 13.11 and an Inception Score (IS) of 100.15, versus an FID of 20.05 and an IS of 66.74 for DiT-XL/2 (lower FID and higher IS are better) [17][18]
- DiC-XL reaches a throughput of 313.7, nearly five times that of DiT-XL/2, demonstrating its inference-speed advantage [18]
- Under the same training conditions, DiC converges roughly ten times faster than DiT, indicating its potential for rapid training [18][19]

Group 4: Conclusion and Future Outlook
- The emergence of DiC challenges the prevailing belief that generative models must rely on self-attention, demonstrating that simple, efficient convolutional networks can still build powerful generative models [24]
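To make the 3x3-only design concrete, below is a minimal PyTorch sketch of a residual block in which every spatial operator is a 3x3 convolution. This is an illustration, not the released DiC code: the GroupNorm/SiLU choices, the two-convolution layout, and the `ConvBlock` name are assumptions, picked because they are common in diffusion backbones.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual block built entirely from 3x3 convolutions (illustrative,
    not the authors' code)."""
    def __init__(self, channels: int):
        super().__init__()
        # Assumed normalization/activation; channels must be divisible by 32 here.
        self.norm1 = nn.GroupNorm(32, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(self.act(self.norm1(x)))
        h = self.conv2(self.act(self.norm2(h)))
        return x + h  # residual path keeps deep pure-conv stacks trainable

x = torch.randn(1, 64, 32, 32)
print(ConvBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```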
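The U-Net hourglass point can be sketched the same way: strided 3x3 convolutions shrink the feature map, so each subsequent 3x3 kernel covers a larger fraction of the original image, and skip connections restore spatial detail during upsampling. The depth, channel width, activation, and the transposed-convolution upsampler below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Minimal U-Net-style hourglass built from convolutions only.
    Each stride-2 downsampling halves H and W, doubling the effective
    receptive field of later 3x3 kernels."""
    def __init__(self, channels: int = 64, depth: int = 3):
        super().__init__()
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(depth)]
        )
        self.mid = nn.Conv2d(channels, channels, 3, padding=1)
        # A 4x4 transposed conv is one common way to invert a stride-2 conv;
        # the paper's actual up/downsampling choices may differ.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
             for _ in range(depth)]
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for down in self.down:
            skips.append(x)        # stash encoder features for the decoder
            x = self.act(down(x))  # halve H and W
        x = self.act(self.mid(x))
        for up in self.up:
            x = self.act(up(x)) + skips.pop()  # upsample, fuse matching skip
        return x

x = torch.randn(1, 64, 32, 32)
print(Hourglass()(x).shape)  # torch.Size([1, 64, 32, 32])
```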
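Finally, a hedged sketch of the conditional-gating idea: a conditioning vector (for example, a fused timestep and class embedding) is projected to a per-channel gate that scales the residual branch. The zero-initialized projection, the gate placement, and the `cond_dim` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedCondBlock(nn.Module):
    """3x3 conv block whose residual branch is gated by a conditioning
    vector (illustrative sketch, not the DiC implementation)."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        # Zero-initialized projection: the gate starts closed, so the block
        # is an identity mapping at the start of training (assumed detail).
        self.to_gate = nn.Linear(cond_dim, channels)
        nn.init.zeros_(self.to_gate.weight)
        nn.init.zeros_(self.to_gate.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gate = self.to_gate(cond)[:, :, None, None]  # (B, C, 1, 1)
        h = self.conv(self.act(self.norm(x)))        # 3x3 residual branch
        return x + gate * h                          # condition scales the branch

block = GatedCondBlock(channels=64, cond_dim=256)
x = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 256)   # stand-in for a fused timestep/class embedding
print(block(x, cond).shape)  # torch.Size([2, 64, 32, 32])
```

Zero-initializing the gate makes each conditioned block start as an identity mapping, a common trick for stabilizing the early training of residual networks with injected conditioning.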