DiT

Is Diffusion Necessarily Better Placed Than Autoregression to Achieve a Unified Architecture?
机器之心· 2025-08-31 01:30
Group 1
- The article discusses the potential of Diffusion models to achieve a unified architecture in AI, suggesting that they may surpass autoregressive (AR) models in this regard [7][8][9]
- It highlights the importance of multimodal capability in AI development, emphasizing that a unified model is crucial for understanding and generating heterogeneous data types [8][9]
- It notes that while AR architectures have dominated the field, recent breakthroughs of Diffusion Language Models (DLMs) in natural language processing (NLP) are prompting a reevaluation of Diffusion's potential [8][9][10]

Group 2
- Diffusion models support parallel generation and fine-grained control, capabilities that AR models struggle to achieve [9][10]; a minimal sketch of parallel decoding follows this list
- The article outlines the fundamental differences between AR and Diffusion architectures, indicating that Diffusion serves as a powerful compression framework with inherent support for multiple compression modes [11]
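To make the parallel-generation claim concrete, here is a minimal sketch (in PyTorch) of confidence-based parallel decoding of the kind used by masked diffusion language models. The `model`, `mask_id`, and the unmasking schedule are illustrative assumptions, not any specific DLM's API; the point is only that each denoising step predicts all masked positions in a single forward pass, where an AR model needs one pass per token.

```python
import torch

def diffusion_decode(model, x, mask_id, num_steps=8):
    """Sketch of parallel decoding with a hypothetical masked diffusion LM.

    `model` maps token ids [B, L] to logits [B, L, V]; `x` starts fully
    masked. Every step predicts ALL masked positions at once, keeps only
    the most confident predictions, and leaves the rest masked.
    """
    for step in range(num_steps):
        masked = x == mask_id
        if not masked.any():
            break
        logits = model(x)                        # one forward pass for all positions
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)           # per-position confidence and argmax
        conf = conf.masked_fill(~masked, -1.0)   # only compete over masked slots
        # unmask a growing fraction of the remaining masked tokens each step
        k = max(1, int(masked.sum().item() * (step + 1) / num_steps))
        # for simplicity, treat the batch as one flat sequence (fine for B == 1)
        topk = conf.flatten().topk(k).indices
        flat = x.flatten().clone()
        flat[topk] = pred.flatten()[topk]
        x = flat.view_as(x)
    return x

# usage: start from an all-mask canvas, e.g.
# x = torch.full((1, 128), mask_id, dtype=torch.long)
# out = diffusion_decode(model, x, mask_id)
```

Because every iteration commits several tokens at once, the number of model calls is `num_steps` rather than the sequence length, which is the efficiency argument the article makes for Diffusion over AR.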
Is DiT Mathematically and Formally Wrong? Saining Xie Responds: Don't Do Science in Your Head
机器之心· 2025-08-20 04:26
Core Viewpoint
- The article discusses criticisms of the DiT model, highlighting potential architectural flaws and the introduction of a new method called TREAD that significantly improves training efficiency and image generation quality compared to DiT [1][4][6].

Group 1
- A recent post on X claims that DiT has architectural defects, sparking significant discussion [1].
- Applied to the DiT backbone, the TREAD method achieves a 14×/37× training speedup at equal FID, indicating better generation quality per unit of training compute [2][6].
- The post argues that DiT's FID plateaus too early during training, suggesting it may have "latent architectural defects" that prevent it from learning further from the data [4].

Group 2
- TREAD employs a "token routing" mechanism that improves training efficiency without altering the model architecture, setting aside a partial token set so that its information is preserved while computational cost is reduced [6]; a sketch of this routing appears below.
- Saining Xie, co-author of the original DiT paper, acknowledges the criticisms and emphasizes the importance of experimental validation over theoretical assertion [28][33].
- Xie also points out that DiT's architecture has some inherent flaws, particularly its use of post-layer normalization, which is known to be unstable for tasks with large variations in numerical range [13][36]; see the pre-LN/post-LN sketch below.

Group 3
- The article mentions that DiT relies on a simple MLP network to process critical conditioning data, which limits its expressive power [16]; a sketch of this adaLN-style modulation follows the other examples below.
- Xie highlights that the real issue with DiT lies in its SD-VAE component, which is inefficient and has long been overlooked [36].
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, in which existing models are continuously questioned and improved [38].
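A minimal sketch of the token-routing idea described above, assuming the common reading of TREAD: a random subset of tokens bypasses a span of transformer blocks and is reinserted unchanged afterwards, so the skipped blocks process fewer tokens. The names (`routed_forward`, `keep_ratio`, `route_start`, `route_end`) and the single fixed route are illustrative simplifications, not TREAD's actual implementation.

```python
import torch

def routed_forward(blocks, x, keep_ratio=0.5, route_start=2, route_end=10):
    """Illustrative token routing during training (TREAD-style, simplified).

    Assumes 0 <= route_start < route_end < len(blocks). Tokens in the
    routed set skip blocks [route_start, route_end); their states are
    saved and reinserted unchanged, preserving their information while
    the middle blocks compute over a smaller token set.
    """
    B, L, D = x.shape
    n_keep = int(L * keep_ratio)
    perm = torch.randperm(L, device=x.device)
    kept_idx, routed_idx = perm[:n_keep], perm[n_keep:]
    saved = None

    for i, block in enumerate(blocks):
        if i == route_start:
            saved = x[:, routed_idx]       # stash routed tokens untouched
            x = x[:, kept_idx]             # middle blocks see only kept tokens
        if i == route_end:
            full = x.new_empty(B, L, D)
            full[:, kept_idx] = x
            full[:, routed_idx] = saved    # reinsert saved tokens unchanged
            x = full
        x = block(x)
    return x
```

With `keep_ratio=0.5`, the routed span runs attention and MLPs over half the tokens, which is where the reported training-speed gains would come from under this reading.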
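On the post-layer-normalization point: the contrast between post-LN (normalize after the residual add, as in the original Transformer) and pre-LN (normalize before each sublayer) is standard, and pre-LN is generally the more stable choice at scale. A minimal side-by-side sketch, not DiT's actual code:

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm sits AFTER the residual add, so the residual
    stream is repeatedly renormalized -- known to be harder to train when
    activation magnitudes vary widely."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.mlp(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm sits BEFORE each sublayer; the residual path is a
    clean identity, which keeps gradients and activation scales tamer."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```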
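On the "simple MLP for conditioning" point: DiT injects timestep and class information through adaLN-style modulation, in which a shallow network regresses per-channel scale and shift for a normalized hidden state, so all conditioning expressivity flows through that one small module. A minimal sketch of the pattern (dimensions and module names are illustrative):

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Sketch of adaLN-style conditioning: a shallow network maps the
    conditioning embedding (e.g. timestep + class) to per-channel
    scale/shift applied to a normalized hidden state."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 2 * dim))

    def forward(self, x, cond):
        # x: [B, L, dim], cond: [B, cond_dim]
        scale, shift = self.mlp(cond).chunk(2, dim=-1)     # [B, dim] each
        return self.norm(x) * (1 + scale[:, None]) + shift[:, None]
```

The criticism summarized above is that a module this shallow carries all of the conditional signal, limiting how expressively the condition can steer the backbone.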