DiT (Diffusion Transformers)

DiT Suddenly Comes Under Fire; Xie Saining Responds Calmly
量子位· 2025-08-20 07:48
Core Viewpoint
- The article discusses recent criticism of the DiT (Diffusion Transformers) model, a cornerstone of the diffusion-model field, and highlights the importance of scientific scrutiny and empirical validation in research [3][10].

Group 1: Criticism of DiT
- A user has raised multiple concerns about DiT, claiming it is flawed both mathematically and structurally, and even questioning whether DiT contains genuine Transformer components at all [4][12].
- The criticisms draw on a paper titled "TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training," which introduces a training strategy that routes early-layer tokens directly to deeper layers without modifying the architecture or adding parameters (a minimal sketch of this routing idea follows this summary) [12][14].
- The critic argues that the rapid drop in FID (Fréchet Inception Distance) during training indicates that DiT's architecture has inherent properties that make it easy to fit the dataset [15].
- TREAD reportedly trains 14 times faster than DiT after 400,000 iterations and 37 times faster when compared against DiT's best performance at 7 million iterations, a speedup the critic takes as calling the original method into question [16][17].
- The critic further suggests that disabling parts of the network during training could render the network ineffective [19].
- Yet it is observed that the more DiT network units are replaced with identity mappings during training, the better the evaluation results become [20].
- DiT's architecture is said to require logarithmic scaling to represent the signal-to-noise-ratio differences encountered during the diffusion process, which the critic takes as a sign of problematic output dynamics [23].
- Concerns are also raised about the Adaptive Layer Normalization (adaLN) mechanism: the critic contends that DiT processes its conditional inputs through a standard MLP (multi-layer perceptron) and therefore shows no clear Transformer characteristics in how conditioning is handled (see the adaLN sketch after this summary) [25][26].

Group 2: Response from Xie Saining
- Xie Saining, co-author of DiT, responded to the criticisms, asserting that the TREAD findings do not invalidate DiT [27].
- He acknowledges TREAD's contribution but emphasizes that its effectiveness comes from regularization that makes features more robust, not from DiT being incorrect [28].
- Xie highlights that Lightning DiT, an upgraded version of DiT, remains a powerful option and should be the first choice when conditions allow [29].
- He also states that there is no evidence that the post-layer normalization in DiT causes problems [30].
- Xie summarizes the improvements made over the past year, focusing on internal representation learning and various methods for strengthening model training [32].
- He notes that the sd-vae (the Stable Diffusion variational autoencoder used to encode DiT's latents) is a significant concern for DiT, particularly because of its high computational cost when processing images at 256×256 resolution [34].
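To make the routing idea in the TREAD summary concrete, here is a minimal PyTorch-style sketch, assuming a stack of Transformer blocks that each map a (batch, tokens, dim) tensor to a tensor of the same shape. The class name RoutedBackbone, the keep_ratio parameter, and the random token selection are illustrative assumptions, not the TREAD authors' implementation; the point is only that during training a random subset of tokens bypasses the middle blocks and rejoins afterwards, with no new parameters and no architectural change.

```python
# Minimal sketch of training-time token routing. Hypothetical names; not official code.
import torch
import torch.nn as nn

class RoutedBackbone(nn.Module):
    def __init__(self, blocks: nn.ModuleList, skip_start: int, skip_end: int,
                 keep_ratio: float = 0.5):
        super().__init__()
        self.blocks = blocks            # existing blocks, unchanged: no new parameters
        self.skip_start = skip_start    # first block of the routed span
        self.skip_end = skip_end        # one past the last block of the routed span
        self.keep_ratio = keep_ratio    # fraction of tokens that still traverse the span

    def forward(self, x: torch.Tensor, train_routing: bool = True) -> torch.Tensor:
        for block in self.blocks[: self.skip_start]:
            x = block(x)

        if train_routing:
            b, n, d = x.shape
            # pick a random subset of tokens to keep inside the middle blocks
            order = torch.rand(b, n, device=x.device).argsort(dim=1)
            idx = order[:, : int(n * self.keep_ratio)].unsqueeze(-1).expand(-1, -1, d)
            sub = torch.gather(x, 1, idx)                # tokens processed by the span
            for block in self.blocks[self.skip_start : self.skip_end]:
                sub = block(sub)
            x = x.scatter(1, idx, sub)                   # bypassed tokens rejoin untouched
        else:
            # at inference every token goes through every block, as in plain DiT
            for block in self.blocks[self.skip_start : self.skip_end]:
                x = block(x)

        for block in self.blocks[self.skip_end :]:
            x = block(x)
        return x
```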
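The adaLN point in the criticism can likewise be illustrated. The sketch below loosely follows the published adaLN design for DiT-style blocks, in which the conditioning vector c (timestep plus class embedding) is passed through a small MLP that regresses per-block shift, scale, and gate vectors; the class name AdaLNBlock and the layer sizes are illustrative assumptions rather than the exact DiT code.

```python
# Sketch of adaLN-style conditioning: a small MLP maps the conditioning vector c
# to six modulation vectors that rescale the block's normalized activations.
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # element-wise affine modulation applied after LayerNorm
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # the "standard MLP" the critic refers to: it maps c to 6 modulation vectors
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = self.ada(c).chunk(6, dim=-1)
        h = modulate(self.norm1(x), shift_a, scale_a)
        x = x + gate_a.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = modulate(self.norm2(x), shift_m, scale_m)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x
```

For example, AdaLNBlock(384) can be called on a token tensor of shape (2, 256, 384) and a conditioning vector of shape (2, 384); whether such an MLP-based conditioning path counts as a genuine Transformer mechanism is the question the critic raises.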
A New Paradigm Arrives! New Energy-Based Model Breaks the Transformer++ Scaling Ceiling, with 35% Faster Training Scaling
机器之心· 2025-07-07 04:48
Core Insights
- The article discusses Energy-Based Transformers (EBTs), which learn to "think" through unsupervised learning alone, giving the model reasoning capabilities akin to human System 2 thinking [9][10].

Group 1: System 2 Thinking and Model Development
- Human thinking is commonly divided into System 1 (fast thinking) and System 2 (slow thinking), with the latter being crucial for complex tasks [3][4].
- Current large language models excel at System 1 tasks but struggle with System 2 tasks, prompting researchers to look for ways to strengthen System 2 reasoning [4][5].
- EBTs are designed to assign an energy value to each input and candidate prediction, then minimize that energy by gradient descent, simulating a thinking process (a toy sketch of this inference-time refinement follows this summary) [9][10].

Group 2: Performance and Scalability
- In training, EBTs scale about 35% faster than the mainstream Transformer++ recipe across metrics such as data volume and model depth [11].
- At inference, EBTs improve language-task performance by 29% more than Transformer++, indicating that additional "thinking" computation yields better results [12].
- EBTs also excel at image denoising, needing fewer forward passes than diffusion Transformers (DiT) while achieving better results [13].

Group 3: Generalization and Robustness
- EBTs show stronger generalization, particularly on out-of-distribution data, outperforming existing models even when their pre-training performance is similar or slightly worse [14].
- The model can learn and express uncertainty about its predictions, effectively capturing how difficult each token is to predict [62][65].
- EBT performance improves roughly linearly as the distribution shift grows, underscoring their value for cross-distribution generalization [68][69].

Group 4: Experimental Results and Comparisons
- EBTs outperform Transformer++ on several scalability metrics, including data efficiency and computational efficiency, suggesting they will do even better in large-scale training scenarios [46][72].
- Despite slightly higher pre-training perplexity, EBTs achieve lower perplexity on downstream tasks, indicating stronger generalization [74].
- In image denoising, EBTs significantly outperform DiT, reaching better peak signal-to-noise ratios (PSNR; a short reference implementation follows this summary) with 99% fewer forward passes [81][92].
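The "assign an energy, then minimize it" loop described above can be illustrated with a toy sketch. Everything here (the EnergyHead network, the number of refinement steps, the step size) is an illustrative assumption rather than the paper's implementation; the point is only the shape of the procedure: a learned scalar energy scores how well a candidate prediction fits the context, and the candidate is refined by a few gradient steps at inference time.

```python
# Toy sketch of energy-based "thinking": refine a candidate prediction by
# gradient descent on a learned scalar energy. Illustrative, not the EBT paper's code.
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    def __init__(self, ctx_dim: int, pred_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + pred_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, context: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
        # lower energy = the candidate prediction fits the context better
        return self.net(torch.cat([context, y_hat], dim=-1)).squeeze(-1)

def think(energy: EnergyHead, context: torch.Tensor, pred_dim: int,
          steps: int = 8, lr: float = 0.1) -> torch.Tensor:
    """Refine an initial guess by gradient descent on the energy landscape."""
    y_hat = torch.randn(context.shape[0], pred_dim, requires_grad=True)
    for _ in range(steps):                        # more steps = more "thinking"
        e = energy(context, y_hat).sum()
        (grad,) = torch.autograd.grad(e, y_hat)
        y_hat = (y_hat - lr * grad).detach().requires_grad_(True)
    return y_hat.detach()

# Example: refine predictions for a batch of 4 contexts of width 32.
model = EnergyHead(ctx_dim=32, pred_dim=16)
prediction = think(model, torch.randn(4, 32), pred_dim=16)
```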
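For reference, the PSNR metric cited in the denoising comparison is the standard peak signal-to-noise ratio; a minimal implementation for images scaled to [0, 1] is sketched below (higher values mean the reconstruction is closer to the target).

```python
# Standard PSNR in decibels for tensors scaled to [0, max_val].
import torch

def psnr(prediction: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    mse = torch.mean((prediction - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example: a denoised image closer to the target gives a higher PSNR.
clean = torch.rand(1, 3, 64, 64)
noisy = clean + 0.1 * torch.randn_like(clean)
print(psnr(noisy.clamp(0, 1), clean))
```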