LayerNorm
Is DiT mathematically and formally wrong? Saining Xie responds: don't do science in your head
机器之心· 2025-08-20 04:26
Core Viewpoint - The article covers criticisms of the DiT model that allege architectural flaws, and introduces a method called TREAD that reportedly improves training efficiency and image generation quality substantially compared to DiT [1][4][6].

Group 1
- A recent post on X claims that DiT has architectural defects, sparking significant discussion [1].
- Applied to the DiT backbone network, TREAD reportedly achieves a 14x/37x training speedup as measured by FID, along with better generation quality [2][6].
- The post argues that DiT's FID plateaus too early in training, suggesting "latent architectural defects" that prevent the model from learning further from data [4].

Group 2
- TREAD uses a "token routing" mechanism to improve training efficiency without altering the model architecture: it works with a partial token set to preserve information and reduce computational cost [6].
- Saining Xie, an author of the original DiT paper, acknowledges the criticisms and stresses that experimental validation matters more than theoretical assertion [28][33].
- Xie also points out that DiT's architecture has some inherent flaws, particularly its use of post-layer normalization, which is known to be unstable for tasks with large variations in numerical range [13][36].

Group 3
- The article notes that DiT relies on a simple MLP network to process critical conditioning data, which limits its expressive power [16].
- Xie argues that the real issue with DiT lies in its sd-vae component, which is inefficient and has long been overlooked [36].
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, in which existing models are continuously questioned and improved [38].
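For readers unfamiliar with the post-layer-normalization point raised in the summary, the two placements can be contrasted in a few lines. This is a minimal pure-Python sketch, not DiT's actual code: `sublayer` is a hypothetical stand-in for an attention or MLP layer, the learned scale and shift of LayerNorm are omitted, and whether DiT really uses the post-LN arrangement is the summary's claim rather than something verified here.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or an MLP: a fixed elementwise map."""
    return [0.5 * v + 1.0 for v in x]

def post_ln_block(x):
    # Post-LN: normalize after the residual addition, LN(x + f(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # Pre-LN: normalize only the sublayer input, x + f(LN(x)).
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [100.0, -50.0, 25.0, 0.0]  # input with a wide numerical range
post = post_ln_block(x)        # residual stream is rescaled to unit variance
pre = pre_ln_block(x)          # residual stream keeps the original scale of x
```

In the post-LN block the skip path itself passes through the normalization, so the identity connection is rescaled at every layer; in the pre-LN block the input `x` is carried through unchanged and only the sublayer sees normalized values. That difference is the usual explanation for why the two arrangements behave differently on inputs with large numerical ranges.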
X @Polyhedra
Polyhedra· 2025-08-11 09:34
7/ Key insight: Don’t just naively compile an LLM into a circuit. Exploit structure:
- Linear ops (MatMul, LayerNorm) → custom efficient constraints.
- Nonlinear ops (GELU) → fused constraints to slash complexity.
- Parallel-friendly layout to max out modern prover hardware. ...