NeurIPS'25 Oral: Who needs DiT? ByteDance's first autoregressive model generates a 5-second 720p video in one minute on a single GPU
36Kr · 2025-11-14 08:35
Core Insights
- InfinityStar, developed by ByteDance's commercialization technology team, presents a new video generation method that balances quality and efficiency, addressing challenges in computational complexity and resource consumption [2][3][24]

Group 1: InfinityStar Highlights
- InfinityStar is the first discrete autoregressive video generator to surpass diffusion models on VBench [3]
- It eliminates delays in video generation by replacing the slow iterative denoising process with a faster autoregressive approach [3]
- The method supports various tasks including text-to-image, text-to-video, image-to-video, and interactive long video generation [3]

Group 2: Technical Innovations
- The core architecture of InfinityStar uses spatiotemporal pyramid modeling, which unifies image and video tasks while running an order of magnitude faster than mainstream diffusion models [9]
- The model decomposes a video into two parts: the first frame captures static appearance information, while subsequent segments focus on dynamic changes (a hedged sketch of this coarse-to-fine decomposition follows this summary) [10][11]
- InfinityStar employs an efficient visual tokenizer and introduces techniques such as knowledge inheritance and stochastic quantizer depth to accelerate training and improve model performance [14][15]

Group 3: Performance Metrics
- InfinityStar demonstrates superior performance on text-to-image (T2I) and text-to-video (T2V) tasks, achieving excellent results on the GenEval, DPG, and VBench benchmarks and outperforming previous autoregressive models as well as diffusion-based methods [18][21][24]
- On the VBench benchmark, InfinityStar's human preference evaluation score surpassed HunyuanVideo, with particularly strong instruction adherence [22][24]

Group 4: Efficiency
- InfinityStar generates video significantly faster than DiT-based methods, producing a 5-second 720p video in under one minute on a single GPU [24]
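The coarse-to-fine "next-scale" residual pyramid is the general idea behind Infinity/VAR-style discrete autoregressive generators; the sketch below illustrates how such a decomposition could order a video's tokens as static appearance (first frame) followed by per-segment dynamics. The scale schedule, the pooling-based decomposition, and the segment handling are illustrative assumptions, not InfinityStar's actual tokenizer or transformer.

```python
# Minimal sketch of a coarse-to-fine residual pyramid and an appearance-then-dynamics
# token ordering. Illustrative only; not ByteDance's released implementation.
import torch
import torch.nn.functional as F

def residual_pyramid(feat: torch.Tensor, scales=(1, 2, 4, 8)) -> list[torch.Tensor]:
    """Decompose a feature map (B, C, H, W) into coarse-to-fine residual levels.

    Each level keeps only what the coarser levels could not explain, so the levels
    can be predicted autoregressively from coarse to fine.
    """
    levels, residual = [], feat
    for s in scales:
        coarse = F.adaptive_avg_pool2d(residual, output_size=(s, s))   # one pyramid level
        levels.append(coarse)
        recon = F.interpolate(coarse, size=residual.shape[-2:], mode="bilinear",
                              align_corners=False)
        residual = residual - recon            # finer levels model the remaining detail
    return levels

def video_token_order(first_frame: torch.Tensor, clips: list[torch.Tensor]) -> list[torch.Tensor]:
    """Order the sequence as static appearance (first frame) first, then per-segment dynamics.

    The ordering is the point: appearance levels condition all later motion levels in the
    autoregressive factorization.
    """
    sequence = residual_pyramid(first_frame)          # appearance, coarse -> fine
    for clip in clips:
        sequence += residual_pyramid(clip)            # dynamics, one short segment at a time
    return sequence

if __name__ == "__main__":
    frame = torch.randn(1, 16, 32, 32)                        # latent of the first frame
    clips = [torch.randn(1, 16, 32, 32) for _ in range(3)]    # latents of later segments
    seq = video_token_order(frame, clips)
    print([tuple(t.shape[-2:]) for t in seq])                 # (1,1), (2,2), (4,4), (8,8), ...
```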
Is Diffusion really more likely than autoregression to achieve a unified model?
机器之心· 2025-08-31 01:30
Group 1
- The article discusses the potential of Diffusion models to achieve a unified architecture in AI, suggesting that they may surpass autoregressive (AR) models in this regard [7][8][9]
- It highlights the importance of multimodal capabilities in AI development, emphasizing that a unified model is crucial for understanding and generating heterogeneous data types [8][9]
- It notes that while AR architectures have dominated the field, recent breakthroughs of Diffusion Language Models (DLMs) in natural language processing (NLP) are prompting a reevaluation of Diffusion's potential [8][9][10]

Group 2
- The article explains that Diffusion models support parallel generation and fine-grained control, capabilities that AR models struggle to match (a toy contrast of the two decoding styles follows this summary) [9][10]
- It outlines the fundamental differences between AR and Diffusion architectures, indicating that Diffusion serves as a powerful compression framework with inherent support for multiple compression modes [11]
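To make the "parallel generation" point concrete, the toy sketch below contrasts the two loop structures: an AR decoder commits one token per forward pass, while a masked-diffusion-style decoder starts from a fully masked sequence and unmasks many positions per step. The model is a random stand-in, and the step count and confidence-based unmasking rule are assumptions for illustration, not any specific DLM.

```python
# Toy contrast: sequential AR decoding vs. parallel masked-diffusion-style decoding.
# The "model" returns random logits; only the loop structure matters here.
import torch

VOCAB, MASK_ID, SEQ_LEN = 1000, 0, 16

def dummy_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a transformer: (B, L) token ids -> (B, L, VOCAB) logits."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)

def ar_decode() -> torch.Tensor:
    """Autoregressive decoding: one new token per forward pass, left to right."""
    tokens = torch.full((1, 1), MASK_ID)            # a single start token
    for _ in range(SEQ_LEN):
        logits = dummy_logits(tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                            # SEQ_LEN forward passes total

def masked_diffusion_decode(steps: int = 4) -> torch.Tensor:
    """Masked-diffusion-style decoding: start fully masked, unmask several positions per step."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID)
    per_step = max(1, SEQ_LEN // steps)
    for _ in range(steps):
        probs = dummy_logits(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                          # per-position confidence
        conf = conf.masked_fill(tokens != MASK_ID, -1.0)        # never re-pick decoded slots
        idx = conf.topk(per_step, dim=-1).indices[0]            # most confident masked positions
        tokens[0, idx] = pred[0, idx]                           # decoded in parallel
    return tokens                                               # `steps` forward passes total

if __name__ == "__main__":
    print("AR forward passes:", SEQ_LEN, "| diffusion-style forward passes:", 4)
    print(ar_decode().shape, masked_diffusion_decode().shape)
```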
Is DiT mathematically and formally wrong? Saining Xie responds: Don't do science in your head
机器之心· 2025-08-20 04:26
Core Viewpoint
- The article discusses criticisms of the DiT model, highlighting potential architectural flaws, and introduces a new method called TREAD that significantly improves training efficiency and image generation quality compared to DiT [1][4][6]

Group 1
- A recent post on X claims that DiT has architectural defects, sparking significant discussion [1]
- Applied to the DiT backbone network, the TREAD method reportedly delivers a 14x/37x training speedup as measured by FID, indicating better generation quality for the same compute [2][6]
- The post argues that DiT's FID stabilizes too early during training, suggesting it may have "latent architectural defects" that prevent it from learning further from the data [4]

Group 2
- TREAD employs a "token routing" mechanism to enhance training efficiency without altering the model architecture, setting aside a partial token set to preserve its information while reducing computational cost (a hedged sketch of this routing idea follows this summary) [6]
- Saining Xie, an author of the original DiT paper, acknowledges the criticisms and emphasizes the importance of experimental validation over theoretical assertions [28][33]
- Xie also points out that DiT's architecture has some inherent flaws, particularly its use of post-layer normalization, which is known to be unstable for tasks with large variations in numerical range [13][36]

Group 3
- The article mentions that DiT's design relies on a simple MLP network to process critical conditional inputs, which limits its expressive power [16]
- Xie argues that the real issue with DiT lies in its sd-vae component, which is inefficient and has long been overlooked [36]
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, in which existing models are continuously questioned and improved [38]
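The token-routing idea described above can be sketched as: set aside a random subset of tokens before a span of transformer blocks, run only the remaining tokens through those blocks, and reinsert the saved tokens afterwards, so their information is preserved while the expensive middle layers process fewer tokens. The routing rate, the choice of which blocks are bypassed, and the toy pre-LN block below are assumptions for illustration, not TREAD's exact implementation.

```python
# Minimal sketch of training-time token routing around the middle blocks of a transformer.
# Illustrative only; not the TREAD authors' code.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy pre-LN transformer block (stand-in for a DiT block)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

def routed_forward(blocks: nn.ModuleList, x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Route only a random subset of tokens through the middle blocks; carry the rest unchanged."""
    B, L, D = x.shape
    keep = max(1, int(L * keep_ratio))
    kept_idx = torch.randperm(L)[:keep]            # tokens that pay for the middle blocks

    x = blocks[0](x)                               # first block sees all tokens
    routed = x[:, kept_idx]
    for blk in blocks[1:-1]:                       # middle blocks: reduced sequence length
        routed = blk(routed)
    out = x.clone()
    out[:, kept_idx] = routed                      # reinsert processed tokens; the saved
                                                   # (un-routed) tokens keep their information
    return blocks[-1](out)                         # last block again sees all tokens

if __name__ == "__main__":
    blocks = nn.ModuleList([Block(64) for _ in range(4)])
    tokens = torch.randn(2, 32, 64)                # (batch, tokens, dim), e.g. patchified latents
    print(routed_forward(blocks, tokens).shape)    # torch.Size([2, 32, 64])
```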