New work from Fei-Fei Li's team: editing DiT architectures without retraining, halving model depth while improving quality
机器之心·2025-06-10 08:41

Core Insights

- The article discusses "Grafting", a technique that lets researchers explore new model architecture designs by editing pre-trained Diffusion Transformers (DiTs) instead of training from scratch [1][5][15].
- Model architecture design is central to machine learning: it defines the model's functions, operator choices, and configuration settings [2].
- The high cost of training models from scratch makes it difficult to study new architectures, particularly for generative models [3][4].

Grafting Process

The grafting process consists of two main stages (a minimal code sketch of the first stage follows this summary):

1. Activation distillation: the functionality of each original operator is transferred to its new replacement through a regression objective [6].
2. Lightweight fine-tuning: limited data is used for tuning to mitigate the error propagation caused by integrating multiple new operators [7][18].

Experimental Findings

- A testing platform based on DiT-XL/2 was built to study how grafting affects model quality [11].
- Grafting produced hybrid designs that replace Softmax attention with gated convolutions, local attention, and linear attention, reaching good quality with less than 2% of the pre-training compute [12][13].
- A case study showed that grafting can convert sequential Transformer modules into parallel modules, halving model depth while achieving higher quality than other models of the same depth [14] (see the parallel-block sketch after this summary).

Self-Grafting

- Self-grafting is a control setup in which existing operators are replaced by operators of the same type but with randomly initialized weights, providing a performance baseline for the grafting procedure itself [21] (a sketch also appears after this summary).
- The choice of regression objective significantly affects performance, with specific objectives yielding better initialization quality [25][26][27].

Experimental Results on MHA and MLP

- The experiments show that grafting is effective for building efficient hybrid architectures with good generative quality under smaller compute budgets [41].
- The grafted model achieved a 1.43x speedup in actual computation time with minimal loss in generative quality [42][43].
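The activation-distillation stage can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's implementation: the names `old_op`, `new_op`, and `activation_batches` are hypothetical, and a plain L2 regression loss stands in for whatever objective the authors actually use.

```python
# Hypothetical sketch of activation distillation in PyTorch: freeze the
# pretrained operator, feed both operators the same activations, and regress
# the new operator's outputs onto the old operator's outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_operator(old_op: nn.Module, new_op: nn.Module,
                     activation_batches, steps: int = 1000, lr: float = 1e-4):
    """Initialize `new_op` by regressing its outputs onto `old_op`'s outputs."""
    old_op.eval()
    for p in old_op.parameters():
        p.requires_grad_(False)          # the pretrained operator stays frozen

    opt = torch.optim.AdamW(new_op.parameters(), lr=lr)
    for _, x in zip(range(steps), activation_batches):
        with torch.no_grad():
            target = old_op(x)           # functionality to be transferred
        pred = new_op(x)
        loss = F.mse_loss(pred, target)  # example L2 regression objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_op
```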
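The depth-halving case study can likewise be sketched: the idea is that attention and MLP sub-modules taken from two consecutive blocks are rewired to run in parallel on the same input, so two sequential blocks collapse into one. The `ParallelBlock` class below is an illustrative reconstruction under that assumption, not code from the paper.

```python
# A minimal sketch of turning two sequential DiT-style blocks into one block
# whose attention and MLP branches run in parallel, halving depth. `attn` and
# `mlp` stand in for pretrained sub-modules taken from consecutive blocks.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.attn = attn   # e.g. attention taken from block 2k
        self.mlp = mlp     # e.g. MLP taken from block 2k + 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches read the same input and their outputs are summed,
        # instead of the usual attn -> mlp sequence across two blocks.
        return x + self.attn(self.norm_attn(x)) + self.mlp(self.norm_mlp(x))
```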
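For the self-grafting control, the replacement operator shares the original operator's architecture but carries freshly initialized weights, so any quality gap isolates the grafting procedure itself rather than a new operator design. The helper below sketches one way to express this in PyTorch (names are hypothetical); the resulting operator can then be passed to the same distillation routine shown above.

```python
# Self-grafting control (sketch): copy the operator, then re-initialize its
# linear and convolutional layers so the type is unchanged but the weights
# are random.
import copy
import torch.nn as nn

def self_graft_copy(old_op: nn.Module) -> nn.Module:
    new_op = copy.deepcopy(old_op)

    def reinit(m: nn.Module):
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            m.reset_parameters()          # discard pretrained weights

    new_op.apply(reinit)                  # same operator type, random init
    return new_op
```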