Grafting (嫁接)

Fei-Fei Li's team proposes a new approach to architecture design: no training from scratch, just "grafting" key components of pre-trained models
量子位· 2025-06-20 05:53
Core Viewpoint
- The article discusses the potential of using pre-trained models as a foundation for exploring new architecture designs, highlighting a method called "Grafting" that lets researchers modify components of existing models to study new architectures efficiently [1][2][7].

Summary by Sections

Introduction to Grafting
- Researchers propose "Grafting" as a new approach to reduce the high cost of training models from scratch, allowing efficient exploration of new architectures [2][7].

Focus on the DiT Model
- The research centers on DiTs, widely used for image and video generation; a testing platform was built to assess the impact of grafting on model quality [4][5].

Results of Grafting
- Many hybrid designs achieved performance comparable to the original model while using less than 2% of the pre-training compute [5][22].
- Applying grafting to the PixArt-Σ model yielded a 1.43x increase in generation speed with a quality decrease of less than 2% [6][23].

Two-Stage Architecture Editing Method
- Grafting edits the pre-trained DiTs in two stages, activation distillation followed by lightweight fine-tuning (a code sketch of both stages follows this summary) [11][16].

Challenges in Implementation
- Two main challenges are identified: initializing new operators before integration, and mitigating the error that accumulates across multiple replacements [14][15].

Experimental Results
- Three experiments were conducted:
  1. **Hybrid Architecture Experiment**: validated the feasibility of replacements, showing that replacing 50% of the attention layers produced only a slight increase in FID [20].
  2. **Text-to-Image Generation Experiment**: demonstrated the effectiveness of the new architecture, with a significant speed improvement and minimal quality loss [23].
  3. **Parallelization Experiment**: showed that restructuring the model into parallel blocks improved generation quality while reducing depth [25][26].

Limitations and Future Potential
- The research is limited to the DiT-XL/2 model and specific replacement types, which may limit how far the findings generalize [27].
- Despite these limitations, grafting shows significant potential for exploring new model architectures, especially in resource-constrained environments [28].
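To make the two-stage editing method concrete, here is a minimal PyTorch sketch of both stages as the summary describes them. All names (`teacher_op`, `student_op`, `loss_fn`, the optimizer settings) are assumptions for illustration, not the team's actual code.

```python
import torch
import torch.nn as nn

def activation_distillation(teacher_op, student_op, loader, steps=1000, lr=1e-4):
    """Stage 1: regress the new operator onto the frozen original's activations."""
    teacher_op.eval()
    opt = torch.optim.AdamW(student_op.parameters(), lr=lr)
    for _, x in zip(range(steps), loader):       # loader yields operator inputs
        with torch.no_grad():
            target = teacher_op(x)               # original operator's output
        loss = nn.functional.mse_loss(student_op(x), target)  # L2 regression objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student_op

def lightweight_finetune(model, loss_fn, loader, steps=500, lr=1e-5):
    """Stage 2: brief end-to-end tuning to damp error accumulated across swaps."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, batch in zip(range(steps), loader):
        loss = loss_fn(model, batch)             # e.g. the usual diffusion objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

The design point both articles emphasize is that stage 1 is purely local, each new operator only has to imitate the operator it replaces, which is why stage 2 can stay lightweight and the whole procedure fits in under 2% of the pre-training compute.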
New work from Fei-Fei Li's team: editing a DiT's architecture without retraining halves model depth while improving quality
机器之心· 2025-06-10 08:41
Core Insights
- The article discusses a technique called "Grafting" that lets researchers explore new model architecture designs by editing pre-trained Diffusion Transformers (DiTs) instead of starting from scratch [1][5][15].
- Model architecture design is crucial in machine learning, defining model functions, operator selections, and configuration settings [2].
- The high cost of training models from scratch makes researching new architectures difficult, particularly for generative models [3][4].

Grafting Process
- The grafting process consists of two main stages:
  1. Activation Distillation: transfers the functionality of original operators to new operators through regression objectives [6].
  2. Lightweight Fine-tuning: uses a limited amount of data to mitigate the error propagation caused by integrating multiple new operators [7][18].

Experimental Findings
- A testing platform based on DiT-XL/2 was developed to study the impact of grafting on model quality [11].
- Grafting led to hybrid designs that replace Softmax attention with gated convolutions, local attention, and linear attention, achieving good quality with less than 2% of the pre-training compute [12][13].
- A case study demonstrated that grafting can convert sequential Transformer modules into parallel modules, halving model depth while achieving higher quality than other models of the same depth (see the ParallelPair sketch below) [14].

Self-Grafting
- Self-grafting is introduced as a control setup in which existing operators are replaced by the same operator type with randomly initialized weights, providing a performance benchmark (a sketch follows below) [21].
- The choice of regression objective significantly affects performance, with specific objectives yielding better initialization quality [25][26][27].

Experimental Results on MHA and MLP
- The experiments showed that grafting is effective for constructing efficient hybrid architectures with good generative quality under smaller computational budgets [41].
- The grafted model achieved a 1.43x speedup in real-time computation while maintaining minimal loss in generative quality [42][43].
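The depth-halving case study pairs consecutive blocks and applies them to the same input. Assuming each DiT block computes `x + f(x)` and exposes its non-identity path via a hypothetical `residual_branch` method, a minimal sketch:

```python
import torch.nn as nn

class ParallelPair(nn.Module):
    """Two formerly sequential blocks applied to the same input, with their
    residual branches summed, so two layers of depth collapse into one."""
    def __init__(self, block_a, block_b):
        super().__init__()
        self.block_a = block_a
        self.block_b = block_b

    def forward(self, x):
        # Sequential original: y = x + f_a(x); out = y + f_b(y)
        # Parallel graft:      out = x + f_a(x) + f_b(x)
        # residual_branch is a hypothetical accessor for the block's f(x) path.
        return x + self.block_a.residual_branch(x) + self.block_b.residual_branch(x)

def halve_depth(blocks):
    """Fold 2N sequential blocks into N parallel pairs (depth halved)."""
    return nn.ModuleList(ParallelPair(a, b) for a, b in zip(blocks[::2], blocks[1::2]))
```

Each grafted pair would then be initialized by activation distillation against the two blocks it replaces, following the same two-stage recipe sketched after the first summary.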
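Self-grafting, the control condition, fits the same machinery: the replacement keeps the original operator's architecture but starts from freshly randomized weights. A hedged sketch reusing `activation_distillation` from the first snippet; the `model.blocks[i].attn` layout and the re-initialization scheme are assumptions, not details from the paper:

```python
import copy
import torch.nn as nn

def self_graft(model, layer_idx, loader):
    """Control: same operator type, randomly re-initialized, then distilled.
    Benchmarks how much quality the grafting procedure itself recovers."""
    teacher = model.blocks[layer_idx].attn     # hypothetical DiT block layout
    student = copy.deepcopy(teacher)
    for p in student.parameters():             # crude uniform re-init; sketch only
        nn.init.normal_(p, std=0.02)
    student = activation_distillation(teacher, student, loader)
    model.blocks[layer_idx].attn = student
    return model
```

Because the replacement can in principle recover the teacher exactly, any remaining quality gap isolates the cost of the grafting procedure itself, which is what makes it a useful baseline for the genuinely new operators.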