Representation Alignment
Saining Xie's REPA gets a major upgrade, with fewer than 4 lines of code
机器之心 · 2025-12-13 04:59
Core Insights
- The article discusses the importance of spatial structure over global semantic information in representation alignment for generative models, specifically in the context of diffusion models [1][3][42].

Group 1: Research Findings
- A joint team from Adobe Research, the Australian National University, and New York University conducted an empirical analysis spanning 27 different visual encoders and model sizes [2].
- The unexpected result: spatial structure, rather than global semantic performance, is what drives a target representation's usefulness for generation [3][8].
- The study introduces "spatial self-similarity" to quantify spatial structure, measuring how clearly a feature map encodes local "texture" and "relationships" [15][17].

Group 2: iREPA Methodology
- The team developed a simple method called iREPA that accelerates convergence across a range of visual encoders and training variants [5][20].
- iREPA's core modifications: replace the MLP projection layer with a convolutional layer to better preserve local spatial relationships, and introduce a spatial normalization layer to enhance spatial contrast [20][21][22].

Group 3: Performance Improvements
- iREPA delivers significant convergence-speed improvements across various diffusion transformers and visual encoders, demonstrating robustness and broad applicability [26][27].
- The gains from iREPA grow as model size increases, consistent with the "Scaling Law" trend [34].
- Visual quality improves as well: images generated with iREPA show better object outlines, texture details, and overall structural coherence than those from standard REPA [36].

Group 4: Conclusion
- The research emphasizes that, for generative models, understanding the spatial relationships between pixels matters more than optimizing a single metric such as ImageNet accuracy [42].
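The spatial self-similarity idea described above can be illustrated with a minimal sketch: for a single feature map, compare every spatial token against every other token via cosine similarity, yielding a map whose contrast reflects how much spatial structure the encoder preserves. The function name and shape conventions below are assumptions for illustration, not the paper's exact metric.

```python
import numpy as np

def spatial_self_similarity(features):
    """Cosine-similarity map between all spatial tokens of one feature map.

    `features` has shape (N, D): N spatial tokens, D channels.
    Illustrative sketch only; the paper's precise definition may differ.
    """
    # L2-normalize each token so dot products become cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    # (N, N) matrix: entry (i, j) is the similarity of tokens i and j.
    return unit @ unit.T

# Toy feature map: 4 spatial tokens, 8 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
sim = spatial_self_similarity(feats)
```

A feature map with crisp local structure produces a similarity matrix with strong contrast (near-1 blocks for related regions, low values elsewhere), which is the "clarity" the metric is after.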
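The two iREPA modifications, a convolutional projection in place of the MLP plus a spatial normalization layer, can be sketched as a toy NumPy implementation. Everything here is a hedged assumption: `conv3x3`, `spatial_norm`, and `irepa_project` are hypothetical names, and the paper's exact normalization may differ; this sketch normalizes each channel to zero mean and unit variance over its spatial positions to boost spatial contrast.

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3).

    A convolutional projection mixes each token with its spatial neighbors,
    unlike a per-token MLP, so local spatial relationships survive projection.
    """
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad spatial dims
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + 3, j:j + 3]            # (C_in, 3, 3)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def spatial_norm(x, eps=1e-6):
    """Per-channel normalization over spatial positions (an assumption about
    what the 'spatial normalization layer' does)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)

def irepa_project(x, w):
    """Hypothetical iREPA-style projection: conv, then spatial normalization."""
    return spatial_norm(conv3x3(x, w))

# Toy demo: an 8-channel 5x5 feature map projected to 16 channels.
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 5, 5))
w = rng.standard_normal((16, 8, 3, 3))
y = irepa_project(x, w)
```

The point of the sketch is the data flow, not efficiency; in practice this would be a `Conv2d` plus a normalization layer, still only a few lines of model code.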