A Twitter Argument Produces a Paper: Saining Xie's Team's New iREPA Needs Only 3 Lines of Code
36Kr · 2025-12-16 09:42
Core Insights
- The online debate initiated by a Twitter user led to a complete academic paper, demonstrating the potential of collaborative discussion in academia [2][4][15].

Group 1: Academic Discussion and Collaboration
- The initial discussion emphasized the need for self-supervised learning (SSL) models to focus on dense tasks rather than solely on classification scores from ImageNet-1K [4].
- The debate involved various participants, including a notable contribution from a user who suggested a comparative analysis between different models [11].
- The outcome of the discussion was a paper that provided deeper insights into the relationship between representation quality and generative performance [15].

Group 2: Research Findings
- The paper concluded that spatial structure, rather than global semantic information, is the primary driver of generative performance in these models [18].
- Larger visual encoders do not necessarily lead to better generation results; in fact, encoders with lower accuracy can outperform those with higher accuracy [18][21].
- The research highlighted the importance of spatial information, showing that even classic hand-crafted spatial features such as SIFT and HOG can provide competitive improvements [22].

Group 3: Methodological Innovations
- The study proposed modifications to the existing representation alignment framework (REPA), introducing iREPA, which better preserves spatial structure [24].
- Simple changes, such as replacing the standard MLP projection layer with a convolutional layer, were shown to significantly improve performance [25].
- iREPA can be integrated into various representation alignment methods with minimal code, yielding faster convergence across different training schemes [25].
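The "MLP projection layer replaced with a convolutional layer" change described above can be sketched as follows. This is a minimal illustration assuming ViT-style token grids; the module shapes, dimensions, and names are my assumptions, not the paper's actual code:

```python
import torch
import torch.nn as nn

# REPA-style projector: an MLP applied to each token independently,
# which ignores the 2D neighborhood structure of the feature map.
mlp_proj = nn.Sequential(nn.Linear(768, 1024), nn.SiLU(), nn.Linear(1024, 1024))

# iREPA-style projector (illustrative): a 3x3 convolution over the
# token grid, so each projected token mixes in its spatial neighbors.
conv_proj = nn.Conv2d(768, 1024, kernel_size=3, padding=1)

tokens = torch.randn(2, 256, 768)                      # (batch, 16x16 tokens, dim)
grid = tokens.transpose(1, 2).reshape(2, 768, 16, 16)  # tokens -> 2D feature map
out = conv_proj(grid).flatten(2).transpose(1, 2)       # back to (batch, 256, 1024)
```

The convolution sees a 3x3 spatial neighborhood per output token, which is one plausible way a projector could retain local spatial relationships that a per-token MLP discards.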
A Twitter Argument Produces a Paper! Saining Xie's Team's New iREPA Needs Only 3 Lines of Code
量子位 (QbitAI) · 2025-12-16 05:58
Core Viewpoint
- The article discusses iREPA, a new academic paper inspired by an online debate about self-supervised learning (SSL) models and their application to dense tasks; it emphasizes the importance of spatial structure over global semantic information for generation quality [3][17][25].

Group 1: Background and Development
- The discussion that led to the iREPA paper originated in a Twitter debate, where a user argued that SSL models should focus on dense tasks rather than global classification scores [8][12].
- Following the debate, multiple teams collaborated to produce a complete paper based on the initial discussion; the resulting method requires only three lines of code to implement [3][30].

Group 2: Key Findings
- Better global semantic information does not equate to better generation quality; instead, spatial structure is the primary driver of representation-generation performance [25][30].
- Visual encoders with low linear-probing accuracy (around 20%) can outperform those with high accuracy (over 80%) in generation quality [25].

Group 3: Methodology and Innovations
- The study involved a large-scale quantitative correlation analysis covering 27 different visual encoders and three model sizes, highlighting the significance of spatial information [26][28].
- The iREPA framework was proposed as an improvement to the existing representation alignment (REPA) framework, with modifications including replacing the standard MLP projection layer with a convolutional layer and introducing a spatial normalization layer [30][31].

Group 4: Practical Implications
- iREPA can be integrated into any representation alignment method with minimal code changes and improves performance across various training schemes [32].
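The "spatial normalization layer" mentioned above is not specified in this summary; one hedged reading is a normalization of each feature channel across spatial positions, so that contrast between locations dominates over global channel scale. The axis choice and epsilon below are assumptions for illustration only:

```python
import torch

def spatial_norm(feat, eps=1e-6):
    # Normalize each channel over the spatial (token) axis, so that
    # differences between positions, not global channel scale, dominate.
    # feat: (batch, tokens, dim); the normalization axis is an assumption.
    mean = feat.mean(dim=1, keepdim=True)
    std = feat.std(dim=1, keepdim=True)
    return (feat - mean) / (std + eps)

x = torch.randn(2, 256, 768)   # (batch, 16x16 tokens, dim)
y = spatial_norm(x)            # each channel now has ~zero mean over positions
```

After this step, the alignment target's per-position variation is emphasized, which is consistent with the stated goal of enhancing spatial contrast.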
Saining Xie's REPA Substantially Improved with Fewer Than 4 Lines of Code
机器之心 (Synced) · 2025-12-13 04:59
Core Insights
- The article discusses the importance of spatial structure over global semantic information in representation alignment for generative models, specifically diffusion models [1][3][42].

Group 1: Research Findings
- A joint team from Adobe Research, the Australian National University, and New York University conducted an empirical analysis of 27 different visual encoders across multiple model sizes [2].
- The unexpected result: spatial structure, rather than global performance, drives the generative performance of target representations [3][8].
- The study introduced the concept of Spatial Self-Similarity to quantify spatial structure, measuring the clarity of "texture" and "relationships" in feature maps [15][17].

Group 2: iREPA Methodology
- The team developed a simple method called iREPA, which improves the convergence speed of various visual encoders and training variants [5][20].
- iREPA's core modifications are replacing the MLP projection layer with a convolutional layer, to better preserve local spatial relationships, and introducing a spatial normalization layer to enhance spatial contrast [20][21][22].

Group 3: Performance Improvements
- iREPA demonstrated significant convergence-speed improvements across various diffusion transformers and visual encoders, showing robustness and general applicability [26][27].
- As model size increases, the performance gains from iREPA also increase, consistent with the "Scaling Law" trend [34].
- Visual quality also improved: iREPA-generated images exhibit better object outlines, texture details, and overall structural coherence than standard REPA [36].

Group 4: Conclusion
- The research emphasizes that, for generative models, understanding spatial relationships between pixels is more crucial than a single metric like ImageNet accuracy [42].
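A common way to quantify spatial structure of the kind Spatial Self-Similarity describes is the pairwise cosine-similarity map between a feature map's spatial tokens; the paper's exact definition may differ, so treat this as an assumption-laden sketch:

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    # Pairwise cosine similarity between spatial tokens.
    # feat: (tokens, dim) -> (tokens, tokens) similarity map.
    # A "sharper" map (clear block structure) would indicate stronger
    # spatial organization of the features.
    f = F.normalize(feat, dim=-1)
    return f @ f.T

feat = torch.randn(256, 768)   # e.g. a 16x16 grid of 768-d tokens
sim = self_similarity(feat)    # (256, 256), symmetric, unit diagonal
```

Comparing such maps across encoders would let one correlate spatial structure with generation quality independently of ImageNet accuracy, which is the kind of analysis the article reports.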
NeurIPS 2025 Oral | One Token, Zero Cost: REG Makes Diffusion Training Converge 20x Faster!
机器之心 (Synced) · 2025-11-29 01:49
Core Insights
- REG is a simple and effective method that significantly accelerates the training convergence of generative models by introducing a class token, enhancing the performance of diffusion models [2][9][38].

Group 1: Methodology
- REG combines low-level latent representations with a high-level class token from a pre-trained visual model, adding noise to and denoising both jointly during training [9][14].
- Training requires only one additional token, a computational overhead of less than 0.5%, and no increase in inference cost [9][10][26].
- On ImageNet 256×256, REG achieves 63x and 23x faster convergence than SiT-XL/2 and SiT-XL/2+REPA, respectively [10][17].

Group 2: Performance Metrics
- On FID, REG significantly outperforms REPA, reaching an FID of 1.8 after 4 million training steps, while SiT-XL/2+REPA reaches 5.9 [17][19].
- REG reduces training time by 97.90% relative to SiT-XL/2 while maintaining similar FID scores [24][25].
- Inference overhead is minimal: parameters, FLOPs, and latency each increase by less than 0.5%, while FID improves by 56.46% over SiT-XL/2+REPA [26][27].

Group 3: Ablation Studies
- Extensive ablation studies demonstrate REG's effectiveness, showing that high-level global discriminative information significantly enhances generation quality [28][30].
- Using the DINOv2 class token yields the best generation quality, underscoring the importance of high-level semantic guidance [30][31].

Group 4: Conclusion
- Overall, REG is a highly efficient training paradigm that entangles high-level and low-level tokens during training, promoting an "understanding-generation" decoupling in generative models and delivering superior generation without increasing inference cost [38].
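REG's "one extra token" can be sketched as appending a pre-trained encoder's class token to the noisy latent tokens before they enter the diffusion transformer. The shapes, the source of the class token, and the concatenation order below are illustrative assumptions, not REG's actual implementation:

```python
import torch

batch, n_latent, dim = 2, 256, 1152
latent_tokens = torch.randn(batch, n_latent, dim)  # noisy image latents
class_token = torch.randn(batch, 1, dim)           # e.g. a projected DINOv2 [CLS] token

# REG-style input (sketch): the class token is noised and denoised jointly
# with the latents; only one token is appended, so the extra compute is
# a tiny fraction of the sequence (1/257 here, under 0.5%).
x = torch.cat([latent_tokens, class_token], dim=1)  # (batch, 257, dim)
```

Because sequence length grows by a single token, the reported sub-0.5% overhead in parameters, FLOPs, and latency is plausible on its face: attention and MLP cost scale with sequence length, and 257/256 is about a 0.4% increase.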