Self-Supervised Learning (SSL)
A Twitter argument turns into a paper: Saining Xie's team's new work iREPA needs only 3 lines of code
36Kr · 2025-12-16 09:42
For real academic exchange, look to Twitter. Saining Xie has just revealed that his team's new work iREPA actually grew out of a debate with a netizen more than four months ago. That brief online exchange ended with Xie being persuaded by the netizen, but more than three months later it had an unexpected sequel: multiple teams collaborated along this line of thought to write a complete paper, whose core framework needs only 3 lines of code. The acknowledgments even thank the netizen who joined the original discussion.

A paper sparked by a tweet

Here is what happened. In August, a netizen posted:

"Stop obsessing over ImageNet-1K classification scores! Self-supervised learning (SSL) models should be trained specifically for dense tasks (e.g., REPA, VLMs), because what those tasks truly rely on is the spatial and local information in the patch tokens, not the global classification performance represented by the [CLS] token."

(Note: a dense task is a computer vision task that requires the model to make a prediction for every pixel or every local region of an image; such tasks need precise spatial and local detail, not just a global class label.)

To this, Xie replied:

"No, using patch tokens does not mean you are doing a dense task. The performance of VLMs and REPA is highly correlated with their IN1K scores, and only weakly correlated with patch-level correspondence. This is not a problem with the [CLS] token, but one of high-level semantics versus low-level pixels ...
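The patch-token versus [CLS]-token distinction at the heart of the argument can be illustrated with a minimal sketch. The shapes below are the common ViT-B/16 defaults for a 224×224 input (one [CLS] token plus a 14×14 grid of patch tokens, dimension 768); all variable names are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical output of a ViT encoder for one image:
# 1 [CLS] token followed by 196 patch tokens, each 768-dimensional.
tokens = np.zeros((1 + 196, 768))

cls_token = tokens[0]      # global summary vector, used for classification
patch_tokens = tokens[1:]  # spatial/local features, used by dense tasks
patch_grid = patch_tokens.reshape(14, 14, 768)  # recover the 2D patch layout
```

Dense tasks read from `patch_tokens` (or the reshaped `patch_grid`), while ImageNet-style classification reads only `cls_token`; the debate is about how strongly the quality of the former is predicted by benchmarks built on the latter.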
A Twitter argument turns into a paper! Saining Xie's team's new work iREPA needs only 3 lines of code
量子位 (QbitAI) · 2025-12-16 05:58
henry, via the QbitAI official account (量子位)

$${\cal L}_{\rm Gram}=\left\|{\bf X}_{S}\cdot{\bf X}_{S}^{\top}-{\bf X}_{G}\cdot{\bf X}_{G}^{\top}\right\|_{\rm F}^{2}.\tag{2}$$

$${\cal L}_{\rm Ref}=w_{\rm D}{\cal L}_{\rm D ...
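Eq. (2) compares patch-level similarity structure rather than the features themselves: it penalizes the squared Frobenius-norm distance between the Gram matrices of the student patch features X_S and the reference (Gram-teacher) patch features X_G. A minimal NumPy sketch, with illustrative names and no claim to match the paper's actual implementation:

```python
import numpy as np

def gram_loss(x_s: np.ndarray, x_g: np.ndarray) -> float:
    """Eq. (2): L_Gram = || X_S X_S^T - X_G X_G^T ||_F^2.

    x_s, x_g: patch-feature matrices of shape (num_patches, dim) for the
    student and the Gram teacher, respectively.
    """
    gram_s = x_s @ x_s.T  # (P, P) pairwise patch similarities, student
    gram_g = x_g @ x_g.T  # (P, P) pairwise patch similarities, teacher
    return float(np.linalg.norm(gram_s - gram_g, ord="fro") ** 2)
```

Because only the P×P similarity pattern enters the loss, the student is anchored to the teacher's spatial structure while its individual feature vectors remain free to drift, which is the intuition behind using this as a regularizer against dense-feature degradation.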
Meta's blockbuster DINOv3: a new peak for visual self-supervision, with a 7B model sweeping SOTA across tasks
自动驾驶之心 · 2025-08-16 16:04
Core Insights
- The article discusses the advancements in self-supervised learning (SSL) with the introduction of DINOv3, which aims to overcome the challenges of data dependency and annotation costs in computer vision [4][9][57]
- DINOv3 is positioned as a versatile self-supervised model capable of handling various tasks without the need for fine-tuning, thus enhancing its practical applicability across different fields [57]

Group 1: Challenges in Self-Supervised Learning
- The development of self-supervised visual models has faced three major bottlenecks: data quality control, dense feature degradation, and limited adaptability to various scenarios [12][13]
- DINOv3 aims to address these challenges by creating a robust foundational model that can provide high-quality dense features and adapt to a wide range of applications [12][57]

Group 2: Technical Innovations of DINOv3
- DINOv3 incorporates a novel data construction strategy, utilizing a dataset of 1.689 billion images through a layered filtering and mixed sampling approach, which significantly enhances the quality of training data [16][18]
- The training process employs fixed hyperparameters and a 7-billion-parameter Vision Transformer (ViT), allowing for consistent learning from vast amounts of data without the complications of dynamic scheduling [20][22]
- The introduction of Gram Anchoring addresses the issue of dense feature degradation, improving the spatial specificity of local features during training [24][25]

Group 3: Performance and Versatility
- DINOv3 demonstrates superior performance across various tasks, including segmentation, depth estimation, and 3D matching, surpassing previous self-supervised models and even some supervised models [41][44]
- The model's ability to adapt to high-resolution inputs and its multi-modal capabilities, such as text alignment, further enhance its utility in real-world applications [31][36]
- DINOv3's family of models caters to diverse deployment needs, from edge devices to high-performance computing, making it suitable for industrial, remote sensing, and medical imaging applications [50][57]