DINO
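Saining Xie's new work: VAE retires, RAE takes over
QbitAI · 2025-10-14 08:16
By Shi Ling, from Aofeisi. QbitAI | WeChat official account QbitAI
Has the once-dominant VAE finally been declared "retired"? The latest research from Saining Xie's team gives an answer: the VAE era is over, and RAE will carry the baton forward.
The Representation Autoencoder (RAE) is a new kind of autoencoder for training Diffusion Transformers (DiT). Its core design pairs a pretrained representation encoder (such as DINO, SigLIP, or MAE) with a trained lightweight decoder, replacing the VAE (variational autoencoder) that conventional diffusion models depend on.
The new structure delivers high-quality reconstructions and a semantically rich latent space, and it supports scalable Transformer-based architectures.
The method converges faster without any additional representation-alignment loss. Using a DiT variant equipped with a lightweight, wide DDT head, the team achieves strong image generation results on ImageNet: at 256×256 resolution, FID = 1.51 with no guidance; at 256×256 and 512 ...
Details follow.
VAE retires, RAE takes over
Diffusion Transformers have made great progress, yet most models still build their latent space on the old SD-VAE from 2021. This raises several core problems: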
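No PhD, so what kind of AI researcher is that? LeCun's papers now need approval from a 28-year-old dropout, and his pointed post escalates the infighting
36Kr · 2025-09-05 03:44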
Core Viewpoint - The internal conflict at Meta regarding AI research and leadership dynamics has intensified, particularly between Chief Scientist Yann LeCun and newly appointed Chief AI Officer Alexandr Wang, highlighting differing views on the role and standards of AI researchers versus engineers [1][3][15].
Group 1: Internal Dynamics
- LeCun's recent post suggests a critique of Wang's qualifications and approach, emphasizing that true AI researchers should have a PhD, publish papers, and contribute to open-source projects [2][3][15].
- The restructuring of Meta's AI teams has led to concerns that Wang's TBD Lab will oversee and influence the research output of LeCun's FAIR, blurring the lines between engineering and research [13][23].
- LeCun's position at Meta appears precarious, as he must now report to the younger Wang and seek approval for his publications, which he views as a threat to the independence of FAIR [3][19][23].
Group 2: Academic Standards and Achievements
- LeCun, a Turing Award winner and a prominent figure in AI, has a significant academic record with over 80 papers published since 2022 and a citation count exceeding 424,000, contrasting sharply with Wang's limited academic output [8][9][21].
- Wang, despite being a successful entrepreneur and the youngest self-made billionaire, lacks a PhD and has only a handful of publications with a citation count of 409, raising questions about his authority in a research-driven environment [6][7][8].
Group 3: Strategic Implications
- The ongoing conflict reflects broader strategic challenges for Meta as it seeks to compete in the AGI space against companies like OpenAI and Google, prioritizing rapid product development over long-term academic research [19][23].
- LeCun's vision for AI research emphasizes the need for new paradigms rather than just scaling existing models, which contrasts with Wang's focus on immediate results and product implementation [17][19].
- The shifting priorities within Meta's AI strategy have led to concerns about the future of open research and the potential departure of key figures like LeCun, who may seek opportunities outside the company [23][24].
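Meta's vision backbone DINOv3 makes a triumphant return: self-supervised learning comprehensively surpasses weak supervision for the first time, open-sourced for commercial use
Synced (机器之心) · 2025-08-15 03:29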
Core Viewpoint - The article discusses the advancements in computer vision, particularly focusing on the development and capabilities of the DINO series of models, emphasizing the transition from supervised to self-supervised learning paradigms in AI [2][15][29].
Group 1: DINO Model Evolution
- DINO, DINOv2, and DINOv3 represent significant milestones in self-supervised learning, with DINOv3 achieving state-of-the-art performance across various tasks without the need for labeled data [2][15][31].
- DINOv3 has expanded its training dataset to 1.7 billion images and model parameters to 7 billion, significantly enhancing its capabilities compared to its predecessors [9][31][36].
- The introduction of innovative techniques in DINOv3, such as Gram Anchoring and RoPE, has improved the model's ability to generate high-resolution dense features, addressing limitations seen in DINOv2 [18][24][28].
Group 2: Performance Metrics
- DINOv3 outperforms previous models in multiple benchmarks, achieving a segmentation score of 55.9, depth estimation of 0.309, and video tracking accuracy of 83.3, showcasing its superior performance in dense prediction tasks [17][31].
- The model's performance in image classification tasks is also notable, with an accuracy of 90.4 on ImageNet ReaL, indicating its robustness across various applications [17][31].
Group 3: Practical Applications
- DINOv3 is being utilized in real-world applications, such as analyzing satellite images for environmental monitoring and supporting climate finance processes, demonstrating its practical impact [39][40].
- The model's ability to operate effectively without fine-tuning makes it suitable for edge applications where multiple visual prediction tasks need to be executed simultaneously [34][36].
Group 4: Community Engagement and Accessibility
- Meta has open-sourced DINOv3, providing a complete backbone network and evaluation heads for community use, facilitating further research and development [13][36].
- The model family includes various distilled versions to cater to different computational needs, ensuring accessibility for researchers and developers [36][37].
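New open-source work from Yi Ma's team at HKU and collaborators: rebuilding the visual self-supervised learning paradigm with coding rate regularization, "less is more"
QbitAI · 2025-03-08 03:35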
Core Viewpoint - The article discusses the introduction of SimDINO and SimDINOv2, two new visual pre-training models developed by a collaboration of researchers from various institutions, which simplify the training process of the existing DINO and DINOv2 models while enhancing their performance [1][5][12].
Group 1: Model Development
- SimDINO and SimDINOv2 are designed to address the complexities associated with DINO and DINOv2, which are currently leading models in visual pre-training [2][4].
- The new models utilize coding rate regularization to simplify the training process and improve robustness and performance [12][16].
- The core idea is to remove complex empirical design components from the original DINO and DINOv2 training processes, making the models easier to train and implement [12][18].
Group 2: Methodology
- The introduction of coding rate regularization helps prevent representation collapse, which was a significant issue in the original models [14][17].
- SimDINO retains the EMA self-distillation scheme and multi-view data augmentation from DINO but modifies the contrastive learning approach to use Euclidean distance or cosine similarity instead of high-dimensional projections [18][19].
- SimDINOv2 further simplifies the iBOT mechanism introduced in DINOv2, enhancing the model's efficiency [19].
Group 3: Experimental Validation
- Extensive experiments on various datasets, including ImageNet-1K, COCO val2017, and ADE20K, demonstrate that SimDINO and SimDINOv2 outperform the DINO series in terms of computational efficiency, training stability, and downstream task performance [22][23].
- In specific evaluations, SimDINO achieved a linear segmentation mIoU of 33.7 and mAcc of 42.8, while SimDINOv2 reached mIoU of 36.9 and mAcc of 46.5, showcasing significant improvements over DINO and DINOv2 [30].
Group 4: Theoretical Insights
- The research team proposes a theoretical framework for selecting hyperparameters in SimDINO, focusing on balancing the gradients of the coding rate regularization term and the distance term [33][34].
- This theoretical analysis provides a clearer optimization target and reduces the complexity of hyperparameter tuning, making the training process more straightforward [39].
Group 5: Future Directions
- The research team suggests potential improvements for SimDINO, including exploring self-supervised objectives that do not require self-distillation optimization [43].
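The summary names two ingredients in the SimDINO objective: a distance term between student and teacher features, and a coding rate regularizer that discourages representation collapse. The NumPy sketch below illustrates that combination only; it is not the authors' implementation. It assumes the standard coding rate from the rate reduction literature, R(Z) = 1/2 · logdet(I + d/(n·ε²) · ZᵀZ). The function names and the values of `eps` and `gamma` are hypothetical, chosen for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def coding_rate(z, eps=0.5):
    """Coding rate 1/2 * logdet(I + d / (n * eps^2) * Z^T Z).

    z has shape (n, d): n feature vectors of dimension d.
    eps is an assumed distortion parameter, not a value from the article.
    """
    n, d = z.shape
    gram = z.T @ z  # (d, d) second-moment matrix of the features
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet


def simplified_loss(student, teacher, gamma=1.0, eps=0.5):
    """Mean squared Euclidean distance minus gamma times the coding rate.

    gamma is a hypothetical weight balancing the two terms.
    """
    dist = np.mean(np.sum((student - teacher) ** 2, axis=1))
    return dist - gamma * coding_rate(student, eps)


rng = np.random.default_rng(0)
# Features spread over the unit sphere versus features that nearly coincide.
spread = l2_normalize(rng.normal(size=(64, 8)))
collapsed = l2_normalize(np.ones((64, 8)) + 1e-3 * rng.normal(size=(64, 8)))

rate_spread = coding_rate(spread)
rate_collapsed = coding_rate(collapsed)
print("coding rate, spread features:   ", rate_spread)
print("coding rate, collapsed features:", rate_collapsed)
```

In this toy setup the collapsed features score a much lower coding rate than the spread features. Subtracting the rate from the distance term therefore penalizes collapse even when student and teacher agree perfectly.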