Workflow
DINOv2
icon
Search documents
聊聊DreamVLA:让机器人先看后想再动
具身智能之心· 2025-08-11 00:14
作者丨 小红师兄 编辑丨具身智能之心 原文链接: https://zhuanlan.zhihu.com/p/1928781468743766758 点击下方 卡片 ,关注" 具身智能之心 "公众号 >> 点击进入→ 具身 智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要的。 最近读到一篇很不错的论文《DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge》,提出了一种新的视觉-语言-动作 (VLA)模型,名叫DreamVLA。这个模型的核心在于让机器人不仅能"看"图像、"听"指令,还能通过预测环境的动态、空间和语义信息,做出更精准的动作决 策。 背景:机器人为啥需要"想"得更多? 传统的VLA模型通常直接把视觉输入(比如摄像头拍的画面)和语言指令(比如"把杯子拿过来")映射到动作上。这种方法简单直接,但问题在于,画面里往往 有很多无关信息,机器人可能会被干扰,或者在复杂环境中反应不够灵活。比如,场景里可能有桌子、椅子、杂 ...
港大马毅团队等开源新作:用编码率正则化重构视觉自监督学习范式,“少即是多”
量子位· 2025-03-08 03:35
Core Viewpoint - The article discusses the introduction of SimDINO and SimDINOv2, two new visual pre-training models developed by a collaboration of researchers from various institutions, which simplify the training process of the existing DINO and DINOv2 models while enhancing their performance [1][5][12]. Group 1: Model Development - SimDINO and SimDINOv2 are designed to address the complexities associated with DINO and DINOv2, which are currently leading models in visual pre-training [2][4]. - The new models utilize coding rate regularization to simplify the training process and improve robustness and performance [12][16]. - The core idea is to remove complex empirical design components from the original DINO and DINOv2 training processes, making the models easier to train and implement [12][18]. Group 2: Methodology - The introduction of coding rate regularization helps prevent representation collapse, which was a significant issue in the original models [14][17]. - SimDINO retains the EMA self-distillation scheme and multi-view data augmentation from DINO but modifies the contrastive learning approach to use Euclidean distance or cosine similarity instead of high-dimensional projections [18][19]. - SimDINOv2 further simplifies the iBOT mechanism introduced in DINOv2, enhancing the model's efficiency [19]. Group 3: Experimental Validation - Extensive experiments on various datasets, including ImageNet-1K, COCO val2017, and ADE20K, demonstrate that SimDINO and SimDINOv2 outperform the DINO series in terms of computational efficiency, training stability, and downstream task performance [22][23]. - In specific evaluations, SimDINO achieved a linear segmentation mIoU of 33.7 and mAcc of 42.8, while SimDINOv2 reached mIoU of 36.9 and mAcc of 46.5, showcasing significant improvements over DINO and DINOv2 [30]. Group 4: Theoretical Insights - The research team proposes a theoretical framework for selecting hyperparameters in SimDINO, focusing on balancing the gradients of the coding rate regularization term and the distance term [33][34]. - This theoretical analysis provides a clearer optimization target and reduces the complexity of hyperparameter tuning, making the training process more straightforward [39]. Group 5: Future Directions - The research team suggests potential improvements for SimDINO, including exploring self-supervised objectives that do not require self-distillation optimization [43].