CLAP Framework
Breaking the Robot "Data Famine" Deadlock: Jinqiu Portfolio Company Stardust Intelligence Joins Tsinghua, MIT, and Others to Release the CLAP Framework | Jinqiu Spotlight
锦秋集· 2026-01-21 15:36
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, which aims to address the data scarcity issue in robot learning by leveraging abundant human behavior videos from platforms like YouTube and Douyin [4][10].

Group 1: CLAP Framework Overview
- The CLAP framework aligns the motion space extracted from videos with the action space of robots, effectively avoiding the "visual entanglement" problem common to existing latent action models [9][11]; a contrastive-alignment code sketch follows this summary.
- It uses a unified Vision-Language-Action (VLA) framework that combines the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [14].

Group 2: Training Methodology
- The research team developed two VLA modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a policy based on Rectified Flow aimed at high-frequency, fine-grained control [10][16].
- A knowledge matching (KM) regularization strategy is introduced to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [11][16].

Group 3: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [18].
- In real-world pick-and-place tasks, CLAP-NTP and CLAP-RF achieve success rates of 90% and 85% respectively, indicating superior capabilities [20].
- Robustness evaluations show that CLAP-RF maintains a mean success rate of 66.7% under environmental perturbations, demonstrating its resilience [21].
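The "contrastive" part of CLAP can be pictured as an InfoNCE-style objective that pulls a video-derived latent action toward the embedding of the robot action recorded for the same state transition, while pushing mismatched pairs apart. The sketch below is a generic illustration under that assumption, written in PyTorch; the function name `clap_style_infonce`, the symmetric two-way loss, and the temperature value are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of a contrastive video-latent / robot-action alignment loss.
# Assumes paired batches: row i of `video_latents` corresponds to row i of
# `robot_actions`. Not the official CLAP objective, just the general idea.
import torch
import torch.nn.functional as F

def clap_style_infonce(video_latents: torch.Tensor,
                       robot_actions: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.

    video_latents: (B, D) embeddings of state transitions extracted from video.
    robot_actions: (B, D) embeddings of the corresponding robot actions.
    """
    v = F.normalize(video_latents, dim=-1)
    a = F.normalize(robot_actions, dim=-1)
    logits = v @ a.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Each video latent should match its paired action, and vice versa.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)
```

In practice the two inputs would come from separate encoders over video clips and robot trajectories; only the diagonal pairs in each batch act as positives, and all other combinations serve as negatives.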
Stardust Intelligence x Tsinghua x MIT Release the CLAP Framework! Letting Robots Learn Manipulation Skills by Watching Videos
具身智能之心· 2026-01-20 00:33
Recently, Stardust Intelligence, together with Tsinghua University, the University of Hong Kong, and MIT, proposed the Contrastive Latent Action Pretraining (CLAP) framework. CLAP aligns the motion space distilled from videos with the robot's action space; in other words, robots can learn skills directly from videos!

Paper: https://arxiv.org/abs/2601.04061

Robot learning has long been stuck with a vexing "data famine": the internet hosts hundreds of millions of human behavior videos, yet data collected specifically for training robots remains scarce. The root of this asymmetry is that gathering robot manipulation data requires expensive hardware, specialized operating environments, and extensive manual annotation, making it costly and inefficient. Human behavior videos, by contrast, are plentiful, but the wide semantic gap between visual representations and the robot action space makes them hard for traditional methods to exploit. Existing latent action ...
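As a rough mental model of what a "motion space distilled from videos" could look like, the sketch below encodes a pair of consecutive frames into a latent vector and snaps it to a discrete codebook entry, in the spirit of the quantifiable action codebook described in the summaries above. Everything here is an assumption for illustration (the tiny CNN backbone, the codebook size, the straight-through estimator), not the paper's architecture.

```python
# Hypothetical latent-action encoder: two stacked RGB frames -> latent vector
# -> nearest entry in a discrete "action codebook". Illustrative only.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    def __init__(self, latent_dim: int = 256, codebook_size: int = 512):
        super().__init__()
        # Tiny frame-pair encoder: 6 input channels = two stacked RGB frames.
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Each codebook row is one discrete latent-action token.
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor):
        z = self.backbone(torch.cat([frame_t, frame_t1], dim=1))  # (B, D)
        # Nearest-neighbour lookup quantizes the motion into a codebook entry.
        dists = torch.cdist(z, self.codebook.weight)              # (B, K)
        idx = dists.argmin(dim=-1)                                # (B,)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```

The discrete index is what makes the representation "quantifiable": downstream policies can treat it like a token, while the continuous embedding is what gets aligned with real robot actions.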
Letting Robots Learn Manipulation Skills by Watching Videos: The Newly Released CLAP Framework from Tsinghua and Collaborators Delivers
机器之心· 2026-01-19 03:51
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3].

Group 1: Challenges in Robot Learning
- The article highlights a long-standing issue in robot learning known as "data scarcity": there is an abundance of human behavior videos online but a shortage of data specifically for training robots [3].
- The root cause of this data asymmetry is the high cost and inefficiency of collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3].
- Traditional latent action models face the "visual entanglement" problem, where models learn irrelevant visual noise instead of actual manipulation skills [3].

Group 2: Innovations of the CLAP Framework
- The CLAP framework addresses the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding the visual entanglement issue [3].
- By using contrastive learning, CLAP maps state transitions in videos to a quantifiable, physically executable action codebook [3].
- The framework allows robots to learn skills from the vast amounts of video data available on platforms like YouTube and Douyin, significantly expanding the scale of usable training data [4].

Group 3: Training Methodology
- The research team trained the CLAP framework under two modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a policy based on Rectified Flow aimed at high-frequency, precise control [4][10]; a rectified-flow sketch follows this summary.
- The framework employs a knowledge matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [4][10].

Group 4: Practical Implications
- The long-term value of the CLAP framework lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time required to deploy robots in sectors such as services and manufacturing [6].
- The unified Vision-Language-Action (VLA) framework effectively integrates the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [8].

Group 5: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12].
- In real-world tasks, CLAP-NTP and CLAP-RF achieve higher success rates than baseline methods across a range of tasks, indicating the framework's robustness and effectiveness [14][15].
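CLAP-RF is described as a Rectified Flow policy for high-frequency, precise control. The sketch below shows the generic rectified-flow training objective such an action head would typically use: interpolate between Gaussian noise and the ground-truth action, then regress the constant straight-line velocity. Here `policy` is a hypothetical network conditioned on the observation; the conditioning, action chunking, and other CLAP-RF details are not reproduced.

```python
# Generic rectified-flow training step for an action head (not CLAP-RF's
# exact implementation). `policy(obs, x_t, t)` is a hypothetical velocity
# predictor; `action` is a batch of (B, action_dim) targets.
import torch
import torch.nn.functional as F

def rectified_flow_loss(policy, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Regress the straight-line velocity from noise to the target action."""
    noise = torch.randn_like(action)                          # x_0 ~ N(0, I)
    t = torch.rand(action.size(0), 1, device=action.device)   # t ~ U(0, 1)
    x_t = (1.0 - t) * noise + t * action                      # linear interpolation
    target_velocity = action - noise                          # constant along the line
    pred_velocity = policy(obs, x_t, t)
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference time, one integrates the learned velocity field from noise to an action with a handful of Euler steps; the near-straight trajectories of rectified flow are what make it attractive for high-frequency control loops.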