CLAP Framework
Breaking the robot "data famine" deadlock: Jinqiu portfolio company Stardust Intelligence joins Tsinghua, MIT, and others to release the CLAP framework | Jinqiu Spotlight
锦秋集· 2026-01-21 15:36
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, which aims to address the data scarcity issue in robot learning by leveraging abundant human behavior videos from platforms like YouTube and Douyin [4][10].

Group 1: CLAP Framework Overview
- The CLAP framework aligns the motion space extracted from videos with the action space of robots, effectively avoiding the "visual entanglement" problem common in existing latent action models [9][11] (a minimal contrastive-alignment sketch follows this summary).
- It adopts a unified Vision-Language-Action (VLA) framework that combines the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [14].

Group 2: Training Methodology
- The research team developed two VLA modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a policy based on Rectified Flow aimed at high-frequency, fine-grained control [10][16].
- A knowledge matching (KM) regularization strategy is introduced to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [11][16].

Group 3: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [18].
- In real-world pick-and-place tasks, CLAP-NTP and CLAP-RF achieve success rates of 90% and 85%, respectively, indicating superior capability [20].
- Robustness evaluations show that CLAP-RF maintains a mean success rate of 66.7% under environmental perturbations [21].
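The summaries describe CLAP's core move as contrastively aligning video state transitions with robot actions. The paper's actual architecture is not given here, so the following is a minimal, hypothetical sketch of that general idea using an InfoNCE-style objective; all module names, shapes, and the frame-feature inputs are illustrative assumptions, and the codebook quantization step the summaries mention is omitted for brevity.

```python
# Hypothetical sketch of contrastive latent-action alignment (not the paper's code).
# Assumed setup: batches where a video state transition (features of frames t and t+k)
# is paired with the robot action chunk that produced a matching transition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionEncoder(nn.Module):
    """Embeds a (frame_t, frame_t+k) feature pair into a motion latent."""
    def __init__(self, feat_dim=512, latent_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )

    def forward(self, feat_t, feat_tk):
        # Concatenating both frames lets the MLP model the *change* between them;
        # a real system would use a much stronger video backbone.
        return self.proj(torch.cat([feat_t, feat_tk], dim=-1))

class ActionEncoder(nn.Module):
    """Embeds a low-level robot action into the same latent space."""
    def __init__(self, action_dim=7, latent_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(action_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )

    def forward(self, actions):
        return self.proj(actions)

def info_nce(video_z, action_z, temperature=0.07):
    """Symmetric InfoNCE: matched (transition, action) pairs attract, others repel."""
    v = F.normalize(video_z, dim=-1)
    a = F.normalize(action_z, dim=-1)
    logits = v @ a.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Note the design pressure this setup creates: because positives are (transition, action) pairs rather than raw frames, the latent is rewarded for encoding motion and penalized for encoding appearance, which is one plausible reading of how a contrastive formulation sidesteps visual entanglement.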
Stardust Intelligence, Tsinghua, and MIT release the CLAP framework! Robots learn manipulation skills by watching videos
具身智能之心· 2026-01-20 00:33
Core Viewpoint
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework by Stardust Intelligence in collaboration with Tsinghua University, the University of Hong Kong, and MIT, which enables robots to learn skills directly from videos and addresses the long-standing data scarcity problem in robot training [2][4].

Summary by Sections

Introduction of the CLAP Framework
- The CLAP framework aligns the motion space extracted from videos with the action space of robots, allowing robots to learn skills from the abundant human behavior videos available online [3][4].

Challenges in Robot Learning
- Traditional robot learning faces a "data scarcity" problem: human behavior videos are abundant, but robot-specific training data is not, owing to the high costs and inefficiencies of collecting robot operation data [3].

Innovations of the CLAP Framework
- CLAP addresses the "visual entanglement" issue prevalent in existing latent action models by mapping state transitions from videos to a quantized, physically executable action codebook [4].
- The framework uses two modeling paradigms: CLAP-NTP, which excels at instruction following and object generalization, and CLAP-RF, which targets high-frequency, fine-grained control [4][8].

Efficiency and Cost-Effectiveness
- The CLAP framework significantly improves data utilization efficiency, allowing robots to learn from vast amounts of video content on platforms like YouTube and Douyin and lowering the barriers to acquiring robotic skills [4].

Knowledge Transfer and Model Performance
- CLAP incorporates a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [5] (a hedged sketch of this idea follows the summary).
- Experimental results indicate that CLAP outperforms strong baseline methods, effectively transferring skills learned from human videos to robot execution [12].

Industrial Application Prospects
- The long-term value of the CLAP framework lies in its potential to accelerate the industrialization of robotics by reducing costs and deployment times for businesses, which could lead to widespread applications in the service and manufacturing sectors [5].
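The summaries do not spell out the exact form of the Knowledge Matching term. A common way to implement this kind of anti-forgetting regularizer is to penalize drift from a frozen snapshot of the pretrained model during fine-tuning; the sketch below illustrates that generic pattern. The `finetune_step` helper, the assumed (actions, features) return signature, and `km_weight` are all hypothetical, not CLAP's actual API.

```python
# Hypothetical fine-tuning step with a knowledge-matching-style regularizer.
# CLAP's exact KM formulation is not given in the summaries; this uses a generic
# feature-matching penalty against a frozen copy of the pretrained model.
import copy
import torch
import torch.nn.functional as F

def finetune_step(model, pretrained, batch, optimizer, km_weight=0.1):
    obs, target_actions = batch                # robot observations and expert actions
    pred, feats = model(obs)                   # assumed to return (actions, features)
    with torch.no_grad():
        _, feats_ref = pretrained(obs)         # frozen pretrained features
    task_loss = F.mse_loss(pred, target_actions)   # imitation objective on new data
    km_loss = F.mse_loss(feats, feats_ref)         # stay close to pretrained knowledge
    loss = task_loss + km_weight * km_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: snapshot and freeze the pretrained model once, before fine-tuning begins.
# pretrained = copy.deepcopy(model).eval()
# for p in pretrained.parameters():
#     p.requires_grad_(False)
```

The trade-off `km_weight` controls is the usual one for such regularizers: too small and old skills erode, too large and the model cannot fit the new robot data.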
Robots that learn manipulation skills by watching videos: the newly released CLAP framework from Tsinghua and collaborators makes it possible
机器之心· 2026-01-19 03:51
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3].

Group 1: Challenges in Robot Learning
- The article highlights a long-standing issue in robot learning known as "data scarcity": human behavior videos are abundant online, but data suitable for training robots is not [3].
- The root cause of this data asymmetry is the high cost and inefficiency of collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3].
- Traditional latent action models face the "visual entanglement" problem, in which models learn irrelevant visual noise instead of actual manipulation skills [3].

Group 2: Innovations of the CLAP Framework
- The CLAP framework addresses the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding the visual entanglement issue [3].
- Using contrastive learning, CLAP maps state transitions in videos to a quantized, physically executable action codebook [3].
- The framework allows robots to learn skills from the vast amounts of video data available on platforms like YouTube and Douyin, significantly expanding the scale of usable training data [4].

Group 3: Training Methodology
- The research team trained the CLAP framework under two modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a policy based on Rectified Flow aimed at high-frequency, precise control [4][10] (a Rectified Flow sketch follows this summary).
- The framework employs a knowledge matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [4][10].

Group 4: Practical Implications
- The long-term value of the CLAP framework lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time required to deploy robots in sectors such as services and manufacturing [6].
- The unified Vision-Language-Action (VLA) framework effectively integrates the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [8].

Group 5: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12].
- Performance comparisons in real-world tasks show that CLAP-NTP and CLAP-RF achieve higher success rates across a range of tasks than baseline methods, indicating the framework's robustness and effectiveness [14][15].
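Rectified Flow itself is a published generative-modeling technique: it learns a velocity field along straight paths from noise to data, so samples can be drawn in a few integration steps, which is presumably why it suits high-frequency control. The sketch below shows the generic recipe, not CLAP-RF's actual architecture; `FlowPolicy`, its dimensions, and the Euler sampler are illustrative assumptions.

```python
# Hypothetical Rectified Flow action head (not the paper's implementation).
import torch
import torch.nn as nn

class FlowPolicy(nn.Module):
    """Predicts a velocity over a chunk of future actions, conditioned on obs and time t."""
    def __init__(self, obs_dim=512, action_dim=7, horizon=16, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = obs_dim + horizon * action_dim + 1        # obs + noisy actions + t
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, obs, x_t, t):
        flat = torch.cat([obs, x_t.flatten(1), t[:, None]], dim=-1)
        return self.net(flat).view(-1, self.horizon, self.action_dim)

def rf_loss(policy, obs, actions):
    """Regress the velocity along the straight line from noise x0 to data x1."""
    x1 = actions                                   # (B, horizon, action_dim)
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), device=x1.device)
    x_t = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_target = x1 - x0                             # constant velocity on a straight path
    return ((policy.velocity(obs, x_t, t) - v_target) ** 2).mean()

@torch.no_grad()
def sample_actions(policy, obs, steps=8):
    """Euler integration of the learned ODE from noise to an action chunk."""
    x = torch.randn(obs.size(0), policy.horizon, policy.action_dim, device=obs.device)
    for i in range(steps):
        t = torch.full((obs.size(0),), i / steps, device=obs.device)
        x = x + policy.velocity(obs, x, t) / steps
    return x
```

Because the target paths are straight, even a small number of Euler steps yields usable action chunks, in contrast to the many denoising steps a standard diffusion policy needs; this is the property that makes a flow-based head plausible for the high-frequency control role the summaries assign to CLAP-RF.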