Core Insights

- The article covers the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, which addresses the data scarcity problem in robot learning by leveraging abundant human behavior videos from platforms such as YouTube and Douyin [4][10].

Group 1: CLAP Framework Overview

- CLAP aligns the motion space extracted from videos with the action space of robots, sidestepping the "visual entanglement" problem that besets existing latent action models [9][11]; an illustrative sketch of the alignment idea follows this summary.
- It adopts a unified Vision-Language-Action (VLA) framework that combines the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [14].

Group 2: Training Methodology

- The research team developed two VLA modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow based policy aimed at high-frequency, fine-grained control [10][16].
- A knowledge matching (KM) regularization strategy mitigates catastrophic forgetting during fine-tuning, so that robots retain previously learned skills while acquiring new ones [11][16].

Group 3: Experimental Results

- Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling effective skill transfer from human videos to robot execution [18].
- In real-world pick-and-place tasks, CLAP-NTP and CLAP-RF achieve success rates of 90% and 85%, respectively [20].
- Under environmental perturbations, CLAP-RF maintains a mean success rate of 66.7%, demonstrating its robustness [21].
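To make the alignment and regularization ideas above concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' released code: it pairs motion features from human video with robot action chunks under a symmetric contrastive (InfoNCE-style) objective, and shows one plausible way a knowledge-matching penalty against a frozen pretrained reference could be written. All module names, dimensions, and the KL-based KM formulation are assumptions for illustration only.

```python
# Hypothetical sketch of contrastive latent-action alignment plus a
# knowledge-matching (KM) style penalty. Architectures, dimensions, and
# loss choices are illustrative assumptions, not the CLAP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionAligner(nn.Module):
    def __init__(self, video_dim=512, robot_dim=64, latent_dim=128):
        super().__init__()
        # Projection heads mapping both modalities into a shared latent action space.
        self.video_head = nn.Sequential(nn.Linear(video_dim, 256), nn.GELU(),
                                        nn.Linear(256, latent_dim))
        self.robot_head = nn.Sequential(nn.Linear(robot_dim, 256), nn.GELU(),
                                        nn.Linear(256, latent_dim))
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, video_motion_feats, robot_actions):
        # L2-normalize so that similarity is cosine-based.
        z_v = F.normalize(self.video_head(video_motion_feats), dim=-1)
        z_r = F.normalize(self.robot_head(robot_actions), dim=-1)
        logits = self.logit_scale.exp() * z_v @ z_r.t()
        targets = torch.arange(z_v.size(0), device=z_v.device)
        # Symmetric InfoNCE: matched video/robot pairs attract, mismatched pairs repel.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

def km_regularizer(new_logits, ref_logits, temperature=2.0):
    """One plausible knowledge-matching penalty: keep the fine-tuned policy's
    output distribution close to a frozen pretrained reference to curb forgetting."""
    p_ref = F.softmax(ref_logits / temperature, dim=-1)
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    return F.kl_div(log_p_new, p_ref, reduction="batchmean") * temperature ** 2

if __name__ == "__main__":
    torch.manual_seed(0)
    aligner = LatentActionAligner()
    video_feats = torch.randn(8, 512)   # e.g. motion features from human video clips
    robot_acts = torch.randn(8, 64)     # paired robot action chunks
    print("contrastive loss:", aligner(video_feats, robot_acts).item())
    print("KM penalty:", km_regularizer(torch.randn(8, 10), torch.randn(8, 10)).item())
```

The design choice illustrated here is that a shared latent action space lets abundant, unannotated human video supervise the same representation that robot trajectories train, while a distillation-style penalty anchors fine-tuning to pretrained behavior; whether CLAP realizes KM this way is an assumption of this sketch.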
Breaking the Robot "Data Famine" Deadlock: Jinqiu Portfolio Company 星尘智能, Together with Tsinghua, MIT and Others, Releases the CLAP Framework | Jinqiu Spotlight
锦秋集 · 2026-01-21 15:36