Knowledge Transfer
Letting Robots Learn Manipulation Skills by Watching Videos: the Newly Released CLAP Framework from Tsinghua and Partners Delivers
机器之心· 2026-01-19 03:51
Core Insights
- The article introduces the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3]

Group 1: Challenges in Robot Learning
- The article highlights a long-standing problem in robot learning known as "data scarcity": human behavior videos are abundant online, but data for training robots is not [3]
- The root cause of this data asymmetry is the high cost and inefficiency of collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3]
- Traditional latent action models suffer from the "visual entanglement" problem, learning irrelevant visual noise instead of actual manipulation skills [3]

Group 2: Innovations of the CLAP Framework
- CLAP addresses the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding visual entanglement [3]
- Using contrastive learning, CLAP maps state transitions in videos to a quantized, physically executable action codebook [3]
- The framework lets robots learn skills from the vast amount of video data on platforms like YouTube and Douyin, greatly expanding the scale of usable training data [4]

Group 3: Training Methodology
- The research team trained CLAP under two modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy aimed at high-frequency, precise control [4][10]
- The framework employs a knowledge matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [4][10]
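The pipeline in Group 2 can be sketched as a toy in NumPy: a transition encoder, a vector-quantization step against a discrete action codebook, and an InfoNCE-style contrastive loss that pulls each video latent toward its paired action latent. All dimensions, the encoder, and the loss details are illustrative assumptions, not CLAP's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper).
LATENT_DIM, CODEBOOK_SIZE = 8, 16

# A codebook of discrete, physically executable action codes.
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def encode_transition(s_t, s_t1, W):
    """Toy encoder: embed a (state, next-state) transition into latent space."""
    return W @ np.concatenate([s_t, s_t1])

def quantize(z):
    """Map a continuous latent to its nearest codebook entry (VQ step)."""
    idx = np.argmin(np.linalg.norm(codebook - z, axis=1))
    return idx, codebook[idx]

def info_nce(video_z, action_z, temperature=0.1):
    """InfoNCE loss: each video latent should match its paired action latent
    (diagonal) against all other pairs in the batch (off-diagonal)."""
    v = video_z / np.linalg.norm(video_z, axis=1, keepdims=True)
    a = action_z / np.linalg.norm(action_z, axis=1, keepdims=True)
    logits = (v @ a.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 paired (video transition, robot action) latents.
W = rng.normal(size=(LATENT_DIM, 12)) * 0.1
video_z = np.stack([encode_transition(rng.normal(size=6), rng.normal(size=6), W)
                    for _ in range(4)])
action_z = video_z + 0.01 * rng.normal(size=video_z.shape)  # near-aligned pairs

idx, code = quantize(video_z[0])
loss = info_nce(video_z, action_z)
```

With well-aligned pairs the contrastive loss falls well below chance level (log of the batch size), which is the signal that drives the video and action latents into a shared space.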
Group 4: Practical Implications
- The long-term value of CLAP lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time of deploying robots in sectors such as services and manufacturing [6]
- The unified vision-language-action (VLA) framework effectively integrates the precision of robot-collected data with the semantic diversity of large-scale unannotated human video demonstrations [8]

Group 5: Experimental Results
- Extensive experiments show that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12]
- In real-world tasks, CLAP-NTP and CLAP-RF achieve higher success rates than baseline methods across a variety of tasks, indicating the framework's robustness and effectiveness [14][15]
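The knowledge-matching regularization mentioned in Group 3 can be sketched as a penalty that keeps the fine-tuned model close to the pretrained model's outputs on old inputs. The linear "policy", the data, and the weight λ below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear "policy": action = W @ observation.
W_pre = rng.normal(size=(2, 4))          # frozen pretrained weights
W = W_pre.copy()                          # weights being fine-tuned

X_new = rng.normal(size=(32, 4))          # new-task observations
Y_new = X_new @ rng.normal(size=(4, 2))   # new-task target actions
X_old = rng.normal(size=(32, 4))          # observations probing old skills

lam, lr = 0.5, 0.05
for _ in range(200):
    # Task loss: fit the new task.
    err_task = X_new @ W.T - Y_new
    g_task = 2 * err_task.T @ X_new / len(X_new)
    # Knowledge-matching loss: stay close to the pretrained model's
    # outputs on old inputs, mitigating catastrophic forgetting.
    err_km = X_old @ W.T - X_old @ W_pre.T
    g_km = 2 * err_km.T @ X_old / len(X_old)
    W -= lr * (g_task + lam * g_km)

drift = np.mean((X_old @ W.T - X_old @ W_pre.T) ** 2)
task_loss = np.mean((X_new @ W.T - Y_new) ** 2)

# Baseline without the KM term, for comparison.
W_b = W_pre.copy()
for _ in range(200):
    err = X_new @ W_b.T - Y_new
    W_b -= lr * 2 * err.T @ X_new / len(X_new)
drift_baseline = np.mean((X_old @ W_b.T - X_old @ W_pre.T) ** 2)
```

The regularized run improves on the new task while drifting far less from the pretrained behavior than unconstrained fine-tuning, which is the trade-off such a regularizer buys.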
Meta Details Its GEM Ads Model: LLM-Scale Training, Hybrid Parallel Computing, and Knowledge Transfer
AI前线· 2025-12-28 05:33
Core Insights
- Meta has released detailed information about its Generative Advertising Model (GEM), aimed at improving ad recommendation on its platforms by processing billions of user-ad interactions daily [2]
- The model addresses the core challenge in recommendation systems: the sparsity of meaningful signals such as clicks and conversions [2]
- GEM is designed to learn from diverse advertising data, including advertiser goals, creative formats, measurement signals, and user behavior across multiple channels [2]

Model Architecture and Training
- Meta redesigned its training architecture to support GEM at a scale comparable to modern large language models, applying customized multi-dimensional parallelism strategies to different model components [4]
- Dense model components use Hybrid Sharded Data Parallel (HSDP) to optimize memory usage and reduce communication overhead, while sparse components use a two-dimensional scheme combining data and model parallelism [4]
- Several GPU-level optimizations reduce training bottlenecks, including custom GPU kernels for variable-length user sequences and memory compression techniques [4]

Efficiency and Knowledge Transfer
- The system continuously optimizes GPU efficiency throughout the model lifecycle, with lightweight model variants supporting over half of all experiments at lower cost [5]
- Meta employs two transfer strategies to turn the foundation model's capabilities into measurable benefits for user-facing vertical models: direct transfer and hierarchical transfer [5][6]
- These methods maximize transfer efficiency within Meta's advertising model ecosystem through knowledge distillation, representation learning, and parameter sharing [6]

Industry Impact and Future Prospects
- GEM's effective floating-point operation performance has improved 23x, which is seen as a key factor in changing the economics of training [8]
- The technology is viewed as a game changer for advertisers, potentially saving small businesses significant amounts of money by relying on intelligent models to optimize ad spending [9]
- Meta envisions the foundational ad-recommendation model evolving to better understand user preferences and intentions, enabling more personalized interactions between users and ads [10]
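The knowledge-distillation route mentioned under "Efficiency and Knowledge Transfer" can be sketched as training a small vertical model on the foundation model's soft predictions. The two-class click model, the data, and the learning rate below are illustrative assumptions, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy "foundation" teacher: a fixed linear scorer over 2 classes (click / no-click).
W_teacher = rng.normal(size=(2, 8))
X = rng.normal(size=(256, 8))                 # user-ad feature vectors
teacher_probs = softmax(X @ W_teacher.T)      # soft targets (distillation signal)

# Small student "vertical" model trained to match the teacher's distribution.
W_student = np.zeros((2, 8))
lr = 0.5
for _ in range(300):
    student_probs = softmax(X @ W_student.T)
    # Gradient of cross-entropy against soft targets: (p_student - p_teacher) x.
    g = (student_probs - teacher_probs).T @ X / len(X)
    W_student -= lr * g

# KL(teacher || student): how much of the teacher's knowledge transferred.
student_probs = softmax(X @ W_student.T)
kl = np.mean(np.sum(teacher_probs * np.log(teacher_probs / student_probs), axis=1))
kl_uniform = np.mean(np.sum(teacher_probs * np.log(teacher_probs / 0.5), axis=1))
```

After training, the student's divergence from the teacher is a fraction of what an uninformed (uniform) predictor shows, which is the sense in which the vertical model inherits the foundation model's behavior.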
FDA Dual Anchors: A New Perspective on Model Knowledge Transfer, from Parameter Space to Input Space
机器之心· 2025-11-14 01:33
Core Insights
- The article introduces FDA (Model Merging with Functional Dual Anchors), a novel framework for merging expert models obtained by task-specific fine-tuning of a shared foundation model, integrating their capabilities into a single model without access to the original training data [2][4]

Group 1: FDA Framework Overview
- FDA represents the task knowledge embedded in model parameters with a set of dual synthetic input points, enabling efficient knowledge integration through the gradients these points induce in input space [4][10]
- Unlike traditional methods that rely on arithmetic operations in parameter space, FDA shifts the knowledge-integration process to the input space, providing a new perspective on model merging [4][9]
- The framework scales to large neural networks, demonstrating superior performance and scalability on both vision and natural language models [4][12]

Group 2: Performance and Robustness
- Experimentally, FDA significantly outperforms traditional task-vector methods, averaging 87.26 in multi-task scenarios versus 73.94 for task vectors, a relative improvement of nearly 18% [14]
- FDA exhibits flexible knowledge modeling, with average gains of approximately 5.10% on ViT-B/16 and about 13% on RoBERTa-Large, showing adaptability across architectures [15]

Group 3: Algorithm Implementation
- The FDA algorithm has two main phases: constructing FDA samples for each downstream task, then updating parameters based on the constructed FDA [17][19]
- Two practical initialization strategies are proposed for FDA construction: linear weight sampling and scaled Gaussian sampling, which yield effective starting points for optimization [18]
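The two phases in Group 3 can be sketched as a heavily simplified toy: linear "experts", anchors drawn by Gaussian sampling, and each anchor's target taken as the expert's own output, so that merging proceeds by gradient steps on the anchors rather than by adding task vectors in parameter space. None of this mirrors the paper's exact construction; it only illustrates the input-space idea.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 6

W0 = rng.normal(size=(D, D)) * 0.1               # shared pretrained model
task_vecs = [rng.normal(size=(D, D)) * 0.1 for _ in range(2)]
experts = [W0 + dW for dW in task_vecs]          # task-specific fine-tuned experts

# Phase 1: construct functional anchors per task.
# Gaussian-sampled synthetic inputs; each anchor's target is the expert's
# own output, so the anchors encode the task in input space.
anchors = []
for W_ft in experts:
    X = rng.normal(size=(64, D))                 # synthetic input points
    Y = X @ W_ft.T                               # expert behavior at the anchors
    anchors.append((X, Y))

# Phase 2: update the merged model with the gradients the anchors induce,
# instead of doing arithmetic on task vectors in parameter space.
W = W0.copy()
lr = 0.05
for _ in range(400):
    g = np.zeros_like(W)
    for X, Y in anchors:
        g += 2 * (X @ W.T - Y).T @ X / len(X)
    W -= lr * g / len(anchors)

# The merged model should now track each expert on its own task anchors.
errs = [np.mean((X @ W.T - Y) ** 2) for X, Y in anchors]
base_errs = [np.mean((X @ W0.T - Y) ** 2) for X, Y in anchors]
```

In this linear toy the merged model ends up between the experts, reducing the total functional error on both tasks relative to the untouched pretrained weights.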
Group 4: Knowledge Encoding and Mechanisms
- FDA captures the dominant task-related representation directions while suppressing redundant or noisy components, consistent with the low-rank structure typically observed in task-specific knowledge in parameter space [22]
- During optimization, FDA's high-energy subspace aligns with the high-energy subspace of the real data, indicating a connection between the knowledge encoded in FDA and the actual task data [23]
- The parameter updates induced by FDA gradually align with those induced by real data, demonstrating its robustness and effectiveness in capturing task-related knowledge [24]
25 Lessons from AI Godfather Hinton at the World Artificial Intelligence Conference
36Kr· 2025-07-29 23:58
Core Insights
- Geoffrey Hinton, a prominent figure in AI, discussed the evolution of AI from symbolic reasoning to neural networks at WAIC 2025, emphasizing how large language models (LLMs) come to understand language [1][2][10]

Group 1: Evolution of AI Understanding
- For over 60 years there have been two paradigms in AI: the logic-inspired paradigm, centered on symbolic reasoning, and the biological paradigm, centered on learning in neural networks [1]
- Hinton's early model in 1985 aimed to bridge the two by predicting the next word from learned features, laying the groundwork for modern LLMs [2]
- LLMs have evolved from Hinton's initial models into more complex architectures capable of processing vast inputs and modeling intricate relationships [2][3]

Group 2: Mechanism of Language Understanding
- LLMs and humans understand language in similar ways: both convert language into features and integrate them across neural network layers into semantic comprehension [3]
- Hinton uses the analogy of LEGO blocks to describe how words combine into complex semantic structures, highlighting the flexible nature of language [3][4]
- Understanding language is compared to deconstructing a protein molecule rather than producing a clear logical expression [3]

Group 3: Knowledge Transfer and Collaboration
- Human knowledge transfer is inefficient, relying on explanation, whereas digital intelligences can share vast amounts of information directly [5][6]
- Current technology allows efficient knowledge transfer and collaborative learning across different hardware setups, enhancing the capabilities of models like GPT-4 [6][7]
- If independent intelligent agents can share weights and gradients, they can directly exchange what each has learned, yielding significant advances [6][7]

Group 4: AI's Future and Global Cooperation
- Hinton warns of the potential dangers of AI surpassing human intelligence, emphasizing the need for control and ethical considerations in AI development [7][10]
- He highlights the necessity of global cooperation on AI governance, calling for an international organization to ensure AI develops positively [8][9]
- Hinton believes that keeping AI beneficial to humanity is one of the most critical issues of the era, requiring collective effort [9][10]
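The weight-sharing idea in Group 3 can be sketched with a federated-averaging-style toy: two copies of one linear model learn from different data slices, then exchange knowledge by directly averaging their weights, with no lossy natural-language explanation in between. The setup is an illustrative assumption, not anything Hinton specified.

```python
import numpy as np

rng = np.random.default_rng(4)

# One shared "ground truth" both agents are trying to learn.
W_true = rng.normal(size=(3, 5))

def make_shard(n):
    """Each agent's private slice of experience."""
    X = rng.normal(size=(n, 5))
    return X, X @ W_true.T

def train(W, X, Y, lr=0.05, steps=100):
    """Plain gradient descent on squared error."""
    for _ in range(steps):
        W = W - lr * 2 * (X @ W.T - Y).T @ X / len(X)
    return W

# Identical architectures, different experience: each agent sees its own shard.
shard_a, shard_b = make_shard(20), make_shard(20)
W_a = train(np.zeros((3, 5)), *shard_a)
W_b = train(np.zeros((3, 5)), *shard_b)

# Digital knowledge exchange: copy/average the weights directly.
W_shared = (W_a + W_b) / 2

X_test = rng.normal(size=(200, 5))            # held-out probe data

def err(W):
    return np.mean((X_test @ W.T - X_test @ W_true.T) ** 2)

e_a, e_b, e_shared = err(W_a), err(W_b), err(W_shared)
```

Because both agents share the same architecture, averaging is a valid exchange; this is exactly what is impossible between two human brains, whose "weights" cannot be copied.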
25 Lessons from AI Godfather Hinton at the World Artificial Intelligence Conference
混沌学园· 2025-07-29 12:04
Core Viewpoint
- The article presents Geoffrey Hinton's insights on the relationship between AI and human intelligence, emphasizing AI's evolution from symbolic reasoning to large language models (LLMs) and the implications of AI surpassing human intelligence [1][10]

Group 1: Evolution of AI Understanding
- For over 60 years there have been two distinct paradigms in AI: the logical inference paradigm, which views intelligence as symbolic reasoning, and the biological paradigm, which roots intelligence in understanding and learning through neural networks [1]
- In 1985, Hinton created a small model to explore how humans understand vocabulary, linking word features to predict the next word without storing entire sentences [2]
- The development of LLMs is a continuation of Hinton's early work, processing more input words and using more complex neural structures to build richer interactions [3]

Group 2: Mechanism of Language Understanding
- LLM and human language-understanding mechanisms are highly similar: both transform language into features and integrate them across neural network layers into semantic understanding [4]
- Each word is likened to a multi-dimensional Lego block that combines flexibly into complex semantic structures, its shape adapting to context [6]
- Understanding a sentence is compared to deconstructing a protein molecule rather than converting it into a clear, unambiguous logical expression [5]

Group 3: Knowledge Transfer in AI
- The human brain runs on only about 30 watts yet cannot easily transfer knowledge to another person, relying instead on explanation [11]
- Digital intelligence, in contrast, transfers knowledge efficiently, directly copying parameters and structures without intermediary language and sharing trillions of bits per synchronization [13][14]
- Current technology enables the same model to be deployed across different hardware, facilitating efficient knowledge transfer and collaborative learning [15]

Group 4: The Dangers of Advanced AI
- There is a concern that AI could surpass human intelligence and become an active system with its own goals, potentially manipulating humans [18][19]
- Hinton warns that developing AI is akin to raising a tiger: once it grows powerful, losing control could be fatal [20]
- Despite the risks, AI holds significant value in many fields, so eliminating it is not feasible; instead, a way must be found to ensure AI does not threaten humanity [21]

Group 5: Global Cooperation for AI Safety
- No single country wants AI to dominate the world; if one country discovers a method to prevent AI from going rogue, others will likely follow suit [22][23]
- Hinton proposes establishing an international AI safety organization to research the technology and create standards that ensure AI develops positively [24]
- The long-term challenge is to keep AI a supportive tool for humanity rather than its ruler, a critical issue demanding global collaboration [25]