Contrastive Learning
Breaking the Robot "Data Famine" Deadlock: Jinqiu Portfolio Company Stardust Intelligence, with Tsinghua, MIT, and Others, Releases the CLAP Framework | Jinqiu Spotlight
锦秋集 · 2026-01-21 15:36
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, which aims to address the data-scarcity problem in robot learning by leveraging the abundance of human behavior videos on platforms such as YouTube and Douyin [4][10].

Group 1: CLAP Framework Overview
- The CLAP framework aligns the motion space extracted from videos with the action space of robots, effectively avoiding the "visual entanglement" problem that plagues existing latent action models (a hedged sketch of such an alignment follows this summary) [9][11].
- It adopts a unified Vision-Language-Action (VLA) framework that combines the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [14].

Group 2: Training Methodology
- The research team developed two VLA modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy aimed at high-frequency, fine-grained control [10][16].
- A knowledge matching (KM) regularization strategy is introduced to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [11][16].

Group 3: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [18].
- In real-world pick-and-place tasks, CLAP-NTP and CLAP-RF achieve success rates of 90% and 85% respectively, indicating superior capabilities [20].
- Robustness evaluations show that CLAP-RF maintains a mean success rate of 66.7% under environmental perturbations, demonstrating its resilience [21].
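The alignment at the heart of CLAP pairs video state transitions with robot actions. As a rough illustration of what a contrastive alignment of the two spaces could look like, here is a minimal symmetric InfoNCE sketch in PyTorch; the function and tensor names are hypothetical, and the paper's actual loss and architecture may differ:

```python
import torch
import torch.nn.functional as F

def video_action_alignment_loss(motion_emb, action_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    motion_emb: (B, D) embeddings of state transitions extracted from video
    action_emb: (B, D) embeddings of the robot actions for those transitions
    """
    v = F.normalize(motion_emb, dim=-1)
    a = F.normalize(action_emb, dim=-1)
    logits = v @ a.t() / temperature                 # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; every other entry is a negative,
    # which pressures the encoder to represent motion rather than
    # appearance (background, embodiment), countering visual entanglement.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```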
Robots That Learn Manipulation Skills by Watching Videos: The Newly Released CLAP Framework from Tsinghua and Collaborators Makes It Happen
机器之心 · 2026-01-19 03:51
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3].

Group 1: Challenges in Robot Learning
- The article highlights a long-standing issue in robot learning known as "data scarcity": human behavior videos are abundant online, but data suited to training robots is not [3].
- The root cause of this data asymmetry is the high cost and inefficiency of collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3].
- Traditional latent action models face the "visual entanglement" problem, in which models learn irrelevant visual noise instead of actual manipulation skills [3].

Group 2: Innovations of the CLAP Framework
- The CLAP framework resolves the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding the visual entanglement issue [3].
- Using contrastive learning, CLAP maps state transitions in videos to a quantifiable, physically executable action codebook [3].
- The framework lets robots learn skills from the vast video data available on platforms such as YouTube and Douyin, greatly expanding the scale of usable training data [4].

Group 3: Training Methodology
- The research team trained the CLAP framework with two modeling paradigms: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy aimed at high-frequency, precise control [4][10].
- The framework employs a knowledge matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones (sketched after this summary) [4][10].

Group 4: Practical Implications
- The long-term value of the CLAP framework lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time of deploying robots in sectors such as services and manufacturing [6].
- The unified Vision-Language-Action (VLA) framework effectively integrates the precision of robot data with the semantic diversity of large-scale unannotated human video demonstrations [8].

Group 5: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12].
- Real-world comparisons show that CLAP-NTP and CLAP-RF achieve higher success rates than baseline methods across tasks, indicating the framework's robustness and effectiveness [14][15].
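Both summaries mention the knowledge matching (KM) regularizer only at a high level. One standard way to realize such an anti-forgetting constraint is to distill against a frozen snapshot of the pretrained model during fine-tuning; the sketch below assumes that formulation, which is not necessarily CLAP's exact one:

```python
import copy
import torch
import torch.nn.functional as F

def knowledge_matching_loss(model, frozen_ref, inputs):
    """Penalize divergence from a frozen pretrained snapshot on the same
    inputs, so new skills are learned without overwriting old ones."""
    with torch.no_grad():
        ref_logits = frozen_ref(inputs)        # pretrained behavior, fixed
    logits = model(inputs)
    # KL(ref || current): anchors the fine-tuned output distribution to
    # the pretrained one, a distillation-style guard against forgetting.
    return F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(ref_logits, dim=-1),
                    reduction="batchmean")

# frozen_ref = copy.deepcopy(model).eval(), taken before fine-tuning;
# total loss: task_loss + lambda_km * knowledge_matching_loss(...)
```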
Nature Sub-journal: Wang Shanshan and Zhang Kang Jointly Develop a New AI Model That Finds Lesions Autonomously, with No Manual Annotation by Doctors
生物世界 · 2026-01-10 03:06
Core Viewpoint
- The research introduces a novel multimodal vision-language model named AFLoc (Annotation-Free pathology Localization), which automatically localizes pathologies in medical images without prior annotations from doctors, showing strong generalization that surpasses human benchmarks on various pathology imaging tasks [4][9].

Group 1
- AFLoc is designed to perform pathology localization without annotations, reducing dependence on expert input (a hedged inference-side sketch follows this summary) [4][10].
- The model uses a contrastive learning approach based on a multi-level semantic structure, aligning diverse medical concepts with rich image features to adapt to the varied manifestations of pathologies [7][9].
- Initial experiments on a dataset of 220,000 chest X-ray image-report pairs show that AFLoc outperforms current state-of-the-art methods on both annotation-free localization and classification tasks [9].

Group 2
- The research validates AFLoc's generalization across modalities, including histopathology and retinal fundus images, indicating robustness in diverse clinical environments [9][10].
- The findings highlight AFLoc's potential to lower annotation requirements and adapt to complex clinical applications, a significant advance in medical imaging [10].
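Annotation-free localization with a contrastively trained vision-language model typically reduces, at inference time, to reading off image-text similarity per region. A minimal sketch under that assumption (the tensor shapes, the 16x patch stride, and the phrase are illustrative; AFLoc's multi-level alignment is richer than this):

```python
import torch
import torch.nn.functional as F

def localization_heatmap(patch_feats, text_emb, patch_stride=16):
    """Score each image patch against a pathology phrase embedding.

    patch_feats: (H, W, D) patch embeddings from the image encoder
    text_emb:    (D,) embedding of a phrase such as "pleural effusion"
    """
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = torch.einsum("hwd,d->hw", p, t)            # per-patch similarity
    # Upsample to image resolution; high-similarity regions are the
    # predicted lesion locations, with no box or mask labels ever used.
    return F.interpolate(sim[None, None], scale_factor=patch_stride,
                         mode="bilinear", align_corners=False)[0, 0]
```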
DiffusionDriveV2 Core Code Walkthrough
自动驾驶之心 · 2025-12-28 09:23
Core Viewpoint
- The article dissects the DiffusionDrive model, which uses a truncated diffusion approach for end-to-end autonomous driving, covering its architecture and the integration of reinforcement learning to improve trajectory planning and safety [1].

Group 1: Model Architecture
- DiffusionDriveV2 employs a reinforcement learning-constrained truncated diffusion model as its overall architecture for autonomous driving [3].
- The model encodes the environment, including bird's-eye view (BEV) features and vehicle status, to improve its understanding of the driving context [5].
- The trajectory planning module uses multi-scale BEV features to improve the accuracy of trajectory predictions [8].

Group 2: Trajectory Generation
- The model generates trajectories by first clustering the ground-truth future trajectories of the vehicle with K-Means to create anchors, which are then perturbed with Gaussian noise [12].
- Trajectory prediction applies cross-attention between trajectory features and BEV features, allowing more accurate trajectory generation [15][17].
- The model also integrates time encoding to strengthen the temporal aspect of trajectory predictions [14].

Group 3: Reinforcement Learning Integration
- The Intra-Anchor GRPO method is proposed to optimize the policy within each specific behavior intention, enhancing safety and goal-oriented trajectory generation [27].
- The reinforcement learning loss function is designed to mitigate instability during early denoising steps, using a discount factor to adjust the influence of rewards over time [28].
- The model sharpens the learning signal by truncating negative advantages and applying strong penalties for collisions, steering outputs toward safer trajectories [30].

Group 4: Noise Management
- The model introduces multiplicative rather than additive noise to preserve the structural integrity of trajectories, yielding smoother exploration paths (a hedged sketch of this and the advantage truncation follows this summary) [33].
- This choice addresses the inherent scale inconsistencies across trajectory segments, allowing more coherent and realistic trajectory generation [35].

Group 5: Evaluation Metrics
- Generated trajectories are scored on safety, comfort, rule compliance, progress, and feasibility, aggregated into a comprehensive score [27].
- Specific metrics assess safety (collision detection), comfort (acceleration and curvature), and adherence to traffic rules, giving a holistic evaluation of trajectory performance [27].
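Two of the design choices above are easy to make concrete: multiplicative anchor noise and negative-advantage truncation. The sketch below is a plausible reading of those descriptions, not the repository's actual code; names and shapes are assumed:

```python
import torch

def perturb_anchors(anchors, sigma=0.1):
    """Multiplicative noise on K-Means trajectory anchors.

    anchors: (K, T, 2) anchor trajectories (x, y per future timestep).
    Scaling each waypoint by (1 + eps) keeps the perturbation proportional
    to the local displacement, so short and long segments are distorted
    comparably and the trajectory's shape survives, which additive noise
    of a fixed scale would not guarantee.
    """
    eps = torch.randn_like(anchors) * sigma
    return anchors * (1.0 + eps)

def truncated_advantage(rewards):
    """Group-relative advantage with negatives clipped to zero, so only
    better-than-average trajectories push the policy; collisions are
    assumed to carry large negative rewards upstream of this step."""
    adv = rewards - rewards.mean()
    return adv.clamp(min=0.0)
```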
The Embedding Black Box Is History! A New Framework Has Models "Explain First, Then Learn the Embedding"
量子位 · 2025-10-21 09:05
Core Insights
- The article introduces GRACE, a new explainable generative embedding framework developed by researchers from multiple universities, aimed at addressing the limitations of traditional text embedding models [1][6].

Group 1: Background and Limitations
- Text embedding models have evolved from BERT to various newer models, mapping text into vector spaces for tasks like semantic retrieval and clustering [3].
- A common flaw in these models is treating large language models as "mute encoders" that output vectors without explaining why two texts are similar [4].
- This black-box representation becomes a bottleneck in tasks requiring high interpretability and robustness, such as question-answer matching and cross-domain retrieval [5].

Group 2: GRACE Framework Overview
- GRACE transforms "contrastive learning" into "reinforcement learning," redefining the meaning of contrastive learning signals [6].
- The framework has the model generate explanations (rationales) for a text before learning its embedding, producing logical and semantically consistent reasoning [7][25].
- GRACE consists of three key modules:
  1. Rationale-Generating Policy, which generates explanatory reasoning chains for input texts [8].
  2. Representation Extraction, which combines the input and its rationale to compute the final embedding [9].
  3. Contrastive Rewards, which recasts the contrastive learning objective as a reward function for reinforcement learning updates (a hedged sketch follows this summary) [11].

Group 3: Training Process
- GRACE can be trained in both supervised and unsupervised settings, using labeled query-document pairs and self-alignment techniques [12][18].
- In the supervised phase, the model learns semantic relationships from a dataset of 1.5 million samples [13].
- The unsupervised phase generates multiple rationales for each text, encouraging consistent representations across different explanations [17].

Group 4: Experimental Results
- GRACE was evaluated across 56 datasets spanning various tasks, showing significant performance improvements over baseline models in retrieval, pair classification, and clustering [19][20].
- The results indicate that GRACE enhances embedding capability without sacrificing generative ability, while producing transparent representations that users can inspect [25][27].

Group 5: Conclusion
- Overall, GRACE represents a paradigm shift in embedding models, moving toward a framework that can explain its own understanding process, improving both performance and interpretability [28].
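The Contrastive Rewards module recasts the usual InfoNCE objective as a per-sample reward for the rationale-generating policy. A hedged sketch of one way to score a sampled rationale (hypothetical names; the paper's exact reward may differ):

```python
import torch
import torch.nn.functional as F

def contrastive_reward(query_emb, pos_doc_emb, neg_doc_embs, temperature=0.05):
    """Reward for one sampled rationale: the log-probability that the
    rationale-conditioned query embedding retrieves its positive document
    over in-batch negatives. A higher reward means the explanation led to
    a more discriminative embedding, which the RL update reinforces."""
    q = F.normalize(query_emb, dim=-1)                       # (D,)
    docs = torch.cat([pos_doc_emb[None], neg_doc_embs], 0)   # (1 + N, D)
    logits = F.normalize(docs, dim=-1) @ q / temperature     # (1 + N,)
    return F.log_softmax(logits, dim=0)[0]                   # positive slot
```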
From a Contrastive Learning Perspective, Is GRPO Just DPO?
自动驾驶之心 · 2025-10-18 16:03
Core Insights
- The article recounts the development of an efficient GRPO (Group Relative Policy Optimization) pipeline and its implications for reinforcement learning, highlighting the challenges and breakthroughs encountered along the way [1][2].

Group 1: Research Development
- The initial focus was on speeding up GRPO, with an emphasis on sampling efficiency, a common bottleneck in reinforcement learning [2][3].
- The author experimented with tree-based sampling methods but found they did not deliver the expected efficiency gains [3].
- A second approach, "speculative sampling," aimed to exit early once a correct sample was obtained, but implementation challenges hindered its performance [3][4].

Group 2: Methodological Innovations
- The third approach used historical data to estimate each prompt's probability of producing a correct answer, yielding a more efficient Bayesian sampling strategy (a sketch of such an estimator follows this summary) [4].
- Experiments showed that reducing the number of rollouts per prompt did not significantly hurt performance, indicating the methodology is robust [4][5].
- Exploring contrastive learning principles led to insights about the relationship between DPO (Direct Preference Optimization) and GRPO, suggesting avenues for further research [5].

Group 3: Community and Collaboration
- The article emphasizes the role of community engagement in advancing research, with discussions and collaborations helping refine ideas and methodologies [8][10].
- A comprehensive community focused on large-model technologies has been established to facilitate knowledge sharing and collaboration across academic research and practical applications [9][10].
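The "historical data to estimate the correctness of prompts" idea has a natural Bayesian form: keep a Beta-Bernoulli posterior over each prompt's pass rate and allocate rollouts accordingly. A minimal sketch under that assumption (the post does not specify its exact estimator):

```python
class PromptPassRate:
    """Beta-Bernoulli posterior over a prompt's probability of yielding a
    correct sample. Prompts whose estimated pass rate sits near 0 or 1
    produce near-zero group-relative advantage in GRPO, so they can be
    sampled less, saving rollouts for informative prompts."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta    # Beta(1, 1) = uniform prior

    def update(self, num_correct, num_rollouts):
        self.alpha += num_correct
        self.beta += num_rollouts - num_correct

    def mean(self):
        return self.alpha / (self.alpha + self.beta)
```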
Cracking Structured Long-Document Retrieval: A New Framework Rids Models of "Structural Blindness"
量子位 · 2025-09-25 11:42
Core Insights
- The article introduces SEAL (Structure and Element Aware Learning), a new contrastive learning framework designed to enhance models' understanding of long documents through structural awareness and element alignment [1][8].

Group 1: SEAL Framework Overview
- SEAL innovatively integrates both the macro-level structure and the micro-level semantic elements of documents into a unified embedding space, significantly improving pre-trained language models' ability to understand and represent structured data [3].
- The framework addresses two main challenges in long-document retrieval: making models aware of document hierarchy, and promoting precise alignment between user queries and specific document elements [18][25].

Group 2: Training Strategies
- The framework employs two complementary training strategies: Structure Aware Learning (SAL) and Element Aware Learning (EAL); a sketch of both follows this summary [9].
- SAL focuses on understanding the "skeleton" of a document by showing the model two versions, one with structural tags and one without, encouraging it to learn the inherent structural function of each text segment [12][13].
- EAL strengthens the model's grasp of local elements' semantic roles by introducing a masking mechanism, requiring the model to infer overall document relevance from incomplete information [14][15].

Group 3: Experimental Results
- Applying the SEAL framework notably improved the BGE-M3 model's retrieval ranking quality, raising MRR@10 from 73.96% to 77.84% [17][19].
- The results indicate a stronger ability to rank more relevant results higher, validated by online A/B testing [20].

Group 4: Open Source Dataset
- The team released a new dataset named StructDocRetrieval, containing long documents with structural annotations, far longer than those in typical short-text datasets such as MS MARCO [21][22].
- The dataset, in HTML format, provides rich structural semantic annotations, filling a gap in the field [23].

Group 5: Broader Implications
- SEAL's refined understanding of structural information can provide more reliable information sources for downstream tasks, such as helping AI assistants accurately locate answers in technical documents [25].
- The framework shows promising applications in specialized fields such as enterprise knowledge bases and legal technology [25].
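The two training strategies suggest a simple data-side recipe: build a tagged/untagged positive pair for SAL and a masked view for EAL. The sketch below is one hypothetical realization of that pair construction, not the authors' code:

```python
import re

def sal_pair(html_doc):
    """Structure Aware Learning pair: the document with structural tags
    kept vs. stripped to plain text. Treating the two views as positives
    pushes the encoder to infer structure even when tags are absent."""
    plain = re.sub(r"<[^>]+>", " ", html_doc)   # naive tag stripping
    return html_doc, " ".join(plain.split())

def eal_view(elements, mask_idx, mask_token="[MASK]"):
    """Element Aware Learning view: mask one element (a title, table,
    list, ...) so document relevance must be inferred from the rest."""
    return [mask_token if i == mask_idx else e
            for i, e in enumerate(elements)]
```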
Kaiming He Improves Saining Xie's REPA: Greatly Simplified, Performance Still Strong
机器之心 · 2025-06-12 09:57
Core Viewpoint
- The article discusses the significance of representation learning in generative models, centering on a new method called Dispersive Loss, which integrates self-supervised learning into diffusion-based generative models without requiring additional pre-training or external data sources [6][9][43].

Group 1: Diffusion Models and Representation Learning
- Diffusion models excel at modeling complex data distributions but have remained largely disconnected from the representation learning field [2].
- The training objectives of diffusion models typically focus on reconstruction tasks such as denoising, with no explicit regularization of the learned representations [3].
- Representation learning, particularly self-supervised learning, is crucial for learning general representations applicable to a wide range of downstream tasks [4].

Group 2: Introduction of Dispersive Loss
- Dispersive Loss is a flexible, general plug-in regularizer that integrates self-supervised learning into diffusion-based generative models [9].
- Its core idea is to add a regularization objective on the model's internal representations, encouraging them to spread out in the latent space (a brief sketch follows this summary) [10][13].
- The method requires no additional layers or parameters, making it simple and self-contained [15][16].

Group 3: Comparison with Existing Methods
- Dispersive Loss needs no pre-training, external data, or additional model parameters, unlike the REPA method, which relies on pre-trained models [7][41][43].
- The new method demonstrates that representation learning can benefit generative modeling without external information sources [13][43].
- In practice, introducing Dispersive Loss requires minimal changes, such as specifying which intermediate layers to regularize [29].

Group 4: Performance Evaluation
- Experimental results show that Dispersive Loss consistently outperforms the corresponding contrastive losses while avoiding the complexities of two-view sampling [33].
- The method was tested across various models, including DiT and SiT, with improvements in all settings, particularly in larger models where effective regularization matters most [36][37].
- The article notes that Dispersive Loss also generalizes to one-step diffusion-based generative models, indicating its versatility [44].
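The article describes Dispersive Loss as a positive-pair-free regularizer on intermediate activations: in its InfoNCE-style L2 variant, only the repulsive term survives. That variant is compact enough to sketch; the layer choice and temperature below are illustrative assumptions:

```python
import math
import torch

def dispersive_loss(z, tau=0.5):
    """Encourage a batch of intermediate representations to spread out.

    z: (B, D) activations from a chosen layer of the diffusion model
       (flattened per sample). Computes the log-mean-exp of
       -||z_i - z_j||^2 / tau over all pairs; minimizing it pushes
       representations apart. No positive pairs, no second view, and no
       extra parameters are involved.
    """
    z = z.flatten(1)
    d2 = torch.cdist(z, z).pow(2)                  # (B, B) squared distances
    return (torch.logsumexp((-d2 / tau).flatten(), dim=0)
            - math.log(z.size(0) ** 2))

# Plugged in alongside the usual objective, e.g.:
#   total = diffusion_loss + lam * dispersive_loss(hidden_activations)
```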