Diffusion Policy
From Complete Beginner to Embodied AI Algorithm Engineer: the Leveling-Up Journey
具身智能之心· 2025-11-20 04:02
Today one of our longtime students landed an offer from a leading company. He joked that the grind from complete beginner to algorithm engineer was anything but easy, but there really is a path. From tinkering with a self-purchased so-100 on his own to later following a systematic curriculum, he not only saved a great deal of time but also avoided many common pitfalls. Here we recommend several research routes in the embodied AI direction, covering VLA, VLN, diffusion policy, reinforcement learning, and more. You are also welcome to scan the QR code to start learning directly.

VLA direction

A VLA-based robot system mainly consists of a visual perception module, a language-instruction understanding module, and a policy network that generates executable robot actions. Depending on the requirements, current VLA work falls into three paradigms: explicit end-to-end VLA, implicit end-to-end VLA, and hierarchical end-to-end VLA.

Explicit end-to-end VLA is the most common and classic paradigm. It typically compresses visual and language information into a joint representation and then maps that representation to the action space to generate the corresponding actions. This end-to-end paradigm builds on a broad body of prior research; with different architectures (diffusion / transformer / DiT), model sizes, application settings (2D / 3D), and task requirements (training from scratch / downstream fine-tuning), it has produced a wide variety of approaches with solid performance.

Implicit end-to-end VLA, unlike the former, pays more attention to interpretability and aims to leverage current video d ...
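To make the explicit end-to-end pattern concrete, here is a minimal PyTorch sketch of a policy that fuses vision and language features into a joint representation and maps it to an action. The encoder dimensions, module names, and the transformer-based fusion are illustrative assumptions, not any specific published model.

```python
# Minimal sketch of the "explicit end-to-end VLA" pattern described above:
# fuse vision and language into a joint representation, then map it directly
# to an action. Dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class ExplicitVLA(nn.Module):
    def __init__(self, vision_dim=768, text_dim=512, joint_dim=512, action_dim=7):
        super().__init__()
        # Project pre-extracted vision/language features into a shared space.
        self.vision_proj = nn.Linear(vision_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=joint_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Policy head: map the joint representation back to the action space.
        self.action_head = nn.Sequential(
            nn.Linear(joint_dim, 256), nn.SiLU(), nn.Linear(256, action_dim),
        )

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, Nv, vision_dim); text_tokens: (B, Nt, text_dim)
        tokens = torch.cat(
            [self.vision_proj(vision_tokens), self.text_proj(text_tokens)], dim=1
        )
        joint = self.fusion(tokens).mean(dim=1)   # pooled joint representation
        return self.action_head(joint)            # e.g. end-effector pose + gripper
```

In practice the vision and text tokens would come from pre-trained encoders, and the simple MLP action head could be swapped for a diffusion or DiT head, matching the architectural variants mentioned above.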
The Cocos System: Faster Convergence and Higher Success Rates for Your VLA Model
具身智能之心· 2025-08-22 00:04
Core Viewpoint - The article discusses advancements in embodied intelligence, focusing on diffusion policies and the introduction of a new method called Cocos, which addresses loss collapse when training diffusion policies, leading to improved training efficiency and performance [3][11][25].

Summary by Sections

Introduction - Embodied intelligence is a cutting-edge field in AI research, emphasizing the need for robots to understand and execute complex tasks effectively. Diffusion policies have emerged as a mainstream paradigm for constructing vision-language-action (VLA) models, although training efficiency remains a challenge [3].

Loss Collapse and Cocos - The article identifies loss collapse as a significant challenge in training diffusion policies: the neural network struggles to distinguish between generation conditions, degrading the training objective. Cocos modifies the source distribution to depend on the generation condition, effectively addressing this issue [6][9][25].

Flow Matching Method - Flow matching is a core method in diffusion-style models, transforming a simple source distribution into a complex target distribution through an optimized regression objective. The article outlines the optimization objective for conditional flow matching, which is crucial for VLA models [5][6].

Experimental Results - Quantitative results demonstrate that Cocos significantly improves training efficiency and policy performance across benchmarks including LIBERO and MetaWorld, as well as real-world robotic tasks [14][16][19][24].

Case Studies - Case studies illustrate practical applications of Cocos in simulation tasks, highlighting its effectiveness in helping the robot distinguish between different camera perspectives and successfully complete tasks [18][21].

Source Distribution Design - The article reports experiments on source-distribution design, comparing different standard deviations and training methods. It concludes that a standard deviation of 0.2 is optimal, and that training the source distribution with a VAE yields comparable results [22][24].

Conclusion - Cocos provides a general improvement for diffusion-policy training by effectively solving the loss collapse problem, laying a foundation for future research and applications in embodied intelligence [25].
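To illustrate the mechanism this summary describes, below is a minimal PyTorch sketch of conditional flow matching in which the source distribution depends on the generation condition, in the spirit of Cocos as summarized above. The network architecture, function names, and the fixed 0.2 standard deviation (echoing the source-distribution experiments mentioned above) are illustrative assumptions, not the paper's implementation.

```python
# Sketch: conditional flow matching with a condition-dependent source
# distribution. All module and variable names here are illustrative.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity v(x_t, t, c) for an action chunk."""
    def __init__(self, action_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, cond_encoder, source_head, actions, obs):
    """One training step of conditional flow matching.

    Standard flow matching draws x0 ~ N(0, I) regardless of the condition;
    the variant sketched here draws x0 from a distribution whose mean depends
    on the observation, so a velocity field that ignores the condition can no
    longer fit the regression target.
    """
    cond = cond_encoder(obs)                        # (B, cond_dim)
    mu0 = source_head(cond)                         # condition-dependent mean
    x0 = mu0 + 0.2 * torch.randn_like(actions)      # fixed std, assumed 0.2
    x1 = actions                                    # target action chunk

    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t) * x0 + t * x1                     # linear interpolation path
    target_v = x1 - x0                              # constant velocity along the path

    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()
```

The key difference from standard flow matching is the x0 sample: because it is tied to the condition, collapsing to a condition-independent solution incurs loss, which is how the loss-collapse failure mode described above is avoided.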
Diffusion World Model LaDi-WM Substantially Improves Robot Manipulation Success Rates and Cross-Scene Generalization
具身智能之心· 2025-08-18 00:07
Core Viewpoint - The article presents LaDi-WM (Latent Diffusion-based World Models), a novel world model that improves robotic manipulation by guiding the policy with predicted future states, addressing the challenge of accurately predicting future states in robot-object interactions [1][5][28].

Group 1: LaDi-WM Overview
- LaDi-WM uses pre-trained vision foundation models to build latent-space representations that capture both geometric and semantic features, facilitating policy learning and cross-task generalization in robotic manipulation [1][5][10].
- The framework consists of two main phases, world-model learning and policy learning, which iteratively refine action outputs based on predicted future states [9][12].

Group 2: Methodology
- The world-model learning phase extracts geometric representations with DINOv2 and semantic representations with SigLIP, followed by an interactive diffusion process that improves dynamic-prediction accuracy [10][12].
- Policy training incorporates the world model's future predictions as additional inputs, guiding the model to improve its action predictions and reducing the entropy of the output distribution over iterations [12][22].

Group 3: Experimental Results
- In virtual experiments on the LIBERO-LONG benchmark, LaDi-WM achieved a 68.7% success rate with only 10 training trajectories, outperforming previous methods by a significant margin [15][16].
- On the CALVIN D-D benchmark, the framework completed task chains with an average length of 3.63, indicating robust capability on long-horizon tasks [17][21].
- Real-world experiments showed a 20% increase in success rates on tasks such as stacking bowls and operating drawers, validating the effectiveness of LaDi-WM in practical scenarios [25][26].

Group 4: Scalability and Generalization
- Scalability experiments indicate that increasing the world model's training data reduces prediction error and improves policy performance [18][22].
- The world model's generalization capability is highlighted by its ability to guide policy learning across different environments, achieving better performance than models trained solely in the target environment [20][21].
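The two-phase pattern described above can be sketched as follows: a latent world model predicts future latents from the current latent and a candidate action, and the policy consumes those predictions as extra input, refining its action over a few iterations. This is a simplified stand-in, not the LaDi-WM codebase; the MLP dynamics and class names are placeholders for the DINOv2/SigLIP features and diffusion-based dynamics the article describes.

```python
# Sketch of world-model-guided policy learning: the policy takes the
# predicted future latent as extra input. Names are illustrative placeholders.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predicts the future latent state from the current latent and an action."""
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 512):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, action):
        return self.dynamics(torch.cat([z_t, action], dim=-1))

class WorldModelGuidedPolicy(nn.Module):
    """Policy that refines its action given the predicted future latent."""
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * latent_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z_t, prev_action, z_future_pred):
        return self.head(torch.cat([z_t, prev_action, z_future_pred], dim=-1))

def iterative_refinement(policy, world_model, z_t, action_dim, steps: int = 3):
    """Alternate between predicting future latents and updating the action,
    mirroring the iterative refinement loop summarized above."""
    action = torch.zeros(z_t.shape[0], action_dim, device=z_t.device)
    for _ in range(steps):
        z_future = world_model(z_t, action)
        action = policy(z_t, action, z_future)
    return action
```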
Beyond VLA: A Roundup of Embodied + VA Work
自动驾驶之心· 2025-07-14 10:36
Core Insights - The article surveys recent work in embodied intelligence and robotic manipulation, highlighting research projects and methodologies aimed at improving robot learning and performance in real-world tasks [2][3][4].

Group 1: 2025 Research Highlights
- Numerous 2025 works are compiled, including "Steering Your Diffusion Policy with Latent Space Reinforcement Learning" and "Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation", which aim to enhance robotic capabilities in manipulation and interaction [2].
- The "BEHAVIOR Robot Suite" aims to streamline real-world whole-body manipulation for everyday household activities, indicating a focus on practical applications of robotic technology [2].
- "You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations" emphasizes the potential for robots to learn complex tasks from minimal demonstrations, showcasing advances in imitation learning [2].

Group 2: Methodological Innovations
- The article covers innovative methodologies such as "Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning", which aims to improve robots' adaptability across environments [2].
- "Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion" highlights the focus on enhancing dexterity in robotic hands, crucial for complex manipulation tasks [4].
- "Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation" indicates a trend toward training robots on synthetic data, which can significantly reduce the need for real-world data collection [7].

Group 3: Future Directions
- The agenda for 2024 and beyond includes projects such as "Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching", suggesting a shift toward richer data representations for improved learning outcomes [9].
- "Zero-Shot Framework from Image Generation World Model to Robotic Manipulation" points to a future in which robots generalize from visual data without task-specific training, enhancing their versatility [9].
- The emphasis on "Human-to-Robot Data Augmentation for Robot Pre-training from Videos" reflects growing interest in leveraging human demonstrations to improve robotic learning efficiency [7].