Vision-Language-Action (VLA) Models

A VLA That Checks Its Own Work! ReflectDrive: A Safer, More Efficiently Scaling End-to-End Framework (Li Auto & Tsinghua)
自动驾驶之心· 2025-09-27 23:33
ReflectDrive, the self-checking planner: my trajectory, my call, with safety dialed all the way up. End-to-end autonomous driving has become an important and fast-growing research area, and learning human-like driving policies from large-scale datasets holds considerable promise. Yet there is still no framework that sustainably handles multimodal behavior and long-tail scenarios. Relying on reinforcement learning alone makes reward hacking a thorny problem: it is very hard to write a comprehensive reward that covers continuous trajectories in a complex 3D space. The recent breakthroughs in the generalization ability of large language models therefore raised hopes that model scaling and data scaling could unlock similar generalization, which is the rise of VLA models. Everyone wants to tap the generalization of VLMs to handle few-shot/zero-shot scenarios with less data. Below is an analysis of the pain points of current VLA approaches to autonomous driving. From this, what is urgently needed is a fusion of the L (language) and A (action) modalities: a unified architecture that is easier to scale while still generating efficiently. To address these challenges, a team from Li Auto and Tsinghua proposes ReflectDrive, a new learning framework that achieves safe trajectory generation through a reflection mechanism built on discrete diffusion. The 2D driving space is first discretized to build an action codebook, so that a pre-trained diffusion language model can be adapted to the planning task by fine-tuning. At the core of the framework is ...
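The codebook idea can be illustrated with a tiny sketch: quantize the ego-centric 2D driving plane onto a uniform grid so that each waypoint becomes one discrete token a diffusion language model could denoise. Everything below (grid ranges, resolution, function names) is an assumption for illustration, not ReflectDrive's actual discretization.

```python
import numpy as np

# Hypothetical illustration of an action codebook: quantize the ego-centric
# 2D driving plane onto a uniform grid so each waypoint becomes one token.
# Grid ranges and resolution are assumptions, not values from the paper.
X_RANGE, Y_RANGE, BINS = (-10.0, 50.0), (-10.0, 10.0), 128

def waypoint_to_token(x: float, y: float) -> int:
    """Map a continuous (x, y) waypoint to a single codebook index."""
    xi = int(round(float(np.clip((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]), 0, 1)) * (BINS - 1)))
    yi = int(round(float(np.clip((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]), 0, 1)) * (BINS - 1)))
    return xi * BINS + yi

def token_to_waypoint(token: int) -> tuple[float, float]:
    """Invert the mapping, returning the grid point the token encodes."""
    xi, yi = divmod(token, BINS)
    x = X_RANGE[0] + xi / (BINS - 1) * (X_RANGE[1] - X_RANGE[0])
    y = Y_RANGE[0] + yi / (BINS - 1) * (Y_RANGE[1] - Y_RANGE[0])
    return x, y

# A short trajectory becomes a token sequence a discrete diffusion LM can denoise.
trajectory = [(2.0, 0.1), (6.5, 0.4), (12.0, 1.2)]
tokens = [waypoint_to_token(x, y) for x, y in trajectory]
print(tokens, [token_to_waypoint(t) for t in tokens])
```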
When Robots Learn to "Imitate" Humans: How Does RynnVLA-001 Break Through the Manipulation-Data Scarcity Bottleneck?
具身智能之心· 2025-09-22 00:03
Author: Yuming Jiang et al. In an era when large language models and multimodal models are advancing rapidly, robotic manipulation remains stuck on a key problem: the scarcity of large-scale, high-quality manipulation data. Traditional robot data collection relies on humans teleoperating physical hardware to record trajectories, which is labor-intensive, time-consuming, and costly, directly holding back progress on Vision-Language-Action (VLA) models. To break this deadlock, a team from Alibaba DAMO Academy proposes a new VLA model, RynnVLA-001. The model takes a different route and turns to human demonstration data: using 12 million ego-centric human manipulation videos and a two-stage pre-training strategy, it lets robots "learn" human manipulation logic and motion trajectories. From predicting the visual dynamics of future manipulation frames, to building an action mapping from human keypoint trajectories, to introducing ActionVAE to improve the smoothness of robot actions, RynnVLA-001 builds a bridge from "human demonstration" to "robot manipulation". Experiments show that on the LeRobot SO100 arm, RynnVLA-0 ...
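A minimal sketch of the action-VAE idea, assuming a simple MLP encoder/decoder over fixed-length 7-DoF action chunks; the dimensions, architecture, and loss weighting are placeholders, not the RynnVLA-001 design.

```python
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    """Toy variational autoencoder over short action chunks (illustrative only;
    sizes and architecture are assumptions, not the RynnVLA-001 design)."""
    def __init__(self, chunk_len=16, action_dim=7, latent_dim=32):
        super().__init__()
        flat = chunk_len * action_dim
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(flat, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, flat), nn.Unflatten(1, (chunk_len, action_dim)))

    def forward(self, actions):                     # actions: (B, T, D)
        h = self.encoder(actions)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(z)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

vae = ActionVAE()
chunk = torch.randn(4, 16, 7)        # batch of 7-DoF action chunks
recon, kl = vae(chunk)
loss = nn.functional.mse_loss(recon, chunk) + 1e-3 * kl
print(loss.item())
```

Compressing a whole chunk into one latent is what encourages smooth, temporally consistent action sequences, since the decoder reconstructs the chunk jointly rather than step by step.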
TrajBooster: The First Whole-Body Humanoid Manipulation VLA Solution, Solving the Data Problem Across Embodiments (Code Fully Open-Sourced)
具身智能之心· 2025-09-18 00:03
Core Insights
- The article discusses the TrajBooster framework, which aims to enhance the capabilities of humanoid robots through a trajectory-centric learning approach, enabling them to perform complex household tasks with minimal training data [2][40].

Group 1: Research Background and Challenges
- The development of humanoid robots faces two main challenges: the unique difficulty of maintaining dynamic balance while performing upper-body tasks, and the scarcity of high-quality training data needed for effective VLA model training [3][4].
- Existing methods rely on expensive equipment and expert operators, yielding limited datasets that do not adequately cover the diverse action spaces humanoid robots require [4].

Group 2: TrajBooster Framework
- TrajBooster uses a three-step process: real trajectory extraction, simulation retargeting, and dual-stage fine-tuning, converting extensive wheeled-robot data into effective training resources for bipedal robots [5][40].
- The framework significantly reduces dependence on costly same-embodiment data, enabling zero-shot skill transfer and improving the robustness and generalization of VLA models [2][5].

Group 3: Methodology
- The pipeline first extracts real trajectories from the Agibot-World Beta dataset, which contains over 1 million real robot trajectories, and then maps this data into the Unitree G1 robot's workspace (a toy retargeting sketch follows this summary) [7][9].
- A hierarchical composite model decouples control into upper- and lower-body systems, improving the efficiency of whole-body manipulation [11][12].

Group 4: Experimental Results
- TrajBooster achieved the lowest position error (2.851 cm) and rotation error (6.231 degrees) in mobile scenarios, validating the advantages of hierarchical training and coordinated online DAgger [27].
- Its ability to adapt to unseen tasks was evidenced by success on a "water transfer" task absent from the training data, showing improved generalization [39][40].

Group 5: Limitations and Future Directions
- The current implementation is limited by the precision of the Unitree Dex-3 hand, which only supports simple grasping; future work will integrate dexterous hands with tactile sensing for more complex manipulation [41].
- Visual-input discrepancies still need to be addressed, and the framework should be extended to mobile-manipulation data, since the current research focuses primarily on static tasks [43][44].
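The retargeting step in Group 3 can be illustrated with a toy sketch that rescales source end-effector waypoints into the target robot's reachable workspace. The axis-aligned bounds and the linear mapping are assumptions for illustration only; they are neither the paper's retargeting method nor real Unitree G1 limits.

```python
import numpy as np

# Illustrative retargeting step: map source end-effector waypoints into the
# target robot's reachable workspace. The bounds below are made-up placeholders.
SRC_BOUNDS = np.array([[-0.6, 0.6], [-0.5, 0.5], [0.2, 1.4]])    # source x/y/z
TGT_BOUNDS = np.array([[-0.4, 0.4], [-0.35, 0.35], [0.3, 1.1]])  # target x/y/z

def retarget(waypoints: np.ndarray) -> np.ndarray:
    """Linearly rescale (N, 3) waypoints from the source to the target workspace."""
    src_lo, src_hi = SRC_BOUNDS[:, 0], SRC_BOUNDS[:, 1]
    tgt_lo, tgt_hi = TGT_BOUNDS[:, 0], TGT_BOUNDS[:, 1]
    unit = (np.clip(waypoints, src_lo, src_hi) - src_lo) / (src_hi - src_lo)
    return tgt_lo + unit * (tgt_hi - tgt_lo)

traj = np.array([[0.5, 0.1, 0.9], [0.2, -0.3, 1.2]])
print(retarget(traj))
```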
SimpleVLA-RL: Breaking the VLA Model Training Bottleneck with End-to-End Online RL Training
具身智能之心· 2025-09-15 00:04
Core Insights
- The article discusses the SimpleVLA-RL framework, which enhances the training of Vision-Language-Action (VLA) models in robotics through reinforcement learning (RL), addressing the limitations of traditional supervised fine-tuning (SFT) [2][4][30].

Group 1: Research Background and Challenges
- VLA models are crucial for integrating visual perception, language understanding, and action generation in robotic control, but current training methods face significant challenges, including data scarcity and weak generalization [2][5].
- Breakthroughs in large reasoning models suggest that RL can improve the sequential action planning of VLA models, but traditional RL methods are limited by manual reward design and the high cost of environment interaction [2][5].

Group 2: Contributions of SimpleVLA-RL
- SimpleVLA-RL is designed specifically for VLA, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training cost and improves scalability (see the sketch after this summary) [6][9].
- The framework achieves state-of-the-art (SOTA) performance across multiple benchmarks, with notable gains in success rate; for example, LIBERO's average success rate rises from 91.0% to 99.1% [6][12].
- SimpleVLA-RL also demonstrates strong data efficiency, reaching a 96.9% LIBERO average success rate with only one demonstration trajectory, surpassing traditional methods [16][17].

Group 3: Generalization and Real-World Application
- The framework generalizes robustly to unseen tasks, with significant performance gains across scenarios, indicating that it learns transferable skills rather than overfitting to specific data [22][30].
- SimpleVLA-RL is effective in sim-to-real transfer, improving real-world task success rates from 17.5% to 38.5% and validating its deployability [7][21].

Group 4: Key Discoveries
- Training with the framework surfaced the "Pushcut" phenomenon, in which the RL-trained model autonomously develops strategies more efficient than the human demonstrations, hinting at novel robotic behaviors [24][30].
- The effectiveness of SimpleVLA-RL depends on the initial model's capability, with larger gains observed when starting from a higher baseline success rate [28][29].
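A minimal sketch of the outcome-reward RL idea referenced in Group 2, using a REINFORCE-style update with a group baseline on a toy task; the environment, reward, and policy here are all illustrative assumptions, not SimpleVLA-RL's actual setup.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Toy stand-in for a VLA policy; a real setup would condition on images
    and language and emit action tokens."""
    def __init__(self, n_actions=8):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_actions))

    def sample(self, steps=4):
        dist = torch.distributions.Categorical(logits=self.logits)
        actions = dist.sample((steps,))
        return actions, dist.log_prob(actions).sum()

def success_reward(actions, goal=3):
    """Sparse outcome reward: 1.0 only if the rollout ever reaches the goal."""
    return 1.0 if bool((actions == goal).any()) else 0.0

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=0.1)
for _ in range(200):
    rollouts = [policy.sample() for _ in range(16)]          # "parallel" rollouts
    rewards = torch.tensor([success_reward(a) for a, _ in rollouts])
    baseline = rewards.mean()                                # group baseline
    loss = -torch.stack([(r - baseline) * logp
                         for (_, logp), r in zip(rollouts, rewards)]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(torch.softmax(policy.logits, dim=0))
```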
A Robot Joins the Laundromat and Starts Earning a Wage! Built by a Former Apple AI Executive
量子位· 2025-09-14 05:05
Core Viewpoint
- The article introduces a laundry-folding robot named Isaacs, developed by Weave Robotics, designed to automate the labor-intensive task of folding clothes in laundromats, a notable step for household robotics [1][3][4].

Group 1: Company Overview
- Weave Robotics was founded by former Apple team members, indicating a strong background in technology and product development [4][15].
- The company closed three financing rounds even before the official product launch, showing investor confidence in its potential [4].

Group 2: Technology and Functionality
- Isaacs is not only a folding robot: it is a general household robot expected to handle tasks such as tidying items and home security in the future [12][14].
- The robot is built on a three-tier technology stack that includes a self-trained vision-language-action (VLA) model for precise identification and folding of garments [10][18].
- Isaacs folds about 70% of items fully autonomously, with a human intervening only when necessary (a toy shared-autonomy sketch follows this summary) [18].

Group 3: Operational Process
- The workflow begins with the laundromat handling washing and drying, after which Isaacs takes over folding, a labor-intensive step with strict neatness requirements [5][8].
- Specific folding standards ensure that items such as shirts are folded uniformly and neatly, down to details like collar orientation [6][7].

Group 4: Future Prospects
- The company plans to extend Isaacs beyond folding to a range of household tasks, addressing diverse family needs [14].
- Privacy is considered in the design: the robot can shut down its camera when idle [14].
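The "intervene only when necessary" pattern in Group 2 amounts to a shared-autonomy gate. The stub below is purely hypothetical and is not Weave Robotics' interface; it just shows a confidence threshold deciding between autonomous execution and a human hand-off.

```python
import random

class StubFoldingPolicy:
    """Stand-in policy that returns an action plus a confidence score.
    Purely hypothetical; not Weave Robotics' actual interface."""
    def predict(self, observation):
        return {"fold": "shirt", "item": observation}, random.random()

def next_step(policy, observation, threshold=0.8):
    """Shared-autonomy gate: act autonomously when confident, otherwise
    flag the item for human assistance."""
    action, confidence = policy.predict(observation)
    if confidence >= threshold:
        return action, "autonomous"
    return None, "needs_human"

policy = StubFoldingPolicy()
print([next_step(policy, obs) for obs in ("shirt_01", "towel_02")])
```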
AI Day Livestream | MemoryVLA: Supporting Long-Horizon Robotic Manipulation Tasks
自动驾驶之心· 2025-09-03 03:19
Core Viewpoint
- The article discusses MemoryVLA, a cognitive-memory-action framework inspired by human memory systems, aimed at improving the performance of Vision-Language-Action (VLA) models on long-horizon robotic manipulation tasks [3][7].

Group 1: VLA Challenges and Solutions
- Existing VLA models rely mainly on the current observation, which leads to poor performance on long-horizon, temporally dependent tasks [7].
- Cognitive science indicates that humans manage such tasks with a memory system involving transient neural activity and the hippocampus, which serves as the inspiration for MemoryVLA [7].

Group 2: MemoryVLA Framework
- MemoryVLA uses a pre-trained Vision-Language Model (VLM) to encode observations into perceptual and cognitive tokens, forming a working memory [3].
- A Perceptual-Cognitive Memory Bank stores consolidated low-level details and high-level semantics, from which relevant entries are adaptively retrieved for decision-making (see the retrieval sketch after this summary) [3].

Group 3: Implications for Robotics
- The framework aims to strengthen robots' ability to perform tasks that require temporal awareness and memory, which are inherent to robotic manipulation [3][7].
- The article also touches on the role of memory and reasoning within VLA models, suggesting these areas deserve further exploration [7].
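A minimal sketch of the adaptive retrieval step referenced in Group 2, assuming cosine-similarity lookup into the memory bank followed by a simple weighted merge with the current token; the dimensions, top-k value, and merge rule are illustrative assumptions rather than MemoryVLA's actual mechanism.

```python
import torch

def retrieve_and_merge(memory_bank, current, k=4):
    """memory_bank: (N, D) stored tokens; current: (D,) current-step token.
    Retrieve the k most similar entries and blend them into the current token."""
    sims = torch.nn.functional.cosine_similarity(
        memory_bank, current.unsqueeze(0), dim=-1)
    topk = sims.topk(min(k, memory_bank.shape[0]))
    retrieved = memory_bank[topk.indices]                    # (k, D)
    weights = torch.softmax(topk.values, dim=0).unsqueeze(-1)
    context = (weights * retrieved).sum(dim=0)               # weighted recall
    return 0.5 * current + 0.5 * context                     # simple merge gate

bank = torch.randn(32, 64)      # 32 consolidated memory entries
current = torch.randn(64)
print(retrieve_and_merge(bank, current).shape)
```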
MemoryVLA: Fitting Robots with a Hippocampus to Support Long-Horizon Manipulation Tasks
具身智能之心· 2025-09-03 00:03
Core Viewpoint
- The article discusses MemoryVLA, a cognitive-memory-action framework inspired by human memory systems, aimed at robotic manipulation tasks with long-horizon temporal dependencies [3][7].

Group 1: Current Issues in VLA Models
- Existing Vision-Language-Action (VLA) models rely mainly on the current observation, leading to poor performance on long-horizon, temporally dependent tasks [2][7].
- Cognitive science indicates that humans handle such tasks over time with a memory system involving transient neural activity and the hippocampus [7].

Group 2: MemoryVLA Framework
- MemoryVLA is designed to give robots a memory system, drawing on human cognitive mechanisms [3][7].
- The framework includes a pre-trained Vision-Language Model (VLM) that encodes observations into perceptual and cognitive tokens, which are stored in a Perceptual-Cognitive Memory Bank [3].
- Working memory retrieves relevant entries from the memory bank and merges them with the current tokens to adaptively update the memory (a toy consolidation sketch follows this summary) [3].

Group 3: Importance of Memory in Robotics
- The article emphasizes the necessity of memory for robotic tasks, since it improves decision-making and action sequencing in complex environments [3][7].
- A memory-conditioned diffusion action expert uses the tokens to generate temporally aware action sequences [3].
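Complementing the retrieval sketch above, here is a toy view of the memory-update step from Group 2: append a new token to the bank unless it is nearly redundant with an existing entry, in which case merge them. The similarity threshold and averaging rule are assumptions, not MemoryVLA's consolidation scheme.

```python
import torch

def consolidate(memory_bank, new_entry, sim_threshold=0.9):
    """Append new_entry to the (N, D) bank, or merge it into the closest
    existing entry when it is nearly redundant."""
    if memory_bank.numel() == 0:
        return new_entry.unsqueeze(0)
    sims = torch.nn.functional.cosine_similarity(
        memory_bank, new_entry.unsqueeze(0), dim=-1)
    best = sims.argmax()
    if sims[best] > sim_threshold:                  # near-duplicate: merge
        memory_bank[best] = 0.5 * (memory_bank[best] + new_entry)
        return memory_bank
    return torch.cat([memory_bank, new_entry.unsqueeze(0)], dim=0)

bank = torch.randn(8, 64)
bank = consolidate(bank, torch.randn(64))
print(bank.shape)
```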
Latest from Yao Mu's Team! Bringing Discrete Diffusion into VLA for Precise Action Modeling and Consistency Training
具身智能之心· 2025-09-02 00:03
Core Viewpoint
- The article introduces the Discrete Diffusion VLA model, which integrates discrete diffusion into the Vision-Language-Action (VLA) framework, improving the efficiency and accuracy of robotic action decoding [4][7].

Group 1: Background and Problem Statement
- VLA models let robots understand visual and language inputs and execute corresponding action sequences; current VLA frameworks typically adapt a large pre-trained vision-language model (VLM) by adding an action-generation head [4].
- Existing decoding methods fall into two categories: autoregressive (AR) methods, which generate actions sequentially, and continuous diffusion methods, which treat action trajectories as continuous signals [4][6].

Group 2: Proposed Solution
- Discrete Diffusion VLA brings discrete diffusion into action decoding, using a single Transformer to unify the vision, language, and action modalities without additional training modules [6][12].
- The model adopts a "first easy, then difficult" adaptive decoding strategy that decodes actions in parallel and corrects errors, significantly improving accuracy (see the decoding sketch after this summary) [12][18].

Group 3: Performance Metrics
- On the LIBERO benchmark with a Franka Panda arm, the model reaches a 96.3% success rate, outperforming conventional AR and continuous-diffusion baselines [2][12].
- The Google robot setting achieves a 71.2% visual-matching score, and the WidowX robot reaches a 49.3% overall success rate in real-to-sim transfer scenarios, demonstrating the model's robustness [2][25].

Group 4: Experimental Results
- Discrete Diffusion VLA consistently outperforms the benchmarks, with an average success rate of 96.3% across tasks, surpassing the closest model, OpenVLA-OFT, by 0.8% [21][22].
- Its visual-matching and variant-aggregation results are also superior, with an overall average success rate of 64.1% across diverse scenarios [23][24].

Group 5: Ablation Studies
- Ablations show that the adaptive decoding strategy contributes markedly to performance, with the "max confidence" rule yielding a 97.4% success rate and outperforming other strategies [27].
- The temperature-scheduling scheme is likewise effective, also reaching 97.4% and validating the synergy between temperature adjustment and adaptive decoding [28].
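A minimal sketch of "first easy, then difficult" parallel decoding from Group 2: each round fills in the masked action tokens the model is most confident about and defers the rest. The toy model returns random logits, and the commit schedule and confidence rule are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def easy_first_decode(model_logits_fn, seq_len=8, vocab=256, rounds=4):
    """Iteratively commit the highest-confidence masked positions each round."""
    tokens = torch.full((seq_len,), -1)                 # -1 marks a masked slot
    for r in range(rounds):
        masked = (tokens == -1).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        logits = model_logits_fn(tokens)                # (seq_len, vocab)
        probs = torch.softmax(logits[masked], dim=-1)
        conf, pred = probs.max(dim=-1)
        # Commit a fraction of the still-masked positions this round.
        n_commit = max(1, masked.numel() // (rounds - r))
        keep = conf.topk(n_commit).indices
        tokens[masked[keep]] = pred[keep]
    return tokens

toy_model = lambda toks: torch.randn(toks.shape[0], 256)   # stand-in for a VLA head
print(easy_first_decode(toy_model))
```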
New Survey! A Roundup of Multimodal Fusion and VLM Methods in Embodied Robotics
具身智能之心· 2025-09-01 04:02
Core Insights
- The article surveys the transformative impact of multimodal fusion and Vision-Language Models (VLMs) on robot vision, enabling robots to evolve from simple mechanical executors into intelligent partners that understand and interact with complex environments [3][4][5].

Multimodal Fusion in Robot Vision
- Multimodal fusion integrates data such as RGB images, depth, LiDAR point clouds, language, and tactile signals, significantly enhancing robots' perception and understanding of their surroundings [3][4][9].
- The main fusion strategies have evolved from early explicit concatenation to implicit collaboration within unified architectures, improving feature extraction and task prediction (see the fusion sketch after this summary) [10][11].

Applications of Multimodal Fusion
- Semantic scene understanding is crucial for recognizing objects and their relationships, and multimodal fusion greatly improves accuracy and robustness in complex environments [9][10].
- 3D object detection is vital for autonomous systems, combining camera, LiDAR, and radar data to enrich environmental understanding [16][19].
- Embodied navigation allows robots to explore and act in real environments, spanning goal-oriented, instruction-following, and dialogue-based navigation [24][26][27][28].

Vision-Language Models (VLMs)
- VLMs have advanced to the point where robots can understand spatial layouts, object properties, and semantic information while executing tasks [46][47].
- Their evolution has moved from basic models to more sophisticated systems capable of rich multimodal understanding and interaction, broadening their applicability across tasks [53][54].

Future Directions
- Key challenges for deploying VLMs on robotic platforms include sensor heterogeneity, semantic discrepancies, and the need for real-time performance optimization [58].
- Future research may focus on structured spatial modeling, better system interpretability, and cognitive VLM architectures with long-term learning capabilities [58][59].
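The shift from explicit concatenation to implicit fusion can be shown in a few lines; the dimensions and modules below are arbitrary placeholders, not any surveyed architecture.

```python
import torch
import torch.nn as nn

# Two illustrative fusion styles from the survey's trajectory: explicit
# concatenation versus implicit fusion inside one model. Sizes are arbitrary.
rgb_feat = torch.randn(1, 196, 256)     # e.g. image patch tokens
pc_feat = torch.randn(1, 64, 256)       # e.g. point-cloud tokens

# (1) Early/explicit fusion: concatenate modality features, then project.
early = nn.Linear(256, 256)(torch.cat([rgb_feat, pc_feat], dim=1))

# (2) Implicit fusion: let one modality attend to the other inside the model.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
fused, _ = cross_attn(query=rgb_feat, key=pc_feat, value=pc_feat)
print(early.shape, fused.shape)
```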
How Do VLA Models Built on Large VLMs Advance Robotic Manipulation, Step by Step?
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article discusses the transformative impact of large Vision-Language Models (VLMs) on robotic manipulation, enabling robots to understand and execute complex tasks from natural-language instructions and visual cues [3][4][5].

Group 1: VLA Model Development
- The emergence of Vision-Language-Action (VLA) models built on large VLMs allows robots to interpret visual detail and human instructions and convert that understanding into executable actions [4][5].
- The survey traces the evolution of VLA models, categorizing them into monolithic and hierarchical architectures, and identifies key challenges and future directions (a toy hierarchical interface sketch follows this summary) [9][10][11].

Group 2: Research Contributions
- The survey from Harbin Institute of Technology (Shenzhen) covers VLA models comprehensively, detailing their definitions, core architectures, and integration with reinforcement learning and learning from human video [5][9][10].
- It aims to unify terminology and modeling assumptions in the VLA field, which is fragmented across robotics, computer vision, and natural language processing [17][18].

Group 3: Technical Advancements
- VLA models inherit the strengths of large VLMs, including open-world generalization, hierarchical task planning, knowledge-enhanced reasoning, and rich multimodal integration [13][64].
- The survey outlines the limitations of traditional robotic methods and how VLA models overcome them, handling unstructured environments and vague instructions effectively [16][24].

Group 4: Future Directions
- The survey stresses the need for 4D perception and memory mechanisms to support long-horizon task execution [5][16].
- It also argues for unified VLA frameworks that adapt across diverse tasks and environments [17][66].
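The monolithic-versus-hierarchical distinction in Group 1 can be illustrated with a toy hierarchical interface: a planner proposes language-level subgoals and a low-level policy turns each subgoal into motor commands. All classes and outputs below are made-up placeholders, not a surveyed system.

```python
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str

class StubPlanner:
    """Stand-in for a VLM-based high-level planner."""
    def plan(self, image, instruction):
        return [Subgoal("locate the cup"), Subgoal("grasp the cup"),
                Subgoal("place it on the shelf")]

class StubLowLevelPolicy:
    """Stand-in for a low-level controller that emits motor commands."""
    def act(self, image, subgoal):
        return [0.0] * 7                     # placeholder 7-DoF action

planner, policy = StubPlanner(), StubLowLevelPolicy()
for subgoal in planner.plan(image=None, instruction="put the cup on the shelf"):
    action = policy.act(image=None, subgoal=subgoal)
    print(subgoal.description, "->", action)
```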