How Are VLA Models Built on Large VLMs Advancing Robotic Manipulation, Step by Step?
具身智能之心· 2025-08-26 00:03
When robots can "understand" an instruction and then "do the work" on their own: how are large VLMs rewriting the rules of robotic manipulation? Imagine telling a robot, "Fold the shirts drying on the balcony and put them on the third shelf of the wardrobe," and having it locate the clothes, grasp the action logic behind "fold" and "put in," and even steer around the clutter inside the wardrobe to finish the task. A few years ago this would have sounded like science fiction: traditional robots were either trapped in a cage of predefined tasks, unable to recognize even a new cup, or stood helpless before vague natural-language instructions, let alone adapting their motions in cluttered real-world environments. Now a transformation driven by Vision-Language-Action (VLA) models is breaking through these limits, and the core force behind it is the large vision-language model (VLM) we have all come to know. In the past, research on robotic manipulation kept circling inside a "modularity trap": visual recognition, language parsing, and motion control each formed its own camp, like disconnected gears that could hardly turn together. It was not until large VLMs ...
3 Months to Complete Your Embodied "Brain + Cerebellum" Algorithm Curriculum
具身智能之心· 2025-08-25 00:04
In the pursuit of artificial general intelligence (AGI), embodied intelligence has gradually become one of the key directions. Unlike traditional preset action sequences, embodied intelligence emphasizes an agent's interaction with and adaptation to the physical environment, focusing on how to give agents the ability to perceive their surroundings, understand tasks, execute actions, and learn from feedback in the physical world. The two most important parts of embodied intelligence, the "brain" and the "cerebellum," form the core modules of an embodied robot: by analogy with humans, the brain handles thinking and perception (semantic understanding and task planning), while the cerebellum handles execution (high-precision motion control).
Industry landscape in China and abroad: over the past two years, many star embodied-intelligence teams have spun out of labs to found highly valuable companies. Teams such as 星海图, 银河通用, and 逐际动力 have moved from the laboratory into commercial and industrial settings, steadily advancing embodied hardware and brain/cerebellum technology. Among established domestic players, Huawei launched a "Global Embodied Intelligence Industry Innovation Center" at the end of 2024, partnering with companies such as 乐聚机器人 and 大族机器人 to jointly build key brain and cerebellum technologies; since May 2025, JD.com has made successive investments in 智元机器人, 千寻智能, 逐际动力, and other companies to strengthen efficiency and service capabilities in logistics technology and home-service scenarios. Tencent, Ant Group, Xiaomi, and other tech giants are likewise using strategic investment and partnerships to accelerate the construction of an embodied-intelligence industry ecosystem. Abroad, Tesla and Figure AI continue to push forward industrial and logistics robot applications ...
Reinforcement Learning and VLA / Flow Matching / Robot Control Algorithms, Viewed Through Method Paradigms and Application Scenarios
具身智能之心· 2025-08-19 01:54
Core Viewpoint
- The article reviews recent advances in reinforcement learning (RL) and its applications in robotics, focusing on Vision-Language-Action (VLA) models and diffusion policies and highlighting their potential to handle complex tasks that traditional RL struggles with [2][4][35].
Method Paradigms
- Traditional RL and imitation learning combined with Sim2Real techniques are foundational approaches in robotics [3].
- VLA models differ fundamentally from traditional RL by using the training data distribution to describe task processes and goals, allowing them to execute more complex tasks [4][35].
- Diffusion Policy uses diffusion models to generate continuous action sequences, demonstrating stronger capability on complex tasks than traditional RL methods [4][5].
Application Scenarios
- Applications fall into two main types: basic motion control for humanoid and quadruped robots, and complex, long-horizon manipulation tasks [22][23].
- Basic motion control still relies mainly on RL and Sim2Real, and current implementations have yet to achieve motion as fluid as that of humans or animals [22].
- For complex tasks, architectures typically pair a pre-trained Vision Transformer (ViT) encoder with a large language model (LLM) and use diffusion or flow matching for action output (see the sketch after this summary) [23][25].
Challenges and Future Directions
- Key challenges include the need for better simulation environments, effective domain randomization, and the integration of external goal conditions [35].
- Human intention is central to task definition, and current models struggle to learn complex tasks without extensive human demonstration data [35][40].
- Future work may involve multi-modal input prediction of task goals and possibly brain-machine interfaces to enhance human-robot interaction [35].
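To make the flow-matching action output mentioned above concrete, here is a minimal, hedged sketch in PyTorch: a small velocity network is trained to carry Gaussian noise to demonstrated action chunks, conditioned on a fused vision-language context. All names and dimensions (`VelocityNet`, `ctx_dim`, the Euler step count) are illustrative assumptions, not the implementation of any paper summarized here.

```python
# Minimal flow-matching action head (illustrative sketch, not any paper's code).
# Assumes a fused vision-language context vector is already available.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the velocity that moves noisy actions toward expert actions."""
    def __init__(self, action_dim=7, horizon=8, ctx_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + ctx_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, ctx, t):
        # noisy_actions: (B, horizon*action_dim), ctx: (B, ctx_dim), t: (B, 1)
        return self.net(torch.cat([noisy_actions, ctx, t], dim=-1))

def flow_matching_loss(model, expert_actions, ctx):
    """Train the network to predict the straight-line velocity x1 - x0."""
    x1 = expert_actions                                  # demonstrated action chunk
    x0 = torch.randn_like(x1)                            # Gaussian noise sample
    t = torch.rand(x1.shape[0], 1, device=x1.device)     # random interpolation time
    xt = (1 - t) * x0 + t * x1                           # point on the straight path
    target_v = x1 - x0                                   # constant velocity along that path
    pred_v = model(xt, ctx, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_actions(model, ctx, steps=10, horizon=8, action_dim=7):
    """Integrate the learned velocity field from noise to an action chunk."""
    x = torch.randn(ctx.shape[0], horizon * action_dim, device=ctx.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i * dt, device=ctx.device)
        x = x + dt * model(x, ctx, t)                    # simple Euler step
    return x.view(-1, horizon, action_dim)
```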
Spec-VLA: The First Speculative Decoding Framework Designed for VLA Inference Acceleration
具身智能之心· 2025-08-02 16:02
Core Viewpoint
- The article discusses the development of Spec-VLA, a speculative decoding framework designed to accelerate Vision-Language-Action (VLA) models, addressing challenges related to computational demands and decoding delays [3][4][16].
Research Background and Motivation
- VLA models have shown significant progress in generating robot action sequences based on language instructions, but they face challenges such as the large parameter size of backbone Vision-Language Models (VLMs) and increased decoding latency due to autoregressive decoding strategies [3].
- Existing acceleration methods have limitations, necessitating a tailored approach for VLA models [3].
Core Framework: Spec-VLA
- Spec-VLA introduces a collaborative mechanism between draft and validation models to enhance inference speed, using a draft model to predict action tokens and a validation model to ensure output quality [4][5].
Key Mechanism: Relaxed Acceptance
- The relaxed acceptance mechanism defines a threshold on the acceptable distance between draft and validation model predictions, enabling a more efficient decoding process without significant computational overhead (a hedged sketch of this loop follows below) [7][10].
Experimental Validation
- The framework was evaluated on the LIBERO simulation benchmark across four task sets, demonstrating significant improvements in speed and acceptance length while maintaining success rates [9][10].
- The introduction of relaxed acceptance led to an acceleration factor of 1.22× to 1.42×, with acceptance length increasing by 25%-44% [10][11].
Key Results
- As the relaxation threshold increases, acceptance length improves significantly while success rates remain stable across datasets [10][11].
- Case studies show that relaxed conditions reduce the number of iterations needed to complete action sequences, validating the effectiveness of the relaxed acceptance mechanism [13].
Conclusion and Limitations
- Spec-VLA demonstrates the potential of speculative execution in VLA prediction tasks, achieving a speedup of 1.42× and a 44% increase in acceptance length without compromising success rates [16].
- Limitations include the lack of real-world robot scenario testing and of exploration of action chunking strategies [16].
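Below is a rough sketch of the draft-and-verify loop with relaxed acceptance described above, applied to discretized action tokens. The interfaces (`draft_model.propose`, `target_model.greedy_tokens`) and the integer-bin distance tolerance are hypothetical stand-ins; the actual Spec-VLA implementation may differ.

```python
# Illustrative speculative decoding with relaxed acceptance for discretized
# action tokens (a sketch; the model interfaces are assumed, not Spec-VLA's API).

def speculative_decode(draft_model, target_model, prompt, max_len=64,
                       k=4, tolerance=2):
    """Generate action tokens; accept a draft token when its distance to the
    validation model's choice is within `tolerance` action bins."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        # 1) The small draft model cheaply proposes k candidate tokens.
        draft = draft_model.propose(tokens, k)                        # hypothetical API
        # 2) The large validation model scores the extension in one pass.
        target = target_model.greedy_tokens(tokens, len(draft))       # hypothetical API
        accepted = []
        for d, t in zip(draft, target):
            if abs(d - t) <= tolerance:
                accepted.append(d)        # relaxed acceptance: close enough
            else:
                accepted.append(t)        # fall back to the validation token
                break                     # stop accepting further draft tokens
        tokens.extend(accepted)
    return tokens[:max_len]
```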
A Survey of 102 VLA Models, 26 Datasets, and 12 Simulation Platforms
自动驾驶之心· 2025-07-22 02:18
Core Viewpoint
- The article surveys the transformative breakthrough of Vision-Language-Action (VLA) models in robotics, emphasizing their integration of visual perception, natural language understanding, and embodied control within a unified learning framework. It reviews 102 VLA models, 26 foundational datasets, and 12 simulation platforms, identifying current challenges and future directions for enhancing robotic autonomy and adaptability [3][4][6].
Group 1: VLA Models and Framework
- VLA models represent a new frontier in robotic intelligence, enabling robots to perceive visual environments, understand natural language commands, and execute meaningful actions, bridging the semantic gap between modalities [7][9].
- A typical VLA architecture integrates visual, language, and proprioceptive encoders with a diffusion backbone network to generate control commands, enabling end-to-end processing of multimodal inputs (a schematic sketch follows this summary) [11][12].
- Effective VLA models rely on large-scale, diverse multimodal datasets and realistic simulation platforms, which are crucial for training models that robustly understand language instructions and perceive visual environments [5][30].
Group 2: Datasets and Evaluation
- Early VLA datasets focused on discrete decision-making in constrained environments, while recent datasets incorporate richer sensory streams and longer task durations to address complex multimodal control [21][22][29].
- A comprehensive benchmarking strategy is proposed to evaluate datasets by task complexity and modality richness, highlighting the need for new datasets that combine high task difficulty with extensive multimodal inputs [24][28].
- The analysis reveals a gap in current VLA benchmarks in combining long-duration, multi-skill control with diverse multimodal integration, a promising direction for future dataset development [29][43].
Group 3: Simulation Tools
- Simulation environments are critical for VLA research, enabling large-scale, repeatable, and richly annotated data generation beyond what the physical world allows [30][31].
- Advanced simulation platforms such as AI2-THOR and NVIDIA Isaac Sim provide high-fidelity physics and customizable multimodal sensors, essential for developing robust VLA models [32][33].
- Integrating simulation tools with VLA datasets accelerates the co-development of control algorithms and benchmark datasets, ensuring advances in multimodal perception are evaluated before deployment on real robotic platforms [30][33].
Group 4: Applications and Challenges
- VLA models are grouped into six broad application areas, including manipulation and task generalization, autonomous mobility, human assistance, and interaction, showing their versatility across robotic tasks [34][35].
- Key architectural challenges include tokenization and vocabulary alignment, modality fusion, and cross-embodiment generalization, all of which must be addressed to improve performance and adaptability [39][40][41].
- Data challenges include task diversity, modality imbalance, annotation quality, and the trade-off between realism and scale in datasets, which hinder the development of robust general-purpose VLA models [42][43].
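As a schematic of the encoder-fusion architecture the survey describes, the sketch below projects visual, language, and proprioceptive features into one token stream processed by a small Transformer trunk. Layer sizes and module names are assumptions for illustration; a diffusion or flow-matching head would replace the final linear head in a real system.

```python
# Schematic VLA backbone fusing three modalities into one token stream
# (an illustrative sketch; dimensions and names are assumptions, not the survey's code).
import torch
import torch.nn as nn

class TinyVLABackbone(nn.Module):
    def __init__(self, d_model=256, n_layers=4, action_dim=7):
        super().__init__()
        self.vision_proj = nn.Linear(768, d_model)    # e.g. ViT patch features
        self.text_proj = nn.Linear(512, d_model)      # e.g. text-encoder / LLM tokens
        self.proprio_proj = nn.Linear(14, d_model)    # joint positions + gripper state
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vis_tokens, text_tokens, proprio):
        # vis_tokens: (B, Nv, 768), text_tokens: (B, Nt, 512), proprio: (B, 14)
        tokens = torch.cat([
            self.vision_proj(vis_tokens),
            self.text_proj(text_tokens),
            self.proprio_proj(proprio).unsqueeze(1),
        ], dim=1)
        fused = self.trunk(tokens)
        # Pool and map to a control command; a diffusion/flow head could replace this.
        return self.action_head(fused.mean(dim=1))
```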
A New Paradigm for VLA Inference: Consistency-Model CEED-VLA Achieves 4× Acceleration
机器之心· 2025-07-13 04:58
Core Viewpoint
- The article discusses advances in Vision-Language-Action (VLA) models, focusing on CEED-VLA, which significantly improves inference speed while maintaining high task success rates in robotic applications [2][8][24].
Group 1: VLA Model Overview
- VLA models have become a crucial research direction in robotics thanks to their strong multimodal understanding and generalization capabilities [2].
- Despite these advances, VLA models face significant inference-speed bottlenecks, especially in high-frequency and precise tasks [2].
Group 2: Proposed Solutions
- A consistency distillation training strategy lets the model predict multiple correct action tokens simultaneously, increasing decoding speed [4].
- A mixed-label supervision mechanism mitigates potential error accumulation during distillation [4][9].
- An early-exit decoding strategy addresses the inefficiency of Jacobi decoding by relaxing its convergence conditions, improving average inference efficiency (a hedged sketch of this idea follows below) [5][10].
Group 3: Experimental Results
- The proposed methods achieve over 4× inference acceleration across multiple baseline models while maintaining high task success rates in both simulated and real-world robotic tasks [8][18].
- CEED-VLA raises manipulation task success rates to over 70%, owing to its higher inference speed and control frequency [24].
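A hedged sketch of Jacobi-style parallel decoding with an early-exit criterion, in the spirit of the strategy described above; `model.parallel_greedy` and the stability threshold are hypothetical names, not CEED-VLA's actual API.

```python
# Illustrative Jacobi-style parallel decoding with early exit
# (a sketch; the model interface is assumed, not CEED-VLA's code).
import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, block_len=16, max_iters=16, stable_frac=0.9):
    """Iteratively refine a block of action tokens in parallel; exit early once
    most positions stop changing instead of waiting for full convergence."""
    device = prompt_ids.device
    # Initialize the block with a fixed guess; any initialization works for Jacobi iteration.
    block = torch.zeros(block_len, dtype=torch.long, device=device)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, block])
        # One parallel forward pass re-predicts every position of the block.
        new_block = model.parallel_greedy(seq, block_len)   # hypothetical API
        changed = (new_block != block).float().mean()
        block = new_block
        # Early exit: accept the block once enough positions are stable.
        if 1.0 - changed >= stable_frac:
            break
    return block
```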
The VLA Explosion: From RT-2 in the US to FiS-VLA in China, the Ultimate Evolution of Robots
具身智能之心· 2025-07-09 14:38
Core Viewpoint
- The article emphasizes the rapid evolution and significance of Vision-Language-Action (VLA) models in embodied intelligence, highlighting their potential to revolutionize human-robot interaction and the robotics industry as a whole [4][6][17].
Group 1: VLA Model Development
- VLA models are becoming the core driving force in embodied intelligence, gaining traction among researchers and companies globally [7][8].
- Google recently released the first offline VLA model, enabling robots to perform tasks without internet connectivity [9].
- The Fast-in-Slow (FiS-VLA) model from China represents a significant advance, integrating fast and slow systems to improve both control efficiency and reasoning capability (a hedged dual-rate sketch follows this summary) [10][12].
Group 2: Academic and Industry Trends
- Academic output on VLA has grown explosively, with 1,390 papers published this year alone, accounting for nearly half of all related research [14].
- VLA technology is helping robots move from laboratory settings to real-world applications, indicating its vast potential [16][17].
Group 3: Key Innovations and Breakthroughs
- Google's RT-2 marked a pivotal moment in VLA development, introducing a unified model architecture that integrates visual, language, and action modalities [38][40].
- RoboMamba, developed in China, significantly improved the efficiency and reasoning capability of VLA models, achieving roughly a threefold increase in inference speed over mainstream models [52][48].
- OpenVLA demonstrated superior performance across various tasks while being more efficient than earlier models, achieving a 16.5% higher success rate than RT-2 [57][58].
Group 4: Future Directions and Implications
- The π series of models aims to enhance VLA's generalization, allowing robots to perform complex tasks with minimal training [62][70].
- FiS-VLA represents a breakthrough in real-time control, achieving an 11% improvement in success rate in real environments compared with existing methods [114].
- Advances in VLA technology are paving the way for robots to operate effectively in diverse environments, a significant step toward Artificial General Intelligence (AGI) [127][123].
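To illustrate the fast-in-slow idea at a systems level, here is a hedged sketch of a dual-rate control loop: a slow policy reasons over observations at low frequency and emits a latent plan, while a fast policy consumes the latest plan at high frequency. The callables, rates, and plan representation are assumptions for illustration, not FiS-VLA's implementation.

```python
# Hedged sketch of a dual-rate "fast/slow" control loop
# (all names and rates are illustrative assumptions).
import time

def run_dual_system(slow_policy, fast_policy, get_obs, send_command,
                    slow_hz=2, fast_hz=30, duration_s=10.0):
    """Slow system: multimodal reasoning at low rate produces a latent plan.
    Fast system: lightweight controller consumes that plan at high rate."""
    latent_plan = None
    next_slow = 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        now = time.time()
        obs = get_obs()
        if now >= next_slow:
            # Low-frequency deliberation over images + language instruction.
            latent_plan = slow_policy(obs)            # hypothetical call
            next_slow = now + 1.0 / slow_hz
        if latent_plan is not None:
            # High-frequency reactive control conditioned on the latest plan.
            action = fast_policy(obs, latent_plan)    # hypothetical call
            send_command(action)
        time.sleep(1.0 / fast_hz)
```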
From Coordinate Chaos to Spatiotemporal Alignment: Noah's Ark Lab and Fudan University Jointly Propose 4D-VLA, Improving Robot Pretraining Efficiency and Robustness
具身智能之心· 2025-07-06 11:54
Core Insights
- The article introduces 4D-VLA, a pretraining method that integrates 3D spatial information and historical frames to improve model performance in complex scenarios, addressing the limitations of traditional single-frame RGB and text inputs [4][10][18].
Group 1: Limitations of Existing Paradigms
- Current mainstream methods such as OpenVLA rely solely on single-frame RGB images and text instructions, leading to chaotic target distributions and slow convergence due to high variance [7][8].
- The lack of complete input information results in coordinate-system chaos and state chaos, which severely degrade pretraining efficiency [5][9].
Group 2: Proposed Solutions
- 4D-VLA uses depth maps and camera extrinsics to project each pixel into world coordinates and embeds 3D positional encoding, aligning visual tokens with the robot's coordinate frame and reducing coordinate ambiguity (a hedged geometric sketch follows this summary) [10][18].
- A controlled experiment quantifies the impact of coordinate chaos on VLA models, showing that adding 3D information significantly improves robustness and convergence speed [11][17].
Group 3: Experimental Setup and Results
- The DROID dataset, comprising 76,000 human demonstration trajectories across varied tasks, serves as the foundation for pretraining, while the LIBERO simulation suite is used for downstream evaluation [29][30].
- 4D-VLA outperforms existing methods across tasks, achieving an average success rate of 88.6% over the evaluation settings and showing superior spatial awareness and generalization [33][39].
Group 4: Real-World Evaluation
- In real-world tests, 4D-VLA shows improved precision and robustness on tasks involving spatial generalization, robustness to distractors, precise placement, and structured instruction execution [44][49].
- The model maintains high success rates even under unseen camera angles, indicating effective adaptation to new environments and conditions [57][58].
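The geometric step that 4D-VLA's description relies on, back-projecting pixels into world coordinates from a depth map and camera parameters, can be sketched as follows. Variable names and shapes are assumptions, and camera intrinsics are assumed to be available alongside the extrinsics; this is not the paper's code.

```python
# Minimal sketch of back-projecting pixels into world coordinates
# (an illustrative assumption-laden sketch, not 4D-VLA's implementation).
import numpy as np

def pixels_to_world(depth, K, T_world_cam):
    """depth: (H, W) metric depth; K: (3, 3) intrinsics;
    T_world_cam: (4, 4) camera-to-world extrinsic. Returns (H, W, 3) world points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                  # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # unit-depth camera rays
    cam_pts = rays * depth[..., None]                                # scale by metric depth
    cam_h = np.concatenate([cam_pts, np.ones((H, W, 1))], axis=-1)   # homogeneous camera points
    world = cam_h @ T_world_cam.T                                    # apply extrinsic transform
    return world[..., :3]
```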
Beihang × NUS × SJTU Release RoboCerebra: A New Benchmark for Long-Horizon Robotic Manipulation Reasoning
具身智能之心· 2025-06-28 07:48
Core Insights
- The article presents RoboCerebra, a new benchmark for evaluating long-horizon robotic manipulation, emphasizing the need for collaboration between high-level planning (VLM) and low-level control (VLA) models [6][8][10].
Group 1: Background and Motivation
- Recent advances in vision-language models (VLMs) enable robots to execute commands from natural language, but as tasks grow more complex a dual system is needed: a "brain" (VLM) for planning and a "controller" (VLA) for execution [6][7].
- Existing benchmarks often fail to assess how these two systems collaborate, motivating RoboCerebra's focus on long-term planning and memory management [8].
Group 2: RoboCerebra Contributions
- RoboCerebra includes a large-scale dataset and a systematic benchmark that probes cognitive challenges around planning, memory, and reflection in robotic tasks [10].
- The dataset construction pipeline combines automated generation with manual annotation to ensure quality and scalability [10].
Group 3: Task Setting
- The benchmark features long task sequences averaging 2,972 steps, with dynamic disturbances introduced to stress the models' planning and recovery abilities [13].
- A top-down data generation pipeline uses GPT to create high-level tasks, which are then decomposed into sub-goals and verified for feasibility [13][14].
Group 4: Evaluation Protocol and Metrics
- RoboCerebra uses a four-dimensional evaluation covering success rate, plan match accuracy, plan efficiency, and action completion accuracy to assess the collaboration between VLM and VLA (a hedged aggregation sketch follows this summary) [15][21].
- Anchor points synchronize evaluation across different models, ensuring consistency in task execution [21].
Group 5: Experimental Results
- The hierarchical framework shows that VLM-VLA collaboration significantly improves task success rates, particularly in memory-execution scenarios, with gains exceeding 70% [27].
- Neither the VLA nor the VLM alone handles long-horizon tasks effectively, underscoring the necessity of their integration [27][28].
Group 6: Model Evaluation
- GPT-4o outperforms other models in planning accuracy, task success rate, and plan efficiency, underscoring the importance of strong language reasoning for long-horizon tasks [30].
- In memory-related tasks, GPT-4o shows stronger exploration and execution decision-making than other models, indicating robust scene understanding and memory recall [31].
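A hedged sketch of how the four kinds of metrics named above might be aggregated from episode logs. The exact metric definitions here are assumptions for illustration, not RoboCerebra's official formulas.

```python
# Illustrative aggregation of success / plan-match / efficiency / completion metrics
# (a sketch under assumed definitions, not the benchmark's official evaluation code).
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool                 # did the full long-horizon task succeed
    planned_steps: list[str]      # sub-goals the VLM planner emitted
    reference_steps: list[str]    # ground-truth sub-goal sequence
    completed_steps: int          # sub-goals the VLA controller finished

def evaluate(episodes):
    n = len(episodes)
    success_rate = sum(e.success for e in episodes) / n
    plan_match = sum(
        sum(p == r for p, r in zip(e.planned_steps, e.reference_steps)) /
        max(len(e.reference_steps), 1)
        for e in episodes
    ) / n
    plan_efficiency = sum(
        min(len(e.reference_steps) / max(len(e.planned_steps), 1), 1.0)
        for e in episodes
    ) / n
    completion = sum(
        e.completed_steps / max(len(e.reference_steps), 1) for e in episodes
    ) / n
    return {"success_rate": success_rate, "plan_match": plan_match,
            "plan_efficiency": plan_efficiency, "action_completion": completion}
```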