Vision-Language-Action (VLA) Models
NUS Proposes VLA-4D: A 4D-Aware VLA Model for Spatiotemporally Coherent Robotic Manipulation
具身智能之心· 2025-11-25 00:03
Author: Hanyu Zhou et al.

I. Why a 4D-aware VLA model

Vision-Language-Action (VLA) models have shown promise on general-purpose robotic tasks, but they still face bottlenecks on manipulation tasks that demand fine-grained, spatiotemporally coherent representations:

- 2D VLA models rely on single-frame image input, so their visual reasoning is coarse, and a 2D-3D coordinate mismatch leaves actions spatially imprecise and spatiotemporally discontinuous (figure 1a);
- 3D VLA models embed 3D positions into visual features to improve spatial smoothness, but lack awareness of the temporal dimension ...

The core goal is to fuse spatial and temporal information so that both visual reasoning and action planning become finer-grained, unifying spatial smoothness and temporal coherence in robotic manipulation.

II. Core design and technical details

2.1 Overall framework

VLA-4D's core innovation is a dual spatiotemporal fusion: 4D (3D space + 1D time) information is embedded into the visual representation for reasoning, a time variable is folded into the action representation for planning, and multimodal alignment lets the large language model (LLM) output spatiotemporally coherent action commands (figure 2).
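The preview cuts off before the implementation details, but the dual-fusion idea above lends itself to a minimal sketch: add learned embeddings of each patch's 3D position and of the frame timestamp to the visual tokens before they reach the LLM. Everything here (module name, tensor shapes, choice of linear projections) is an illustrative assumption, not VLA-4D's actual code.

```python
import torch
import torch.nn as nn

class SpatioTemporalVisualEmbedding(nn.Module):
    """Adds 4D (3D space + 1D time) information to per-patch visual tokens."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.xyz_proj = nn.Linear(3, dim)  # embeds each patch's 3D position
        self.t_proj = nn.Linear(1, dim)    # embeds the frame timestamp

    def forward(self, feats, xyz, t):
        # feats: (B, N, dim) visual tokens; xyz: (B, N, 3); t: (B, 1) seconds
        return feats + self.xyz_proj(xyz) + self.t_proj(t).unsqueeze(1)

tokens = torch.randn(2, 196, 512)  # hypothetical patch features
xyz = torch.rand(2, 196, 3)        # hypothetical back-projected 3D positions
t = torch.rand(2, 1)               # hypothetical frame timestamps
fused = SpatioTemporalVisualEmbedding()(tokens, xyz, t)
print(fused.shape)  # torch.Size([2, 196, 512])
```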
NTU Proposes NORA-1.5: A VLA Model Based on World Models and Action Rewards
具身智能之心· 2025-11-21 00:04
Author: Chia-Yu Hung et al.

Nanyang Technological University and collaborating institutions propose NORA-1.5, which integrates a flow-matching action expert with reward-driven Direct Preference Optimization (DPO) post-training to address the limited generalization and reliability of existing Vision-Language-Action (VLA) models, achieving state-of-the-art performance in both simulated and real-robot settings.

Core positioning and key problems addressed
Architecture design: co-optimizing flow matching with the VLA backbone
VLA backbone foundations

Paper title: NORA-1.5: A Vision-Language-Action Model Trained using World Model and Action-based Preference Rewards
Paper link: https://arxiv.org/pdf/2511.14659
Project page: https://declare-lab.github.io/nora-1.5
Code ...
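To make the flow-matching action expert concrete, below is a minimal training-step sketch under standard rectified-flow conventions: interpolate linearly between Gaussian noise and an expert action chunk, and regress the constant velocity along that path. The `ActionExpert` network, chunk horizon, and conditioning dimension are hypothetical, not NORA-1.5's actual architecture.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Predicts a velocity field over a flattened action chunk."""
    def __init__(self, act_dim=7, horizon=8, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, a_t, cond, t):
        # a_t: (B, H*A) noisy action chunk; cond: (B, C) VLM features; t: (B, 1)
        return self.net(torch.cat([a_t, cond, t], dim=-1))

def flow_matching_loss(model, actions, cond):
    """Rectified-flow objective: regress the velocity along a linear path."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * noise + t * actions  # interpolate noise -> expert actions
    v_pred = model(a_t, cond, t)
    return ((v_pred - (actions - noise)) ** 2).mean()

model = ActionExpert()
loss = flow_matching_loss(model, torch.randn(4, 56), torch.randn(4, 512))
```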
VLA Models Failing Across the Board? Prof. Xipeng Qiu's Team at Fudan & Shanghai Innovation Institute Proposes LIBERO-Plus, Revealing the Truth About VLA Fragility
具身智能之心· 2025-10-29 00:03
Core Insights
- The article discusses the robustness analysis of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high performance scores in ideal conditions [2][4][6]
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across various perturbation dimensions, highlighting the gap between surface performance and actual generalization capabilities [4][6][33]

Group 1: Motivation and Contributions
- VLA models have achieved impressive success rates in benchmarks like LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variations [4][6]
- LIBERO-Plus evaluates models along seven perturbation dimensions: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise (an illustrative sweep harness is sketched after this entry) [4][6]
- The framework provides a detailed analysis of VLA models' generalization performance through systematic perturbation [4][6]

Group 2: Performance Analysis
- The analysis reveals that VLA models exhibit significant overall vulnerability to perturbations, with performance declining across all dimensions [13][32]
- Models are most sensitive to changes in camera perspective and robot initial state, indicating a need for high-level spatial and proprioceptive understanding [13][32]
- Language perturbations lead to the smallest average performance drop (-25.3%), suggesting a surprising level of robustness that warrants further investigation [15][17]

Group 3: Findings on Model Behavior
- Some models maintain performance even with empty language inputs, indicating a tendency to ignore the language modality and behave more like visual-action (VA) models [16][19]
- VLA models struggle with cross-object instruction following, relying more on fixed visual-action mappings than on fully leveraging language signals [19][20]
- The models demonstrate remarkable adaptability to background changes while showing limited sensitivity to lighting variations, raising questions about the representations they learn [20][27]

Group 4: Combination Generalization
- The concept of a "combination generalization gap" is introduced, highlighting negative interactions between different perturbations that exceed the independent effects of single perturbations [29][32]
- The analysis indicates that current VLA models lack the ability to effectively handle complex multi-dimensional perturbations due to entangled representations [32]

Group 5: LIBERO-Plus Benchmark
- The LIBERO-Plus benchmark consists of 10,030 tasks designed to evaluate model performance under various perturbations, constructed using perturbation-augmentation strategies [33][36]
- The benchmark features comprehensive coverage of the seven perturbation dimensions and fine-grained difficulty levels [36]
- Models trained with augmented data achieved an average success rate of 79.6% on LIBERO-Plus, significantly outperforming baseline models [38]
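An illustrative harness for this kind of perturbation sweep might look like the following; `run_episode`, the policy/task placeholders, and the reporting format are hypothetical stand-ins, not the LIBERO-Plus API — only the seven dimension names mirror the list above.

```python
import itertools
import random

PERTURBATIONS = ["object_placement", "camera_angle", "robot_init_pose",
                 "language", "lighting", "background", "sensor_noise"]

def run_episode(policy, task, perturbation=None, seed=0):
    """Placeholder rollout: a real harness would reset the simulator with
    the perturbation applied, roll out the policy, and return success."""
    random.seed(hash((task, perturbation, seed)))
    return random.random() > 0.5

def evaluate(policy, tasks, n_seeds=5):
    results = {}
    for p in [None] + PERTURBATIONS:
        outcomes = [run_episode(policy, task, p, seed)
                    for task, seed in itertools.product(tasks, range(n_seeds))]
        results[p or "clean"] = sum(outcomes) / len(outcomes)
    clean = results["clean"]
    # Report each dimension's success-rate drop relative to the clean setting.
    return {k: round(v - clean, 3) for k, v in results.items() if k != "clean"}

print(evaluate(policy=None, tasks=["put_cup_on_plate", "open_drawer"]))
```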
SFT or RL: How Should VLA Models Actually Be Trained?
具身智能之心· 2025-10-28 00:02
Core Insights
- The articles focus on advancements in reinforcement learning (RL) and its application to Vision-Language-Action (VLA) models, highlighting significant improvements in generalization capabilities and training efficiency.

Group 1: Research Findings
- The first study investigates how RL enhances the generalization ability of VLA models, addressing the error accumulation and distribution shift induced by supervised fine-tuning (SFT). A new benchmark covering visual, semantic, and execution dimensions was established, showing that RL fine-tuning with Proximal Policy Optimization (PPO) significantly improves semantic understanding and execution robustness while matching SFT on visual generalization (the PPO objective is sketched after this entry) [2].
- The second study introduces RLinf-VLA, a framework designed for large-scale RL training of VLA models. It proposes a novel solution to the challenges of integrating RL and VLA training, achieving up to 2.27x acceleration over baseline methods. The framework supports multiple VLA architectures and RL algorithms, achieving a 98.11% success rate across 130 LIBERO tasks [3].

Group 2: Practical Applications
- RLinf-VLA summarizes best practices for applying RL to VLA training and provides a unified interface that supports multiple VLA architectures and simulators, lowering the barrier to large-scale RL for VLA applications [3].
- The research emphasizes the importance of RL in enhancing VLA model performance, suggesting a shift towards more efficient training methodologies that leverage RL's strengths [15].
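For reference, the clipped surrogate objective that PPO optimizes is sketched below in generic form; this is textbook PPO, not code from either study or from RLinf-VLA.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be minimized)."""
    # logp_*: (B,) log-probs of the sampled action tokens; advantages: (B,)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(torch.randn(8), torch.randn(8), torch.randn(8))
```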
Your VLA Is Too Slow?! Speed It Up Even on Limited Compute: This Survey Shows How to Build a New Paradigm of Efficient VLA
具身智能之心· 2025-10-24 16:03
Core Insights
- The article emphasizes the importance of efficiency in Vision-Language-Action (VLA) models, which are crucial for enabling robots to understand their environment and execute tasks effectively. It identifies efficiency as a key bottleneck that hinders the transition of VLA models from research to practical applications [3][4][7]

Background and Value
- The rapid development of embodied intelligence has led to the emergence of VLA models as a core framework for robotic task execution. However, current VLA systems face significant challenges related to computational and storage demands, as well as high inference latency, which are critical for real-time applications [3][4][7]

Efficiency Bottlenecks
- The review systematically analyzes the efficiency issues in VLA models across four dimensions: model architecture, perception features, action generation, and training/inference processes. It highlights that efficiency challenges are systemic and not limited to single-point optimizations [3][4][7]

Classification Framework
- The article categorizes existing efficient VLA strategies into four complementary dimensions: efficient architecture design, perception feature compression, action generation acceleration, and training/inference optimization. This classification provides a comprehensive understanding of the design logic and trade-offs of current methods [4][6][7]

Future Trends and Directions
- The review outlines future directions for VLA models, emphasizing the need for a balance between capability enhancement and computational cost. Key areas for efficiency optimization include data utilization, perception features, action generation, and learning strategies [4][25][26]

Efficient Perception Features
- Optimizing visual input, which constitutes the largest computational overhead in VLA models, can be approached through selective processing of features and temporal feature reuse. These strategies aim to reduce redundant computation while maintaining performance (see the token-pruning sketch after this entry) [13][15][16]

Efficient Action Generation
- Action generation strategies focus on minimizing latency while ensuring task accuracy. Techniques include outputting low-dimensional continuous action vectors and introducing explicit reasoning to enhance interpretability and generalization across tasks [18][21]

Efficient Training and Inference
- Training strategies aim to reduce adaptation costs for new tasks and environments through methods like parameter-efficient fine-tuning and knowledge distillation. Inference strategies focus on breaking the autoregressive bottleneck to enable parallelization and mixed decoding [22][24]

Future Outlook
- The article suggests that future VLA models should prioritize collaborative design between models and data, efficient spatiotemporal perception, and robust action encoding. It also calls for a standardized evaluation framework to measure efficiency improvements [25][26][27]
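As one concrete instance of the perception-side strategies above, selective token processing can be sketched as ranking patch tokens by a saliency score and keeping only the top fraction. The scoring rule here (dot product with a query vector) and the keep ratio are illustrative assumptions, not a specific surveyed method.

```python
import torch

def prune_tokens(tokens, query, keep_ratio=0.25):
    """Keep only the top-k visual tokens ranked by a saliency score."""
    # tokens: (B, N, D) patch features; query: (B, D), e.g. a [CLS]-style vector
    scores = torch.einsum("bnd,bd->bn", tokens, query)        # per-token saliency
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                       # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # (B, k, D)
    return torch.gather(tokens, 1, idx)

kept = prune_tokens(torch.randn(2, 196, 512), torch.randn(2, 512))
print(kept.shape)  # torch.Size([2, 49, 512]) -- 4x fewer tokens downstream
```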
A Pure VLA Survey Is Here! From VLMs to Diffusion to Reinforcement Learning Approaches
自动驾驶之心· 2025-09-30 16:04
Core Insights
- The article discusses the emergence and potential of Vision Language Action (VLA) models in robotics, emphasizing their ability to integrate perception, language understanding, and action execution into a unified framework [10][16]

Group 1: Introduction and Background
- Robotics has evolved from relying on pre-programmed instructions to utilizing deep learning for multi-modal data processing, enhancing capabilities in perception and action [1][10]
- The introduction of large language models (LLMs) and vision-language models (VLMs) has significantly improved the flexibility and precision of robotic operations [1][10]

Group 2: Current State of VLA Models
- VLA methods are categorized into four paradigms: autoregressive, diffusion, reinforcement learning, and hybrid/specialized methods, each with unique strategies and mechanisms (an action-tokenization sketch for the autoregressive paradigm follows this entry) [7][9]
- The development of VLA models is heavily dependent on high-quality datasets and realistic simulation platforms, which are crucial for training and evaluation [15][17]

Group 3: Challenges and Future Directions
- Key challenges in VLA research include data limitations, reasoning speed, and safety concerns, which need to be addressed to advance the field [7][9]
- Future research directions are identified, focusing on enhancing generalization capabilities, improving interaction with dynamic environments, and ensuring robust performance in real-world applications [16][17]

Group 4: Methodological Innovations
- The article highlights the transition from traditional robotic systems to VLA models, which unify visual perception, language understanding, and executable control in a single framework [13][16]
- Innovations in VLA methodologies include the integration of autoregressive models for action generation, diffusion models for probabilistic action generation, and reinforcement learning for policy optimization [18][32]

Group 5: Applications and Impact
- VLA models have been applied across various robotic platforms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their versatility [7][15]
- The integration of VLA models is seen as a significant step towards achieving general embodied intelligence, enabling robots to perform a wider range of tasks in diverse environments [16][17]
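To illustrate the action interface of the autoregressive paradigm, here is a common discretization scheme in which continuous action dimensions are binned into token ids that an LLM can emit one at a time; the bin count and action range are assumptions, not any particular model's settings.

```python
import numpy as np

N_BINS = 256          # assumed vocabulary size reserved for action tokens
LOW, HIGH = -1.0, 1.0 # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map each action dimension to a discrete bin id in [0, N_BINS)."""
    norm = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((norm * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert tokenization by decoding each token to its bin center."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([0.1, -0.5, 0.9])  # e.g. end-effector deltas + gripper
assert np.allclose(tokens_to_action(action_to_tokens(a)), a, atol=1 / N_BINS)
```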
A Pure VLA Survey Is Here! From VLMs to Diffusion to Reinforcement Learning Approaches
具身智能之心· 2025-09-30 04:00
Core Insights
- The article discusses the evolution and potential of Vision Language Action (VLA) models in robotics, emphasizing their integration of perception, language understanding, and action generation to enhance robotic capabilities [11][17]

Group 1: Introduction and Background
- Robotics has traditionally relied on pre-programmed instructions and control strategies, limiting their adaptability in dynamic environments [2][11]
- The emergence of VLA models marks a significant advancement in embodied intelligence, combining visual perception, language understanding, and executable actions into a unified framework [11][12]

Group 2: VLA Methodologies
- VLA methods are categorized into four paradigms: autoregressive, diffusion, reinforcement learning, and hybrid/specialized methods, each with unique strategies and mechanisms [8][10]
- The article highlights the importance of high-quality datasets and realistic simulation platforms for the development and evaluation of VLA models [16][18]

Group 3: Challenges and Future Directions
- Key challenges identified include data limitations, reasoning speed, and safety concerns, which need to be addressed to advance VLA models and general robotics [10][17]
- Future research directions focus on enhancing the robustness and generalization of VLA models in real-world applications, emphasizing the need for efficient training paradigms and safety assessments [44][47]
AnywhereVLA: Running VLA in Real Time on Consumer-Grade Hardware
具身智能之心· 2025-09-29 02:08
Core Background and Objectives
- Mobile manipulation is expanding from closed, structured workcells to open, unstructured large indoor environments, requiring robots to explore unfamiliar, cluttered spaces, interact with diverse objects and humans, and respond to natural language commands for tasks such as home service, retail automation, and warehouse logistics [3]
- AnywhereVLA proposes a modular architecture that integrates the robustness of classical navigation with the semantic understanding of VLA models to achieve language-driven pick-and-place in unknown large indoor environments, capable of real-time operation on consumer-grade hardware [3]

Review of Existing Solutions: Advantages and Limitations
- The article reviews VLA models and lightweight optimization strategies, highlighting their limited spatial perception and poor adaptability to large environments [4]
- Existing solutions like MoManipVLA and SmolVLA approach the performance of larger models while reducing resource requirements, but lack the spatial awareness needed for large environments [4]
- The limitations of vision-language navigation (VLN) and classical navigation frameworks are outlined, emphasizing the need for improved language understanding and semantic reasoning capabilities [4]

AnywhereVLA Architecture: Four Core Modules and Workflow
- The AnywhereVLA architecture processes natural language commands through four modules and outputs low-level control commands that drive the base wheels and robotic-arm joints [4]
- The workflow comprises language instruction parsing, guiding VLA operations, constructing a 3D semantic map, and executing manipulation on the identified targets (a schematic sketch follows this entry) [7]

VLA Model Fine-tuning and Hardware Platform
- The SmolVLA model is fine-tuned to enhance its manipulation capabilities, with the input data and key optimization steps outlined [13][15]
- The HermesBot mobile manipulation platform was designed specifically for AnywhereVLA, balancing sensing and computational capabilities [16]

Experimental Results: Performance and Effectiveness Validation
- In an unknown multi-room laboratory setting, 50 pick-and-place tasks were executed with an overall success rate of 46%; the fine-tuned SmolVLA manipulation module alone achieved an 85% success rate [17][22]
- Per-module performance metrics indicate robust SLAM performance and varying success rates for active environment exploration, navigation, object detection, and VLA manipulation [22]
- Time-efficiency metrics show an average task completion time under 133 seconds for a 5 m exploration radius, meeting real-time requirements [23]
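A schematic of the four-stage flow described above; every function body is a hypothetical stub standing in for the corresponding module, not the project's actual API.

```python
def parse_instruction(command: str) -> str:
    """1. Language parsing: extract the target object (toy heuristic)."""
    return command.removeprefix("pick up the ").strip()

def build_semantic_map() -> dict:
    """2. Exploration + 3D semantic mapping (stubbed as a name -> pose table)."""
    return {"cup": (1.2, 0.4)}

def navigate_to(sem_map: dict, target: str) -> tuple:
    """3. Classical navigation to the target's mapped pose."""
    return sem_map.get(target, (0.0, 0.0))

def run_vla_policy(target: str, pose: tuple) -> bool:
    """4. Fine-tuned SmolVLA-style pick-and-place at the reached pose."""
    return True

def anywhere_vla_pipeline(command: str) -> bool:
    target = parse_instruction(command)
    pose = navigate_to(build_semantic_map(), target)
    return run_vla_policy(target, pose)

print(anywhere_vla_pipeline("pick up the cup"))  # True
```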
From 300+ Papers: A Look at How VLA Is Applied and Implemented Across Scenarios ...
具身智能之心· 2025-09-25 04:00
Core Insights
- The article discusses the emergence of Vision Language Action (VLA) models, marking a shift in robotics from traditional policy-based control to a more general robotic technology paradigm, enabling active decision-making in complex environments [2][5][20]
- It emphasizes the integration of large language models (LLMs) and vision-language models (VLMs) to enhance robotic operations, providing greater flexibility and precision in task execution [6][12]
- The survey outlines a clear classification system for VLA methods, categorizing them into autoregressive, diffusion, reinforcement learning, hybrid, and specialized methods, while also addressing the unique contributions and challenges within each category (a diffusion-sampling sketch follows this entry) [7][10][22]

Group 1: VLA Model Overview
- VLA models represent a significant advancement in robotics, allowing perception, language understanding, and executable control to be unified within a single modeling framework [15][20]
- The article categorizes VLA methods into five paradigms: autoregressive, diffusion, reinforcement learning, hybrid, and specialized, detailing their design motivations and core strategies [10][22][23]
- The integration of LLMs into VLA systems transforms them from passive input parsers into semantic intermediaries, enhancing their ability to handle long, complex tasks [29][30]

Group 2: Applications and Challenges
- VLA models have practical applications across various robotic forms, including robotic arms, quadrupeds, humanoid robots, and autonomous vehicles, showcasing their deployment in diverse scenarios [8][20]
- The article identifies key challenges in the VLA field, such as data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technology [8][19][20]
- Reliance on high-quality datasets and simulation platforms is crucial for the effective training and evaluation of VLA models, addressing data scarcity and the risks of real-world testing [16][19]

Group 3: Future Directions
- The survey outlines future research directions for VLA, including addressing data limitations, enhancing reasoning speed, and improving safety measures to facilitate the advancement of general embodied intelligence [8][20][21]
- It highlights the importance of developing scalable and efficient VLA models that can adapt to various tasks and environments, emphasizing the need for ongoing innovation in this rapidly evolving field [20][39]
- The article concludes by underscoring the potential of VLA models to bridge perception, understanding, and action, positioning them as a key frontier in embodied artificial intelligence [20][21][39]
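To contrast with the autoregressive tokenization sketched earlier, the diffusion paradigm generates actions by iteratively denoising a Gaussian sample conditioned on VLM features. Below is a generic DDPM-style sampler with an illustrative noise schedule and network, not any surveyed paper's implementation.

```python
import torch
import torch.nn as nn

class DenoiseNet(nn.Module):
    """Predicts the noise added to an action, given conditioning and timestep."""
    def __init__(self, act_dim=7, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, a, cond, t):
        return self.net(torch.cat([a, cond, t], dim=-1))

@torch.no_grad()
def sample_action(model, cond, steps=50, act_dim=7):
    betas = torch.linspace(1e-4, 0.02, steps)  # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(cond.shape[0], act_dim)    # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), i / steps)
        eps = model(a, cond, t)
        # DDPM ancestral update toward the denoised action
        a = (a - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            a = a + torch.sqrt(betas[i]) * torch.randn_like(a)
    return a

cond = torch.randn(2, 512)                 # hypothetical VLM features
actions = sample_action(DenoiseNet(), cond)  # (2, 7) action vectors
```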
In-Depth Survey | 300+ Papers Explain How Pure Vision Pushes VLA to the Summit of Autonomous Driving and Embodied Intelligence!
自动驾驶之心· 2025-09-24 23:33
Core Insights
- The emergence of Vision Language Action (VLA) models signifies a paradigm shift in robotics from traditional policy-based control to general-purpose robotic technology, transforming Vision-Language Models (VLMs) from passive sequence generators into active agents capable of executing operations and making decisions in complex, dynamic environments [1][5][11]

Summary by Sections

Introduction
- Robotics has historically relied on pre-programmed instructions and control policies for task execution, primarily in simple, repetitive tasks [5]
- Recent advancements in AI and deep learning have enabled the integration of perception, detection, tracking, and localization technologies, leading to the development of embodied intelligence and autonomous driving [5]
- Current robots often operate as "isolated agents," lacking effective interaction with humans and external environments, prompting researchers to explore the integration of Large Language Models (LLMs) and VLMs for more precise and flexible robotic operations [5][6]

Background
- The development of VLA models marks a significant step towards general embodied intelligence, unifying visual perception, language understanding, and executable control within a single modeling framework [11][16]
- The evolution of VLA models is supported by breakthroughs in single-modal foundational models across computer vision, natural language processing, and reinforcement learning [13][16]

VLA Models Overview
- VLA models have developed rapidly thanks to advancements in multi-modal representation learning, generative modeling, and reinforcement learning [24]
- The core design of VLA models integrates visual encoding, LLM reasoning, and decision-making frameworks, aiming to bridge the gap between perception, understanding, and action [23][24]

VLA Methodologies
- VLA methods are categorized into five paradigms: autoregressive, diffusion models, reinforcement learning, hybrid methods, and specialized approaches, each with distinct design motivations and core strategies [6][24]
- Autoregressive models focus on sequential generation of actions based on historical context and task instructions, demonstrating scalability and robustness [26][28]

Applications and Resources
- VLA models are applicable across robotic domains including robotic arms, quadrupedal robots, humanoid robots, and wheeled robots (autonomous vehicles) [7]
- The development of VLA models relies heavily on high-quality datasets and simulation platforms to address data scarcity and the high risks of real-world testing [17][21]

Challenges and Future Directions
- Key challenges in the VLA field include data limitations, reasoning speed, and safety concerns, which need to be addressed to accelerate the development of VLA models and general robotic technologies [7][18]
- Future research directions are outlined to enhance the capabilities of VLA models, focusing on improving data diversity, strengthening reasoning mechanisms, and ensuring safety in real-world applications [7][18]

Conclusion
- The review emphasizes the need for a clear classification system for pure VLA methods, highlighting the significant features and innovations of each category, and provides insight into the resources necessary for training and evaluating VLA models [9][24]