OpenVLA
The Robot Open-Source Revolution: The Four Factions and the Power Game Behind the "Free Brain" [Robotics Series]
硅谷101· 2026-03-27 01:19
Why are there so many open-source models in the robotics industry? Is this charity, or do these companies just have money to burn? Why can open-source robot models beat Google, and who is playing what kind of grand game behind the scenes? Around February, Xiaomi, Ant Group, Alibaba DAMO Academy, and Unitree all released open-source robot models. Before that, at CES, NVIDIA unveiled GR00T N1.6, another upgrade to what it bills as "the world's first open humanoid robot foundation model": "We have open-sourced not only the models, but also the data used to train them." Consumer electronics companies, internet giants, and the chip empire have all recently rushed to hand out the robot "brain" for the whole world to use, free of charge. What calculations, and what trillion-dollar bets, lie inside the open-source robot model ecosystem? Hello everyone, welcome to Silicon Valley 101. I'm Chen Qian. In this video we continue our robotics series. In our earlier episode on closed-source robot models, we analyzed the VLA models now common across embodied AI, broke down the differing routes of closed-source giants like Tesla and Figure, and looked at how they use hardware and data advantages to build moats. In this video, after in-depth conversations with researchers at top embodied-AI labs around the world, we dig into the core players on the open-source algorithm route and the key technical leaders behind them. Along the way, we try to answer three questions. First, what technical routes do these open-source models take, and why can they challenge the giants? Second, what is the motivation for open-sourcing? ...
New from Peking University, EvoVLA: Sharply Cutting Robot Hallucination, with Long-Horizon Success Rates Up 10%
具身智能之心· 2025-11-30 03:03
Core Viewpoint - The article discusses the emergence of EvoVLA, a self-evolving Vision-Language-Action model developed by a team from Peking University, which addresses the issue of "stage hallucination" in existing VLA models during long-horizon tasks, significantly improving success rates and reducing hallucination rates [1][5][40]. Group 1: Problem Identification - Embodied AI is on the verge of a breakthrough, but existing VLA models exhibit a critical weakness in long-horizon manipulation tasks, often leading to "cheating" behaviors [2]. - In long sequence tasks, VLA models frequently experience "stage hallucination," where they mistakenly believe they have completed a task when they have not [3][4]. Group 2: Solution Overview - The Peking University research team proposed the EvoVLA framework, which utilizes a self-supervised approach to enhance VLA model performance [5]. - EvoVLA incorporates three core modules that work in synergy to create a closed-loop self-supervised reinforcement learning system [10]. Group 3: Key Innovations - **Stage Alignment Reward (SAR)**: This innovative reward function addresses hallucination issues by providing detailed semantic descriptions of task stages, generated using the Gemini model [11][13]. - **Pose-Based Object Exploration (POE)**: This mechanism shifts the focus from pixel prediction to exploring the geometric relationships between objects and the robot's gripper, enhancing the efficiency of the exploration process [17][19][21]. - **Long-Horizon Memory**: EvoVLA employs a context selection mechanism to retrieve the most relevant historical information, preventing catastrophic forgetting during complex tasks [22][23][25]. Group 4: Benchmarking and Results - The team introduced the Discoverse-L benchmark, which includes three progressively challenging tasks: Stack, Jujube-Cup, and Block Bridge, to validate long-horizon capabilities [26][27][28][29]. 
- EvoVLA achieved an average success rate of 69.2% on the Discoverse-L benchmark, surpassing the previous best model, OpenVLA-OFT, by 10.2% [34]. - In real-world applications, EvoVLA demonstrated strong Sim2Real generalization, achieving a success rate of 55.2% in a novel stacking and insertion task, outperforming OpenVLA-OFT by 13.4% [37]. Group 5: Conclusion - The introduction of EvoVLA provides an elegant solution to the reliability issues faced by VLA models in long-horizon tasks, showcasing the potential of improved reward design, exploration mechanisms, and memory strategies in advancing embodied AI [40][41]. - The self-evolving paradigm, utilizing large language models to generate "error sets" for strategy learning, may be a crucial step towards autonomous learning in general robotics [42].
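The stage-alignment idea above can be sketched in a few lines: only credit a stage when an external check confirms it, and penalize claimed-but-unverified stages. This is an illustrative toy, not EvoVLA's actual SAR (which scores against Gemini-generated semantic stage descriptions); the function name, check interface, and constants are all assumptions.

```python
# Toy sketch of a stage-aligned reward: reward verified stages, penalize
# "stage hallucination" (claiming completion without evidence).
# Illustrative only -- not the authors' implementation.
from typing import Callable, List

def stage_alignment_reward(
    claimed_stage: int,
    stage_checks: List[Callable[[], bool]],
    bonus: float = 1.0,
    hallucination_penalty: float = -0.5,
) -> float:
    reward = 0.0
    for i in range(claimed_stage):
        if stage_checks[i]():
            reward += bonus                   # stage genuinely completed
        else:
            reward += hallucination_penalty   # claimed done, but check fails
    return reward

checks = [lambda: True, lambda: False]  # stage 0 done, stage 1 not
print(stage_alignment_reward(claimed_stage=2, stage_checks=checks))  # 0.5
```

Under this shaping, a policy that races ahead and "believes" a stage is done earns less than one that actually finishes each stage, which is the behavior the SAR design targets.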
VLA²: Zhejiang University and Westlake University Propose an Agentic VLA Framework with Sharply Improved Manipulation Generalization
具身智能之心· 2025-10-24 00:40
Core Insights - The article presents VLA², a framework designed to enhance the capabilities of vision-language-action models, particularly in handling unseen concepts in robotic tasks [1][3][12] Method Overview - VLA² integrates three core modules: initial information processing, cognition and memory, and task execution [3][5] - The framework utilizes GLM-4V for task decomposition, MM-GroundingDINO for object detection, and incorporates web image retrieval for visual memory enhancement [4][7] Experimental Validation - VLA² was compared with state-of-the-art (SOTA) models on the LIBERO Benchmark, showing competitive results, particularly excelling in scenarios requiring strong generalization [6][9] - In hard scenarios, VLA² achieved a 44.2% improvement in success rate over simply fine-tuning OpenVLA [9][10] Key Mechanisms - The framework's performance is significantly influenced by three mechanisms: visual mask injection, semantic replacement, and web retrieval [7][11] - Ablation studies confirmed that each mechanism contributes notably to the model's performance, especially in challenging tasks [11] Conclusion and Future Directions - VLA² successfully expands the cognitive and operational capabilities of VLA models for unknown objects, providing a viable solution for robotic tasks in open-world settings [12] - Future work will focus on exploring its generalization capabilities in real-world applications and expanding support for more tools and tasks [12]
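The agentic flow described above can be sketched with stub functions standing in for GLM-4V task decomposition, MM-GroundingDINO detection, and web image retrieval for unseen concepts. Every helper name and its logic here is an illustrative assumption, not VLA²'s actual code.

```python
# Minimal sketch of a VLA2-style agentic pipeline: decompose -> ground ->
# (retrieve if unseen) -> plan. All helpers are placeholders for the real
# modules named in the article.
from typing import List, Optional, Set, Tuple

def decompose_task(instruction: str) -> List[str]:
    """Stand-in for GLM-4V task decomposition: split into sub-goals."""
    return [s.strip() for s in instruction.split(" then ")]

def ground_object(subgoal: str, known_objects: Set[str]) -> Optional[str]:
    """Stand-in for MM-GroundingDINO: return a detected known object, if any."""
    for obj in known_objects:
        if obj in subgoal:
            return obj
    return None

def retrieve_reference(concept: str) -> str:
    """Stand-in for web image retrieval used to enrich visual memory."""
    return f"reference-image:{concept}"

def run_pipeline(instruction: str, known_objects: Set[str]) -> List[Tuple[str, str]]:
    plan = []
    for sub in decompose_task(instruction):
        target = ground_object(sub, known_objects)
        if target is None:  # unseen concept: fetch a visual reference first
            retrieve_reference(sub)
            target = sub     # proceed using the retrieved visual anchor
        plan.append((sub, target))
    return plan

print(run_pipeline("pick up the toy then place it in the bowl", {"bowl"}))
```

The point of the structure is the fallback path: when grounding fails on an unseen concept, the agent enriches its visual memory before acting instead of guessing.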
RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
具身智能之心· 2025-10-22 06:02
Core Insights - The article presents the RLinf-VLA framework, a unified and efficient framework for training Visual-Language-Action (VLA) models using Reinforcement Learning (RL), addressing the limitations of existing models that rely on supervised fine-tuning [2][53] - The framework significantly enhances training efficiency and generalization capabilities, achieving high success rates in various simulation tasks and demonstrating superior performance in real-world applications compared to traditional supervised methods [5][53] Framework Design - The RLinf-VLA framework integrates multiple simulators, algorithms, and VLA architectures, optimizing resource allocation through flexible execution modes and system-level enhancements [4][53] - It supports three GPU allocation strategies: colocated, disaggregated, and hybrid, allowing users to easily switch modes via configuration files, thus reducing system customization costs [10][11] Model Compatibility - The framework supports LoRA for efficient parameter tuning, reducing memory consumption and accelerating training while maintaining performance [12] - It is compatible with OpenVLA and its extension OpenVLA-OFT, which have shown strong performance in various robotic operation benchmarks [12][22] Multi-Simulator Support - The framework emphasizes the importance of simulators in RL, utilizing ManiSkill and LIBERO as primary simulators to achieve diverse task capabilities [13] - It provides a unified interface for different simulators, facilitating the implementation of various tasks and supporting multiple RL algorithms, initially focusing on PPO and GRPO [13][14] Algorithm Design - The framework incorporates advanced techniques for advantage function and log-probability calculations, allowing for flexible integration of block-level and action-level definitions [14][15] - It supports various optimization strategies, including trajectory length normalization and effective action masking, to enhance training stability 
and performance [19][20] Experimental Results - The RLinf-VLA framework demonstrated significant performance improvements, with success rates increasing by 45% to 70% in various tasks compared to baseline models [22][24] - In LIBERO tasks, the framework achieved an average success rate of 98.11%, showcasing its capability for large-scale multi-task reinforcement learning [28] High Efficiency Performance - The framework's efficiency is evaluated based on throughput, achieving substantial improvements in training speed across different GPU configurations [30][35] - The hybrid allocation mode outperformed traditional methods, demonstrating the benefits of pipeline overlapping in resource utilization [35][37] Real-World Deployment - The RLinf-VLA framework was successfully deployed in real-world environments, showing superior zero-shot generalization capabilities compared to supervised fine-tuning strategies [51][53] - The experiments indicated that RL-trained models could adapt better to real-world tasks, achieving higher success rates in object manipulation tasks [51] Conclusion - The RLinf-VLA framework represents a significant advancement in the field of embodied intelligence, providing a robust foundation for future research and development in VLA training [53]
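The three GPU allocation strategies mentioned above can be illustrated with a small config-driven helper. The mode names follow the article, but the partitioning logic and dictionary keys are assumptions for illustration, not RLinf-VLA's actual scheduler or config schema.

```python
# Config-driven sketch of the three allocation modes: colocated (rollout and
# training share all GPUs), disaggregated (pool split between stages), and
# hybrid (training overlaps part of the rollout pool). Illustrative only.
from typing import Dict, List

def allocate(mode: str, n_gpus: int) -> Dict[str, List[int]]:
    gpus = list(range(n_gpus))
    if mode == "colocated":      # both stages share every GPU
        return {"rollout": gpus, "train": gpus}
    if mode == "disaggregated":  # split the pool between the two stages
        half = n_gpus // 2
        return {"rollout": gpus[:half], "train": gpus[half:]}
    if mode == "hybrid":         # overlap: training shares part of the rollout pool
        return {"rollout": gpus, "train": gpus[n_gpus // 2:]}
    raise ValueError(f"unknown mode: {mode}")

print(allocate("disaggregated", 8))
```

Switching modes by changing one config string, rather than rewriting the system, is the customization-cost reduction the article describes.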
RLinf-VLA: A Unified, Efficient VLA+RL Training Platform!
具身智能之心· 2025-10-13 00:02
Core Insights - The article discusses the launch of RLinf, a large-scale reinforcement learning framework aimed at embodied intelligence, highlighting its flexibility and efficiency in system design [2][3]. Group 1: System Design - RLinf-VLA provides a unified and efficient platform for VLA+RL research, achieving a throughput improvement of 2.27 times compared to baseline platforms [2][5]. - It supports multiple simulators (LIBERO and ManiSkill), allowing for integrated training across different environments [5]. - The system allows for easy switching between various VLA models and RL algorithms, reducing the workload for model adaptation [5]. Group 2: Performance Overview - A single unified model achieved a success rate of 98.11% across 130 tasks in LIBERO and 97.66% in 25 pick & place tasks in ManiSkill [6]. - The RLinf-VLA framework demonstrates superior zero-shot generalization capabilities when deployed on real robotic systems compared to strategies trained with SFT [6][45]. Group 3: Algorithm Design - The framework introduces several design optimizations, including lightweight critics and trajectory length normalization, which significantly enhance training efficiency [9][21][25]. - It supports three levels of output granularity (token-level, action-level, chunk-level) for both advantage and log-probability calculations, allowing for flexible training strategies [12][14][22]. Group 4: Experimental Results - In multi-task experiments, the OpenVLA model showed performance improvements of 45% to 70% over baseline models in ManiSkill tasks [31]. - The RLinf-VLA framework demonstrated high efficiency in training, with significant reductions in training time compared to baseline methods [43][44]. Group 5: Real-World Application - The RLinf-VLA framework was successfully deployed on the Franka Panda robotic arm, showcasing its ability to generalize from simulation to real-world tasks [45].
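The three output granularities (token-level, action-level, chunk-level) for log-probability aggregation can be sketched as follows. This is a simplified illustration of the idea under my own assumptions about the data layout, not RLinf-VLA's implementation.

```python
# Sketch of the three log-probability granularities. Each inner list holds
# per-token log-probs for one action; the chosen level changes what counts
# as a single "unit" in the RL objective. Illustrative toy only.
from typing import List

def aggregate_logprobs(per_action_token_logprobs: List[List[float]], level: str) -> List[float]:
    if level == "token":   # finest granularity: one entry per token
        return [lp for action in per_action_token_logprobs for lp in action]
    if level == "action":  # sum tokens within each action
        return [sum(action) for action in per_action_token_logprobs]
    if level == "chunk":   # coarsest: one entry for the whole action chunk
        return [sum(sum(action) for action in per_action_token_logprobs)]
    raise ValueError(f"unknown level: {level}")

chunk = [[-1.0, -2.0], [-3.0]]  # two actions: 2 tokens, then 1 token
print(aggregate_logprobs(chunk, "action"))  # [-3.0, -3.0]
```

Coarser granularity means fewer, larger credit-assignment units, which is why being able to switch levels per experiment matters for training stability.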
Without an Advisor, How Quickly Can You Produce a Paper in the Embodied-AI Field?
具身智能之心· 2025-09-28 07:00
Core Insights - The article emphasizes the importance of building a solid foundation in research before diving into complex topics like VLA (Vision-Language-Action) in embodied intelligence [1][6] - VLA is highlighted as a transformative model that allows robots to perform tasks based on language instructions, breaking the limitations of traditional single-task training [4][7] - The article discusses the rapid development of the embodied intelligence sector, with various teams transitioning from research to commercialization, and major tech companies actively investing in this field [6] Summary by Sections VLA Overview - VLA enables robots to autonomously make decisions in diverse environments, significantly enhancing their adaptability and application across industries such as manufacturing and logistics [4][6] - The model has become a research hotspot, fostering collaboration between academia and industry through various projects like pi0, RT-2, and OpenVLA [4][7] Industry Development - The embodied intelligence field is experiencing robust growth, with companies like Unitree, Zhiyuan, and major tech players like Huawei and Tencent making significant strides [6] - There is a growing interest in VLA-related research, with many seeking guidance to quickly enter or transition within this domain [6] Course Offerings - A specialized course on VLA research is introduced, focusing on the theoretical and practical aspects of embodied intelligence, including simulation environment setup and experimental design [10][12] - The course aims to cultivate independent research capabilities, guiding students from idea generation to the completion of a research paper [12][17] Learning Outcomes - Participants will gain comprehensive knowledge of VLA models, practical experience in simulation, and skills in academic writing and research methodology [17] - The course is designed to help students identify research opportunities and navigate the complexities of the embodied 
intelligence landscape [12][16]
VLA Papers Now Make Up Nearly Half of the Embodied-AI Field......
具身智能之心· 2025-09-18 04:00
Core Insights - The article emphasizes the significance of Vision-Language-Action (VLA) models in the field of embodied intelligence, highlighting their ability to enable robots to autonomously make decisions in diverse environments, thus breaking the limitations of traditional single-task training methods [1][4]. Industry Development - The embodied intelligence sector is experiencing rapid growth, with teams like Unitree, Zhiyuan, Xinghaitu, and Galaxy General transitioning from laboratory research to commercialization, while major tech companies such as Huawei, JD, and Tencent, along with international firms like Tesla and Figure AI, are actively investing in the field [3]. Research Opportunities - VLA is identified as a current research hotspot with many unresolved issues, making it a promising area for academic papers. The article mentions the establishment of a specialized VLA research guidance course aimed at helping individuals quickly enter or transition within this field [3][4]. Course Content and Structure - The course focuses on how agents interact effectively with the physical world through a perception-cognition-action loop, covering the evolution of VLA technology from early grasp pose detection to recent models like Diffusion Policy and multimodal foundational models [7][8]. - It addresses core challenges in embodied intelligence, such as cross-domain generalization and long-term planning, and explores how to integrate large language models with robotic control systems [8]. Learning Outcomes - Upon completion, participants are expected to master the theoretical foundations and technical evolution of VLA models, gain proficiency in simulation environments, and develop independent research capabilities [14]. - The course aims to guide students from idea generation to the completion of a high-quality academic paper, ensuring they can identify research opportunities and design effective experiments [10][14].
Everyone Is Piling into VLA; Here Are Some Reference Directions......
具身智能之心· 2025-09-15 10:00
Core Insights - The Vision-Language-Action (VLA) model represents a new paradigm in embodied intelligence, enabling robots to generate executable actions from language instructions and visual signals, thus enhancing their adaptability to complex environments [1][3]. - VLA breaks the traditional single-task limitations, allowing robots to make autonomous decisions in diverse scenarios, which is applicable in manufacturing, logistics, and home services [3]. - The VLA model has become a research hotspot, driving collaboration between academia and industry, with various cutting-edge projects like pi0, RT-2, OpenVLA, QUAR-VLA, and HumanVLA emerging [3][5]. Industry Development - The embodied intelligence sector is experiencing robust growth, with teams like Unitree, Zhiyuan, Xinghaitu, Galaxy General, and Zhujidongli transitioning from laboratories to commercialization [5]. - Major tech companies such as Huawei, JD.com, and Tencent are actively investing in this field, alongside international firms like Tesla and Figure AI [5]. Educational Initiatives - A specialized VLA research guidance course has been launched to assist students in quickly entering or transitioning into the VLA research area, addressing the complexity of the related systems and frameworks [5]. - The course focuses on the perception-cognition-action loop, providing a comprehensive understanding of VLA's theoretical foundations and practical applications [7][8]. Course Structure and Outcomes - The curriculum covers the entire research process, from theoretical foundations to experimental design and paper writing, ensuring students develop independent research capabilities [15]. - Students will learn to identify research opportunities, analyze unresolved challenges in the field, and receive personalized guidance tailored to their backgrounds and interests [15]. 
- The course aims to help students produce a complete research idea and a preliminary experimental validation, culminating in a draft of a high-quality academic paper [15][18].
After My Advisor Pointed Me to VLA as a Research Direction......
具身智能之心· 2025-09-10 11:00
Group 1 - VLA (Vision-Language-Action) model represents a new paradigm in embodied intelligence, enabling robots to generate executable actions from language instructions and visual signals, thus enhancing their understanding and adaptability in complex environments [1][3] - The VLA model breaks the limitations of traditional single-task training, allowing robots to make autonomous decisions in diverse scenarios, which is applicable in manufacturing, logistics, and home services [3][5] - The VLA model has become a research hotspot, driving the development of several cutting-edge projects such as pi0, RT-2, OpenVLA, QUAR-VLA, and HumanVLA, fostering collaboration between academia and industry [3][5] Group 2 - The embodied intelligence sector is experiencing rapid growth, with teams like Unitree, Zhiyuan, Xinghaitu, and Galaxy General transitioning from laboratories to commercialization, while tech giants like Huawei, JD.com, and Tencent are actively investing in this field [5] - The course on VLA research aims to equip students with comprehensive skills in academic research, including theoretical foundations, experimental design, and paper writing, focusing on independent research capabilities [13][15] - The curriculum emphasizes identifying research opportunities and innovative points, guiding students to develop their research ideas and complete preliminary experiments [14][15] Group 3 - The course covers the technical evolution of the VLA paradigm, from early grasp pose detection to recent advancements like Diffusion Policy and multimodal foundational models, focusing on end-to-end mapping from visual input and language instructions to robotic actions [8][9] - Core challenges in embodied intelligence, such as cross-domain generalization and long-term planning, are analyzed, along with strategies to combine large language model reasoning with robotic control systems [9] - The course aims to help students master the latest research methods and technical frameworks in embodied intelligence, addressing limitations and advancing towards true general robotic intelligence [9][15]
A New Paradigm for Robot Manipulation: A Systematic Survey of VLA Models | Jinqiu Select
锦秋集· 2025-09-02 13:41
Core Insights - The article discusses the emergence of Vision-Language-Action (VLA) models based on large Vision-Language Models (VLMs) as a transformative paradigm in robotic manipulation, addressing the limitations of traditional methods in unstructured environments [1][4][5] - It highlights the need for a structured classification framework to mitigate research fragmentation in the rapidly evolving VLA field [2] Group 1: New Paradigm in Robotic Manipulation - Robotic manipulation is a core challenge at the intersection of robotics and embodied AI, requiring deep understanding of visual and semantic cues in complex environments [4] - Traditional methods rely on predefined control strategies, which struggle in unstructured real-world scenarios, revealing limitations in scalability and generalization [4][5] - The advent of large VLMs has provided a revolutionary approach, enabling robots to interpret high-level human instructions and generalize to unseen objects and scenes [5][10] Group 2: VLA Model Definition and Classification - VLA models are defined as systems that utilize a large VLM to understand visual observations and natural language instructions, followed by a reasoning process that generates robotic actions [6][7] - VLA models are categorized into two main types: Monolithic Models and Hierarchical Models, each with distinct architectures and functionalities [7][8] Group 3: Monolithic Models - Monolithic VLA models can be implemented in single-system or dual-system architectures, integrating perception and action generation into a unified framework [14][15] - Single-system models process all modalities together, while dual-system models separate reflective reasoning from reactive behavior, enhancing efficiency [15][16] Group 4: Hierarchical Models - Hierarchical models consist of a planner and a policy, allowing for independent operation and modular design, which enhances flexibility in task execution [43] - These models can be further divided into 
Planner-Only and Planner+Policy categories, with the former focusing solely on planning and the latter integrating action execution [43][44] Group 5: Advancements in VLA Models - Recent advancements in VLA models include enhancements in perception modalities, such as 3D and 4D perception, as well as the integration of tactile and auditory information [22][23][24] - Efforts to improve reasoning capabilities and generalization abilities are crucial for enabling VLA models to perform complex tasks in diverse environments [25][26] Group 6: Performance Optimization - Performance optimization in VLA models focuses on enhancing inference efficiency through architectural adjustments, parameter optimization, and inference acceleration techniques [28][29][30] - Dual-system models have emerged to balance deep reasoning with real-time action generation, facilitating smoother deployment in real-world scenarios [35] Group 7: Future Directions - Future research directions include the integration of memory mechanisms, 4D perception, efficient adaptation, and multi-agent collaboration to further enhance VLA model capabilities [1][6]
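The Planner+Policy split the survey describes can be sketched as two cooperating components: a planner that turns an instruction into sub-goals, and a policy that maps each sub-goal plus an observation to a low-level action. The interfaces below are illustrative assumptions, not any particular model's API.

```python
# Toy sketch of a hierarchical Planner+Policy architecture. The modular
# boundary (plan() vs act()) is the design point: either side can be
# swapped independently. All names and logic are illustrative.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Action:
    verb: str
    target: str

class Planner:
    def plan(self, instruction: str) -> List[str]:
        """Stand-in for a VLM planner: decompose into ordered sub-goals."""
        return [s.strip() for s in instruction.split(",")]

class Policy:
    def act(self, subgoal: str, observation: Dict) -> Action:
        """Stand-in for a low-level policy conditioned on a sub-goal."""
        verb, _, target = subgoal.partition(" ")
        return Action(verb=verb, target=target or "unknown")

def hierarchical_step(instruction: str, observation: Dict) -> List[Action]:
    planner, policy = Planner(), Policy()
    return [policy.act(g, observation) for g in planner.plan(instruction)]

print(hierarchical_step("grasp cup, place saucer", {}))
```

A Planner-Only model, in the survey's taxonomy, would keep `plan()` and delegate `act()` to an external controller; a monolithic model would collapse both into one network.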