视觉 - 语言 - 动作(VLA)模型
Search documents
华科&清华最新DeepThinkVLA:如何让模型 “会思考、能落地”?
具身智能之心· 2025-11-24 10:02
Core Insights - The article presents DeepThinkVLA, a new model that addresses the challenges in the visual-language-action (VLA) domain by integrating a mixed attention decoder and a two-stage training pipeline, achieving a task success rate of 97.0% on the LIBERO benchmark, setting a new performance standard for VLA models [2][14]. Group 1: Model Architecture - DeepThinkVLA resolves the "modal conflict" between reasoning and action by employing a mixed attention mechanism that allows for efficient processing of both modalities within a single decoder [4][10]. - The model features a dynamic switching mechanism between causal attention for reasoning generation and bidirectional attention for action generation, significantly reducing inference latency and enhancing performance [4][10]. Group 2: Training Methodology - The training process consists of a two-stage pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), which enhances the model's reasoning capabilities while ensuring effective action execution [6][8]. - The SFT phase focuses on building foundational reasoning skills through a carefully designed data augmentation pipeline, resulting in a dataset of 273,465 annotated frames [10][12]. Group 3: Innovations and Mechanisms - Two key innovations are highlighted: the probabilistic decomposition of reasoning and action, and an error recovery mechanism that allows the model to self-correct during execution [10][11]. - The reward design incorporates task-success rewards and format regularization rewards, focusing on the final success of tasks while minimizing interference from intermediate reasoning semantics [11][12]. Group 4: Performance Evaluation - DeepThinkVLA outperforms existing models across various tasks, achieving an average success rate of 97.0%, with specific task success rates of 99.0% for Object tasks and 96.4% for Goal tasks [14][15]. - The model demonstrates superior robustness compared to top autoregressive models, showcasing its effectiveness in complex robotic operations [15][16]. Group 5: Future Directions - Future enhancements may include integrating additional sensory data, expanding to more complex collaborative tasks, optimizing efficiency, and constructing larger datasets to improve model generalization [23][24].
真机RL!最强VLA模型π*0.6来了,机器人在办公室开起咖啡厅
机器之心· 2025-11-18 03:30
完全使用真实世界数据训练的具身智能,具备什么级别的能力? 机器之心报道 编辑:泽南、冷猫 新方法大幅提升了具身智能的成功率、处理效率。 本周,美国具身智能创业公司 Physical Intelligence(简称 PI 或 π)发布了旗下的 最新机器人 基础模型 π*0.6 。 PI 是一家总部位于旧金山的机器人与 AI 创业公司,其使命是将通用人工智能从数字世界带入物理世界:他们的首个机器人通用基础模型名为 π₀,让同一套软件控 制多种物理平台执行各类任务。 在 2024 年,PI 获得超过 4 亿美元融资,估值突破 20 亿美元,成为具身智能赛道最受瞩目的玩家之一。 PI 的技术路线强调 「视觉 - 语言 - 动作」(VLA)模型,通过大规模机器人感知与动作数据训练出具备泛化能力的策略,使机器人不再局限于预设动作,而能在 未知环境中灵活执行。 机器学习与决策控制领域的知名专家、UC Berkeley 副教授、Physical Intelligence 联合创始人 Sergey Levine 表示,搭载这个模型的机器人已经可以在公司的办公室 里为人们制作拿铁、美式和意式咖啡了。 Sergey Levine ...
Dexmal原力灵机发布实时VLA模型!消费级显卡上完成pi0模型30Hz以上推理
具身智能之心· 2025-11-04 00:05
Core Insights - The article discusses the development of a real-time visual-language-action (VLA) model that achieves a significant reduction in inference time, enabling dynamic tasks such as object grasping to be performed effectively [3][6][23]. Optimization Strategies - The research outlines a comprehensive optimization pipeline that reduces inference time from over 100ms to 27.3ms for a two-view model, achieved through four main steps: eliminating basic overhead, simplifying the computation graph, optimizing kernel depth, and tuning GEMM parameters [7][18][22]. - The first step involves removing CPU overhead by utilizing CUDA Graphs, which reduces inference time from 106.5ms to approximately 53.9ms [9][10]. - The second step simplifies the computation graph, further reducing inference time to about 45.8ms [12][14]. - The third step focuses on optimizing kernel depth, which includes techniques like weight folding and merging operations to enhance performance [15][18]. Performance Validation - The article employs the roofline model to assess the theoretical lower bound of performance, indicating that the actual inference time of 27.3ms is only 30% higher than the theoretical limit of 20.6ms, suggesting that the optimizations are close to hardware limits [20][22]. - The synchronization overhead is also analyzed, showing significant reductions when using optimized methods compared to naive implementations [21][24]. Real-World Application - A real-world experiment involving the grasping of a falling pen demonstrates the model's effectiveness, achieving a 100% success rate in trials, which highlights the model's capability to meet stringent timing constraints [36][37]. - The framework allows for high-frequency control, with the potential to run 30 VLA models at 30Hz and 480 action experts at 480Hz, showcasing its applicability in dynamic robotic tasks [31][32]. Future Directions - The article suggests future research directions, including exploring larger model sizes and finer-grained feedback loops to enhance performance and adaptability in real-time applications [37].
智源研究院开源单图高精度6D位姿估计方法
Bei Jing Shang Bao· 2025-10-27 13:04
Core Insights - The core viewpoint of the article is the announcement by the Beijing Academy of Artificial Intelligence (BAAI) regarding the open-source release of a high-precision 6D pose estimation method that enables robots to understand unfamiliar objects with a single image [1] Group 1: Technological Advancements - Traditional 6D pose estimation methods rely heavily on high-quality CAD models or multi-view reconstruction, which are inadequate for dynamic and real-time applications [1] - Existing single-image inference methods are limited by scale, appearance, and pose ambiguity, hindering their effectiveness in precise operational scenarios [1] Group 2: New Methodology - The new method, named OnePoseViaGen, does not require pre-set 3D models and relies solely on a single RGBD reference image to achieve high-precision 6D pose estimation on unknown objects [1] - The related paper titled "One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation" has been selected for oral presentation at CoRL 2025 [1]
港科大最新!超越人类示范:基于扩散的强化学习为VLA训练生成 “高质量、低方差“ 数据
具身智能之心· 2025-10-23 04:00
Core Insights - The article discusses the limitations of traditional human demonstration data in training Visual-Language-Action (VLA) models and introduces a novel diffusion-based reinforcement learning (RL) approach to generate high-quality training data [2][5]. Group 1: VLA Model and Data Generation - VLA models integrate visual, language, and action information, but their performance is often constrained by the quality and scale of manually collected data [5]. - The proposed diffusion RL algorithm offers a semi-automated method for high-quality data collection suitable for VLA training, enhancing model performance [5]. Group 2: Methodology and Results - The study presents an improved diffusion strategy optimization algorithm that generates high-quality, low-variance trajectories for VLA training [2]. - Evaluation on the LIBERO benchmark, which includes 130 long-horizon tasks, shows that the generated trajectories are smoother and more consistent than human demonstration data and outperform standard Gaussian RL-generated trajectories [2]. - Training VLA models solely on data generated by diffusion RL achieves an average success rate of 81.9%, which is a 5.3 percentage point improvement over human data and a 12.6 percentage point improvement over Gaussian RL data [2]. Group 3: Key Highlights - The article emphasizes the potential of RL-driven robot trajectory generation and the adaptability of the general RL framework to any VLA architecture [6]. - It highlights the performance breakthroughs that exceed human demonstrations, showcasing the effectiveness of the proposed approach [6].
RLINF-VLA:一种用于 VLA+RL 训练的统一高效框架
具身智能之心· 2025-10-22 06:02
Core Insights - The article presents the RLinf-VLA framework, a unified and efficient framework for training Visual-Language-Action (VLA) models using Reinforcement Learning (RL), addressing the limitations of existing models that rely on supervised fine-tuning [2][53] - The framework significantly enhances training efficiency and generalization capabilities, achieving high success rates in various simulation tasks and demonstrating superior performance in real-world applications compared to traditional supervised methods [5][53] Framework Design - The RLinf-VLA framework integrates multiple simulators, algorithms, and VLA architectures, optimizing resource allocation through flexible execution modes and system-level enhancements [4][53] - It supports three GPU allocation strategies: colocated, disaggregated, and hybrid, allowing users to easily switch modes via configuration files, thus reducing system customization costs [10][11] Model Compatibility - The framework supports LoRA for efficient parameter tuning, reducing memory consumption and accelerating training while maintaining performance [12] - It is compatible with OpenVLA and its extension OpenVLA-OFT, which have shown strong performance in various robotic operation benchmarks [12][22] Multi-Simulator Support - The framework emphasizes the importance of simulators in RL, utilizing ManiSkill and LIBERO as primary simulators to achieve diverse task capabilities [13] - It provides a unified interface for different simulators, facilitating the implementation of various tasks and supporting multiple RL algorithms, initially focusing on PPO and GRPO [13][14] Algorithm Design - The framework incorporates advanced techniques for advantage function and log-probability calculations, allowing for flexible integration of block-level and action-level definitions [14][15] - It supports various optimization strategies, including trajectory length normalization and effective action masking, to enhance training stability and performance [19][20] Experimental Results - The RLinf-VLA framework demonstrated significant performance improvements, with success rates increasing by 45% to 70% in various tasks compared to baseline models [22][24] - In LIBERO tasks, the framework achieved an average success rate of 98.11%, showcasing its capability for large-scale multi-task reinforcement learning [28] High Efficiency Performance - The framework's efficiency is evaluated based on throughput, achieving substantial improvements in training speed across different GPU configurations [30][35] - The hybrid allocation mode outperformed traditional methods, demonstrating the benefits of pipeline overlapping in resource utilization [35][37] Real-World Deployment - The RLinf-VLA framework was successfully deployed in real-world environments, showing superior zero-shot generalization capabilities compared to supervised fine-tuning strategies [51][53] - The experiments indicated that RL-trained models could adapt better to real-world tasks, achieving higher success rates in object manipulation tasks [51] Conclusion - The RLinf-VLA framework represents a significant advancement in the field of embodied intelligence, providing a robust foundation for future research and development in VLA training [53]
会自检的VLA!ReflectDrive:更安全更高效scaling的端到端框架(理想&清华)
自动驾驶之心· 2025-09-27 23:33
Core Viewpoint - ReflectDrive is a novel learning framework that integrates a reflective mechanism to achieve safe trajectory generation through discrete diffusion, addressing the challenges in end-to-end autonomous driving systems [4][46]. Group 1: Introduction and Background - Autonomous driving is leading the transportation industry towards a safer and more efficient future, with end-to-end (E2E) systems becoming a mainstream alternative to traditional modular designs [4]. - Visual-Language-Action (VLA) models combine pre-trained knowledge from visual-language models (VLM) to enhance adaptability in complex scenarios [4][5]. - Current learning-based methods have not resolved core challenges in imitation learning driving systems, particularly in encoding physical rules like collision avoidance [4][5]. Group 2: ReflectDrive Framework - ReflectDrive proposes a new learning framework that utilizes a discrete diffusion reflective mechanism for safe trajectory generation [3][12]. - The framework begins by discretizing the two-dimensional driving space to construct an action codebook, allowing fine-tuning of pre-trained diffusion language models for planning tasks [3][14]. - The reflective mechanism operates without gradient calculations, enabling iterative self-correction inspired by spatiotemporal joint planning [3][8]. Group 3: Methodology and Mechanism - The reflective inference process consists of two stages: target condition trajectory generation and safety-guided regeneration [20][25]. - The framework integrates safety metrics to evaluate generated multimodal trajectories, identifying unsafe path points through local search methods [8][25]. - The iterative optimization loop continues until the trajectory is deemed safe or computational limits are reached, ensuring high efficiency in real-time performance [31][32]. Group 4: Experimental Results - ReflectDrive was evaluated on the NAVSIM benchmark, demonstrating significant improvements in safety metrics such as collision rates and compliance with drivable areas [32][38]. - The introduction of the safety-guided regeneration mechanism led to substantial enhancements in safety indicators, with notable increases in DAC (3.9%), TTC (1.3%), NC (0.8%), and EP (7.9%) compared to the baseline [37][38]. - When using ground-truth agent information, ReflectDrive's performance approached human driving levels, achieving NC of 99.7% and DAC of 99.5% [38][39]. Group 5: Conclusion - ReflectDrive effectively integrates a reflective mechanism with discrete diffusion for safe trajectory generation, validated by its performance on the NAVSIM benchmark [46].
当机器人学会 “模仿” 人类:RynnVLA-001 如何突破操作数据稀缺困境?
具身智能之心· 2025-09-22 00:03
Core Insights - The article discusses the development of a new VLA model, RynnVLA-001, by Alibaba DAMO Academy, which addresses the scarcity of high-quality operational data in robot manipulation by utilizing human demonstration data [1][5][35]. Group 1: Model Overview - RynnVLA-001 leverages 12 million ego-centered human operation videos and employs a two-stage pre-training strategy to teach robots human operational logic and action trajectories [1][2][5]. - The model achieves an average success rate of 90.6% in various tasks, significantly outperforming existing models like GR00T N1.5 and Pi0, which have lower success rates [2][15]. Group 2: Methodology - The training process consists of three core stages: ego-centered video generation pre-training, human-centered trajectory perception modeling, and robot-centric visual-language-action modeling [7][10][11]. - The introduction of ActionVAE optimizes action representation by compressing action sequences into compact latent embeddings, enhancing the model's ability to predict smooth and coherent actions [6][13][24]. Group 3: Experimental Results - RynnVLA-001 demonstrates superior performance across multiple tasks, achieving success rates of 90.0% for picking and placing green blocks, 91.7% for strawberries, and 90.0% for pen placement [15][17]. - In complex scenarios with distractors, RynnVLA-001 maintains a high success rate of 91.7%, showcasing its robustness in instruction-following tasks [18][19]. Group 4: Pre-training Effectiveness - The two-stage pre-training process is validated through ablation studies, showing that models without video pre-training perform poorly, while those with it exhibit significant improvements in task success rates [19][20][21]. - The model's ability to predict human trajectories effectively bridges the gap between visual prediction and action generation, leading to enhanced performance [21][22]. Group 5: Limitations and Future Directions - Current testing is limited to the LeRobot SO100 robotic arm, indicating a need for broader applicability across different robotic platforms [41]. - Future work should focus on improving environmental generalization and exploring dynamic camera perspectives to enhance robustness [41].
TrajBooster:首个全身人行操作VLA方案,跨构型解决数据难题(代码全开源)
具身智能之心· 2025-09-18 00:03
Core Insights - The article discusses the TrajBooster framework, which aims to enhance the capabilities of humanoid robots by utilizing a trajectory-centric learning approach, enabling them to perform complex household tasks with minimal training data [2][40]. Group 1: Research Background and Challenges - The development of humanoid robots faces two main challenges: the unique difficulties of maintaining dynamic balance while performing upper body tasks, and the scarcity of high-quality training data necessary for effective VLA model training [3][4]. - Existing methods rely on expensive equipment and expert operators, resulting in limited data sets that do not adequately cover the diverse action spaces required for humanoid robots [4]. Group 2: TrajBooster Framework - TrajBooster utilizes a three-step process: real trajectory extraction, simulation redirection, and dual-stage fine-tuning, allowing for the conversion of extensive wheeled robot data into effective training resources for bipedal robots [5][40]. - The framework significantly reduces the dependency on costly data from similar robot types, enabling zero-shot skill transfer and improving the robustness and generalization of the VLA models [2][5]. Group 3: Methodology - The framework begins with extracting real trajectories from the Agibot-World Beta dataset, which contains over 1 million real robot trajectories, and then maps this data to the Unitree G1 robot's operational space [7][9]. - A hierarchical composite model is employed to decouple control into upper and lower body systems, enhancing the efficiency of whole-body manipulation [11][12]. Group 4: Experimental Results - TrajBooster demonstrated superior performance in various tasks, achieving the lowest position error (2.851 cm) and rotation error (6.231 degrees) in mobile scenarios, validating the advantages of hierarchical training and coordinated online DAgger [27]. - The framework's ability to adapt to unseen tasks was evidenced by its success in a "water transfer" task, which was not included in the training data, showcasing improved generalization capabilities [39][40]. Group 5: Limitations and Future Directions - The current implementation is limited by the precision of the Unitree Dex-3 hand, which only supports simple grasping tasks; future work will focus on integrating dexterous hands with tactile sensing for more complex manipulations [41]. - There is a need to address the visual input discrepancies and expand the framework to include mobile manipulation data, as the current research is primarily focused on static tasks [43][44].
SimpleVLA-RL:突破 VLA 模型训练瓶颈,RL实现端到端在线训练
具身智能之心· 2025-09-15 00:04
Core Insights - The article discusses the development of the SimpleVLA-RL framework, which enhances the training of Visual-Language-Action (VLA) models in robotics through reinforcement learning (RL) techniques, addressing the limitations of traditional supervised fine-tuning (SFT) methods [2][4][30] Group 1: Research Background and Challenges - VLA models are crucial for integrating visual perception, language understanding, and action generation in robotic control, but current training methods face significant challenges, including data scarcity and weak generalization capabilities [2][5] - The breakthrough in large reasoning models suggests that RL can improve the sequential action planning capabilities of VLA models, but traditional RL methods are limited by manual reward design and the high cost of environmental interactions [2][5] Group 2: Contributions of SimpleVLA-RL - SimpleVLA-RL is designed specifically for VLA, incorporating interactive trajectory sampling and multi-environment parallel rendering, which significantly reduces training costs and improves scalability [6][9] - The framework has achieved state-of-the-art (SOTA) performance across multiple benchmarks, with notable improvements in success rates, such as LIBERO's average success rate increasing from 91.0% to 99.1% [6][12] - SimpleVLA-RL demonstrates enhanced data efficiency, achieving a LIBERO average success rate of 96.9% with only one demonstration trajectory, surpassing traditional methods [16][17] Group 3: Generalization and Real-World Application - The framework shows robust generalization capabilities across unseen tasks, with significant performance improvements in various scenarios, indicating its ability to learn universal skills rather than overfitting to specific data [22][30] - SimpleVLA-RL has proven effective in sim-to-real transfer, with real-world task success rates improving from 17.5% to 38.5%, validating its deployment capabilities [7][21] Group 4: Key Discoveries - The framework has led to the discovery of the "Pushcut" phenomenon, where the RL-trained model autonomously develops more efficient strategies beyond human demonstrations, showcasing the potential for innovative robotic behaviors [24][30] - The effectiveness of SimpleVLA-RL is contingent on the initial model capabilities, with significant performance enhancements observed when starting from a higher baseline success rate [28][29]