Vision-Language-Action (VLA) Models
Targeting a core bottleneck of embodied intelligence, 千寻智能's Gao Yang team proposes Point-VLA: the first to use visual grounding for precise execution of language instructions
机器之心· 2026-03-31 02:59
Core Insights
- The article discusses the limitations of traditional Vision-Language-Action (VLA) models in accurately interpreting complex spatial instructions and proposes a new method, Point-VLA, to overcome these challenges [5][27].

Group 1: Limitations of Traditional VLA Models
- Language often fails to express certain spatial scenarios accurately, leading to ambiguity in communication [6][8].
- Even when detailed descriptions are provided, VLA models struggle to generalize and execute complex spatial commands, resulting in low success rates [7][20].
- Advanced Vision-Language Models (VLMs) can reach 60-70% accuracy in locating targets from complex text descriptions, but text-only VLA models succeed only around 25% of the time [14][9].

Group 2: Introduction of Point-VLA
- Point-VLA introduces visually grounded instructions by overlaying bounding boxes on images, allowing robots to understand commands more intuitively, much like a human pointing [10][11].
- The method combines high-level intent expressed in language with precise spatial information encoded visually, improving the model's performance [12][15].

Group 3: Experimental Results
- Point-VLA achieved an average success rate of 92.5% across a range of challenging tasks, far outperforming the 32.4% of traditional text-only VLA models [20][19].
- On specific tasks such as cluttered-scene grasping, Point-VLA's success rate improved from 43.3% to 94.3%, demonstrating its effectiveness in real-world applications [20][23].

Group 4: Data Annotation and Scalability
- An automated data-annotation pipeline efficiently generates visual grounding signals, reducing the cost of acquiring training data [18][27].
- As training data grows, Point-VLA's performance continues to improve, while traditional text-only VLA models hit a performance plateau [25][30].

Group 5: Implications for Future Development
- Point-VLA addresses a fundamental issue in the VLA field by bypassing the limits of language expression, paving the way for new advances in VLA models [27].
- Its demonstrated capabilities provide a technical foundation for practical applications in industrial and service settings, highlighting the effectiveness of human-like interaction in human-robot collaboration [27][29].
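Group 2's visually grounded instruction amounts to drawing a bounding box directly onto the observation before it reaches the policy, pairing the marked image with a short language command. A minimal sketch of that annotation step on a toy 2-D grid in place of an RGB image (the function name and marker value are illustrative, not Point-VLA's actual pipeline):

```python
def overlay_bbox(image, bbox, value=255):
    """Draw the border of bbox = (x0, y0, x1, y1) onto a 2-D image grid.

    The annotated image, plus a short command like "pick up the object in
    the box", forms a visually grounded instruction (toy stand-in for the
    annotation described in the article).
    """
    x0, y0, x1, y1 = bbox
    out = [row[:] for row in image]        # copy so the input is untouched
    for x in range(x0, x1 + 1):            # top and bottom edges
        out[y0][x] = value
        out[y1][x] = value
    for y in range(y0, y1 + 1):            # left and right edges
        out[y][x0] = value
        out[y][x1] = value
    return out

img = [[0] * 8 for _ in range(8)]
marked = overlay_bbox(img, (2, 1, 5, 4))
```

Only the border is drawn, so the pixels inside the box stay untouched and the target object remains fully visible to the policy.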
Embodied large model LaST₀: new SOTA across dual-arm, mobile, and dexterous-hand tasks, the first to introduce a latent spatio-temporal chain-of-thought
量子位· 2026-02-07 07:02
Core Insights
- The article introduces LaST₀, a novel VLA model that uses a latent spatio-temporal chain-of-thought (CoT) for efficient reasoning in robotics, achieving state-of-the-art performance across a range of tasks [1][2][4].

Group 1: Model Overview
- LaST₀ integrates high-efficiency latent-space reasoning into embodied large models, surpassing previous methods such as Pi0.5 on dual-arm and humanoid dexterous-hand tasks [2][4].
- The model employs a Mixture-of-Transformers (MoT) architecture with a slow reasoning expert for low-frequency latent-space reasoning and a fast action expert for high-frequency action generation [5][11].

Group 2: Technical Innovations
- LaST₀ introduces a compact latent space that models future visual dynamics, 3D structural information, and robot proprioceptive state, enabling a coherent temporal reasoning process [4][10].
- The architecture coordinates the slow reasoning expert and the fast execution expert at asynchronous frequencies, optimizing real-time robotic operation [23].

Group 3: Performance Metrics
- In simulation, LaST₀ achieved an average success rate of 82% across 10 RLBench tasks, outperforming existing state-of-the-art methods by 8% to 21% [24].
- In real-world tasks, LaST₀ reached a 72% average success rate on the Franka platform, significantly exceeding competitors such as SpatialVLA (41%) and CoT-VLA (50%) [27].

Group 4: Implications for Robotics
- The model's ability to capture intricate physical and dynamic features through latent-space reasoning improves performance on complex robotic tasks, indicating potential for broader use in dynamic environments [9][28].
- LaST₀'s design allows effective interaction with the physical world, which is crucial for robust robotic operation in varied settings [9][12].
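The slow/fast expert split in Group 1 can be pictured as a two-rate control loop: the reasoning expert refreshes a latent plan at low frequency, while the action expert emits an action on every control tick from the most recent latent. A toy scheduler sketch under assumed names (the real MoT architecture shares attention across experts, which this does not model):

```python
def run_control_loop(slow_expert, fast_expert, obs_stream, slow_every=5):
    """Two-rate loop: the slow expert updates the latent plan every
    `slow_every` ticks; the fast expert acts on every tick using the
    latest latent. Illustrative only, not the paper's API."""
    latent = None
    actions = []
    for t, obs in enumerate(obs_stream):
        if t % slow_every == 0:                   # low-frequency reasoning
            latent = slow_expert(obs)
        actions.append(fast_expert(obs, latent))  # high-frequency action
    return actions

# toy experts: the slow expert "summarizes" the observation it saw,
# the fast expert combines that stale summary with the fresh observation
slow = lambda obs: obs * 10
fast = lambda obs, latent: latent + obs
acts = run_control_loop(slow, fast, list(range(10)), slow_every=5)
```

Note how ticks 1-4 reuse the latent computed at tick 0: the fast expert never blocks on the slow one, which is the point of the asynchronous-frequency coordination described in Group 2.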
Don't let vision drag down action in VLA!
具身智能之心· 2025-12-20 01:02
Core Insights
- The article discusses challenges and advances in Vision-Language-Action (VLA) models for robotics, focusing on a limitation of existing models: low-dimensional, sparse action signals supervise high-dimensional, dense visual inputs, which restricts overall performance [6][9].

Research Background
- VLA models have made significant progress but still suffer from the mismatch between action supervision signals and visual inputs, leaving the model's representational capacity underused [6].
- A visual prediction mechanism is proposed to enhance action generation by predicting future visual states, although high-dimensional visual states often contain redundant information that complicates training [8].

Proposed Solutions
- Decoupled Visual Forecasting (DVF) relieves the burden on the backbone network by automatically capturing implicit actions and enhancing explicit action generation [7].
- A progressive pre-training approach gradually integrates the different modalities, introducing language supervision to retain the VLA backbone's understanding and reasoning capabilities [7].
- Adaptive Temporal Ensemble (ATE) dynamically adjusts the ensemble strength during inference, reducing computational cost while keeping actions stable [14].

Architecture Design
- The DVF method uses implicit action queries and a separate diffusion DVF head, letting the model focus on frame-to-frame differences instead of predicting complete future frames [10].
- A progressive training scheme introduces visual, language, and action information in phases to avoid competition between modalities and achieve stable optimization [10].

Experimental Analysis
- Mantis, the proposed model, outperforms existing baselines on three of four task suites in the LIBERO benchmark, achieving the highest average success rate of 96.7% [16][18].
- Mantis converges significantly faster than traditional visual-prediction methods such as UnifiedVLA [20].
- Experiments confirm the effectiveness of language supervision in retaining the backbone's capabilities, with Mantis leading on both in-domain and out-of-domain instruction tasks [20].

Team Introduction
- The research team, SJTU Deng Lab, focuses on generative models and large language models, collaborates with renowned institutions, and maintains a strong publication record in top-tier journals and conferences [23].
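The Adaptive Temporal Ensemble idea, as summarized, fuses the multiple predictions that successive overlapping action chunks make for the same timestep; how strongly older predictions count is the ensemble strength that ATE adapts at inference. A minimal, non-adaptive sketch with exponential weighting (the weighting scheme is an assumption, not Mantis's exact rule):

```python
import math

def temporal_ensemble(predictions, alpha):
    """Exponentially weighted average of the predictions that successive
    overlapping action chunks made for one timestep.

    predictions[0] is the newest chunk's prediction; larger alpha trusts
    newer chunks more, and alpha = 0 reduces to a plain mean.
    """
    weights = [math.exp(-alpha * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

Per the summary, ATE would tune `alpha` (or an equivalent strength knob) on the fly rather than fixing it, trading smoothness against reactivity.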
EVOLVE-VLA: Test-time training for VLA models, breaking the imitation-learning bottleneck
具身智能之心· 2025-12-18 00:07
Group 1
- The core challenge for existing Vision-Language-Action (VLA) models is the limitation of the supervised fine-tuning (SFT) paradigm, which contrasts with human learning's emphasis on practice and feedback [2][3].
- The proposed solution is a test-time training (TTT) framework that lets VLA models learn continuously through environmental interaction, addressing the lack of oracle reward signals during deployment [4][6].

Group 2
- The framework's innovations include a test-time autonomous feedback mechanism that uses a pre-trained progress estimator (VLAC) to provide dense feedback signals, plus strategies to tame the noise inherent in the progress estimator [4][6].
- The method models robot manipulation tasks as a Markov Decision Process (MDP) and incorporates cumulative progress estimation and progressive horizon expansion to improve learning robustness [6][7].

Group 3
- Experiments show that EVOLVE-VLA achieves an average success rate of 95.8%, a 6.5% improvement over the SFT baseline, with significant gains on long-horizon tasks [16][18].
- In low-data scenarios, EVOLVE-VLA improves the success rate by 17.7%, reaching 61.3% with only one demonstration, underscoring its effectiveness in reducing data-collection costs [19][20].

Group 4
- The framework exhibits cross-task generalization, achieving a 20.8% success rate in zero-shot task transfer after autonomous exploration, a significant advance in task adaptability [22].
- Qualitative analysis reveals emergent capabilities absent from the demonstration data, such as error recovery and state adaptation, showcasing the model's flexibility [25][27].

Group 5
- The study identifies limitations such as misalignment between progress estimates and environmental success criteria, which can lead to reward hacking or misjudgment [33].
- Future directions include optimizing the reward model for better alignment, accelerating real-time deployment, and enhancing zero-shot generalization [34].
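The dense feedback in Group 2 can be read as rewarding estimated progress gained per step, with a monotone envelope as one simple way to tame estimator noise. Both functions below are illustrative sketches, not the paper's VLAC interface:

```python
def dense_rewards(progress_estimates):
    """Per-step reward = estimated progress gained. The rewards telescope,
    so they sum to (final - initial) estimated progress."""
    return [b - a for a, b in zip(progress_estimates, progress_estimates[1:])]

def cumulative_progress(progress_estimates):
    """Running maximum of the raw estimates: a crude noise-taming step
    that forbids the progress signal from moving backwards."""
    out, best = [], 0.0
    for p in progress_estimates:
        best = max(best, p)
        out.append(best)
    return out
```

A noisy trace like [0.0, 0.2, 0.5, 0.4, 1.0] yields a small negative reward at the dip; the monotone envelope [0.0, 0.2, 0.5, 0.5, 1.0] suppresses it, at the cost of ignoring genuine regressions.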
GLaD: Knowledge distillation injects 3D geometric priors into VLA models, pushing task success rates past 94%
具身智能之心· 2025-12-12 01:22
Group 1
- The article introduces the GLaD framework, which integrates 3D geometric priors into Vision-Language-Action (VLA) models to improve performance on robotic control tasks without additional depth sensors or 3D annotations [2][4][28].
- Existing VLA models rely primarily on 2D visual encoders, which limits their ability to understand 3D spatial information and leads to inaccurate task execution [2][4].
- GLaD's architecture consists of a geometric distillation module and a staged training strategy, enabling effective integration of geometric knowledge into the VLA model [7][10].

Group 2
- The geometric distillation module is GLaD's core innovation: it aligns the hidden states of visual tokens in the LLM with features from a geometric-perception teacher model, achieving deep integration of geometric knowledge [9][10].
- Training is divided into two phases: the first performs geometric distillation pre-training on the Bridge dataset; the second fine-tunes the model for downstream tasks such as LIBERO [12][13].
- GLaD achieved an average success rate of 94.1% on the LIBERO benchmark, outperforming baseline models such as UniVLA and OpenVLA [14][16].

Group 3
- The LIBERO benchmark comprises 130 language-conditioned manipulation tasks in four suites, assessing aspects of model performance including spatial knowledge transfer and long-horizon capability [17][19].
- GLaD showed strong robustness under object perturbation, achieving an 81% success rate on the GOAL suite versus 62% for UniVLA [16][19].
- Ablation studies confirmed GLaD's key design choices, showing that late-stage alignment of the LLM's final layer significantly improves task performance [20][26].

Group 4
- The article highlights the core value of geometric understanding: GLaD's ability to focus on task-relevant objects is a key factor behind its high success rates [23][25].
- Choosing the VGGT geometric encoder over alternatives yielded a 29.8-percentage-point improvement on the SPATIAL suite, demonstrating its suitability for spatial reasoning [25][26].
- Future directions include more precise spatial-relationship modeling to address current limits in spatial-layout generalization [27][28].
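Group 2's distillation module aligns student visual-token hidden states with the teacher's geometric features. A generic cosine-alignment objective is one common way to write such a loss; the sketch below assumes that form (GLaD's exact loss may differ):

```python
import math

def cosine_alignment_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired feature vectors: zero when
    every student feature points in its teacher feature's direction,
    regardless of magnitude."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    pairs = list(zip(student_feats, teacher_feats))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)
```

In the GLaD setup, the "student" features would be the LLM's visual-token hidden states and the "teacher" features would come from the frozen geometric encoder (e.g. VGGT), projected to a common dimension.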
LatBot: Chinese Academy of Sciences team proposes latent action distillation to improve few-shot transfer efficiency for robot VLA models
具身智能之心· 2025-12-04 00:04
Group 1
- The article emphasizes the importance of latent action learning in vision-language-action (VLA) models: extracting compressed motion semantics from consecutive frames to build a universal representation independent of any particular robot embodiment [2].
- Existing latent action models (LAMs) face three main challenges: lack of task-instruction guidance, insufficient use of multi-frame information, and an overemphasis on visual appearance changes without physical perception [2].

Group 2
- The proposed method decouples the latent action representation into two complementary sets of learnable tokens: scene tokens capture passive environmental changes, while motion tokens encode the robot's active movements [4][7].
- A unified decoder, initialized from a pre-trained image generation model, conditions on the latent actions to jointly guide future-frame reconstruction and inter-frame action generation [5].

Group 3
- The knowledge-distillation strategy for transferring latent action knowledge into the VLA model combines two loss components: a latent-action alignment loss and a reasoning-retention loss, so the student model learns physical perception while keeping its reasoning capabilities [8][9].
- The overall distillation objective balances latent-action alignment against reasoning retention, with fine-tuning focused on converting latent representations into executable robot actions [9].

Group 4
- Experiments demonstrate the framework's superior performance in both simulation and real-robot environments, particularly in few-shot transfer across five complex tasks [10][12].
- The combination of the decoupled latent action representation and the unified action decoder significantly raises success rates, validating the design [13].

Group 5
- The article concludes that, through task-instruction guidance, multi-frame input, and the integration of physical priors, a universal and transferable latent action representation can be learned [18].
- Future directions include extracting additional latent tokens from larger and more diverse manipulation videos to further extend the VLA model's capabilities on complex, long-horizon, multi-embodiment robotic tasks [18].
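The scene/motion decoupling in Group 2 and the balanced objective in Group 3 can be illustrated with a deliberately crude toy: split the per-pixel frame change by a robot mask (active motion inside, passive scene change outside), and combine the two named losses with an assumed weight. LatBot's tokens are learned, not mask-derived, so everything here is a didactic stand-in:

```python
def decouple_latent(prev_frame, next_frame, robot_mask):
    """Toy decoupling of frame change into 'motion' (inside the robot
    mask, i.e. the robot's active movement) and 'scene' (outside it,
    passive environment change). Frames are flat lists of pixel values."""
    motion, scene = [], []
    for p, n, m in zip(prev_frame, next_frame, robot_mask):
        (motion if m else scene).append(n - p)
    return scene, motion

def distill_objective(align_loss, retain_loss, lam=0.5):
    """Balance latent-action alignment against reasoning retention.
    lam is an assumed trade-off weight, not the paper's value."""
    return align_loss + lam * retain_loss
```

The real model learns this separation end-to-end from instruction-conditioned multi-frame input; the mask here just makes the two-token decomposition concrete.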
E0: A new discrete diffusion framework that substantially improves VLA generalization and manipulation precision
具身智能之心· 2025-11-29 02:07
Group 1
- The article argues that robots operating in open environments need three core capabilities: complex visual scene perception, natural-language instruction understanding, and precise action generation [1][3].
- Existing methods face significant bottlenecks, including insufficient generalization, coarse action control, and contradictions between modeling paradigms [3][4].
- The proposed framework introduces a continuous-action discretization strategy that improves inference stability and enables fine-grained control [6][8].

Group 2
- The architecture uses the open-source PaliGemma VLM as a backbone and adds a 300-million-parameter action-expert network that optimizes action generation through a diffusion model [6][10].
- Training involves multi-modal observation encoding, action discretization, and Gaussian noise injection to ensure temporal consistency [8][9].
- Inference initializes a noise action sequence, performs multi-step denoising, and applies deterministic de-discretization to produce executable action chunks [10][11].

Group 3
- The model achieves state-of-the-art (SOTA) performance across three benchmarks (LIBERO, VLABench, ManiSkill), with an average success rate exceeding the baseline by 10.7% [21].
- On LIBERO, the model achieved a 96% average success rate, demonstrating superior grasping and instruction-following capabilities [21].
- The model also excels at high-precision tasks, reaching a 55.2% average success rate on ManiSkill, significantly outperforming baseline models [24][28].

Group 4
- The article identifies limitations such as insufficient semantic alignment on certain tasks, difficulty with complex coordination tasks, and inadequate modeling of mechanical interaction [32][35].
- Future directions include strengthening cross-modal alignment for semantically rich tasks, designing adaptive task-sampling strategies, and integrating physical-model priors to improve control precision [35].
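The action discretization at training time and the deterministic de-discretization at inference (Group 2) form the familiar bin/centre round trip. A uniform-binning sketch — bin count and ranges are illustrative, since the summary does not give E0's tokenizer details:

```python
def discretize(value, low, high, n_bins):
    """Map a continuous action dimension into one of n_bins uniform bins,
    clamping to the valid range first."""
    value = min(max(value, low), high)
    frac = (value - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)

def undiscretize(bin_idx, low, high, n_bins):
    """Deterministic inverse: return the bin centre, so the round-trip
    error is at most half a bin width."""
    width = (high - low) / n_bins
    return low + (bin_idx + 0.5) * width
```

With 256 bins over [-1, 1], each bin is ~0.008 wide, bounding the quantization error the denoised discrete sequence incurs when converted back to executable continuous actions.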
NUS proposes VLA-4D: a 4D-aware VLA model for spatio-temporally coherent robotic manipulation
具身智能之心· 2025-11-25 00:03
Core Concept
- The article introduces the 4D-perception model VLA-4D, which improves the spatial and temporal coherence of robotic manipulation by integrating spatial and temporal information into both visual reasoning and action planning [2][4].

Group 1: Model Design and Technical Details
- VLA-4D innovates through dual spatial-temporal fusion: embedding 4D (3D space + 1D time) information into visual representations for reasoning, and incorporating time variables into action representations for planning [5].
- 2D VLA models rely on single-frame image input, leading to coarse visual reasoning and spatial inaccuracy, while 3D VLA models lack explicit temporal modeling, resulting in motion stutter [6].
- A "4D embedding + cross-attention fusion" representation is designed to address the lack of spatial-temporal precision in visual reasoning [7][10].

Group 2: Dataset and Training Process
- Because existing VLA datasets lack temporal action annotations, the authors extend the LIBERO dataset into 40 sub-tasks with 150,000 visual-language-action samples [15][16].
- A two-stage training process significantly improves task success rates and reduces execution time compared with single-stage fine-tuning [17][18].

Group 3: Experimental Validation and Key Findings
- On the LIBERO benchmark, VLA-4D outperforms state-of-the-art models with a 97.4% success rate and a 5.8-second average completion time across tasks [19][21].
- The model generalizes better on zero-shot tasks, maintaining higher success rates and shorter execution times [20].
- Ablation studies confirm the necessity of the visual representation modules, showing that combining spatial and temporal embeddings raises success rates and shortens completion times [24][27].
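The "4D embedding" in Group 1 amounts to giving each visual token features for where it is in 3-D space and when it occurs in time. One simple realization is sinusoidal features per axis, concatenated; feature sizes and the frequency schedule below are illustrative, not VLA-4D's actual design:

```python
import math

def sinusoidal(v, n_freq=2):
    """Sin/cos features of v at doubling frequencies (2 * n_freq values)."""
    return [f(v * (2 ** k)) for k in range(n_freq) for f in (math.sin, math.cos)]

def embed_4d(xyz, t):
    """Concatenate per-axis positional features for 3-D space with
    features for 1-D time, yielding one '4D (3D space + 1D time)'
    embedding vector per token."""
    emb = []
    for v in (*xyz, t):
        emb.extend(sinusoidal(v))
    return emb

vec = embed_4d((0.1, 0.2, 0.3), t=0.5)   # 4 axes x 4 features = 16 dims
```

In the model these embeddings would then be fused with the visual tokens via cross-attention, which is the second half of the "4D embedding + cross-attention fusion" design.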
NTU proposes NORA-1.5: a VLA model based on a world model and action rewards
具身智能之心· 2025-11-21 00:04
Core Insights
- The article introduces NORA-1.5, a Vision-Language-Action (VLA) model that combines a flow-matching action expert with reward-driven Direct Preference Optimization (DPO) to address the generalization and reliability problems of existing VLA models [1][3].

Architecture and Key Issues
- The architecture centers on co-optimizing flow matching with the VLA backbone, targeting the pain points of reliability and generalization in real-world environments [3].
- The core solution adds a flow-matching action expert on top of the pre-trained NORA backbone, together with a dual-component reward model and DPO post-training [3].

Flow-Matching Action Experts
- The independent action expert regresses action sequences directly from the visual-language key-value pairs encoded by the VLA backbone, minimizing the difference between predicted and target velocities [5].
- The dual-component reward mechanism balances goal orientation and stability, with rewards derived from world-model-guided target rewards and real-action deviation rewards [6][9].

Training Process
- Training consists of two phases: joint training of the action expert, then DPO post-training [7].
- The model uses the Qwen-2.5-VL-3B vision-language model and the Open X-Embodiment dataset for pre-training, employing a FAST+ action tokenizer for efficient discretization of diverse action sequences [8].

Experimental Findings and Performance
- On the SimplerEnv benchmark, the model outperformed existing state-of-the-art (SOTA) models, with success rates of 56.0% on picking up a Coke can and 60.0% on move-near, and an overall average improvement of 4.9% after DPO [11].
- On the LIBERO benchmark, the model improved success rates on long-horizon tasks by 1.0%, reaching a 95.0% average and surpassing other SOTA models [11].

Key Differences and Real-World Evaluation
- Flow matching performed better in large-data regimes, while smaller-data settings required more joint training [14].
- In real-robot evaluations, NORA-1.5 improved success rates by 13%-46% across nine pick-and-place tasks, with significant gains on unseen objects and instructions [15].

Reward Optimization
- Combining the WM (subgoal) and GTA rewards proved the most stable in real-world scenarios, avoiding the noise or bias of any single reward [17].
- The subgoal reward outperformed the endgoal reward by 1.7% on average, particularly in complex environments [19].
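The flow-matching training signal ("minimizing the difference between predicted and target velocities") is, in its common rectified-flow form, a regression onto the constant straight-line velocity between a noise sample and the target action. A sketch of that objective family — the summary does not give NORA-1.5's exact formulation:

```python
def interpolate(a0, a1, t):
    """Point at time t on the straight path from a0 (noise sample) to
    a1 (target action); the network sees this as its input."""
    return [(1 - t) * x + t * y for x, y in zip(a0, a1)]

def flow_matching_loss(v_pred, a0, a1):
    """Mean squared error between the predicted velocity and the
    straight-line target a1 - a0 (constant along the whole path)."""
    target = [y - x for x, y in zip(a0, a1)]
    return sum((vp - tg) ** 2 for vp, tg in zip(v_pred, target)) / len(target)
```

At inference the learned velocity field is integrated from noise to an action sequence in a few steps, which is why flow-matching experts can regress continuous actions without per-step diffusion denoising schedules.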
VLA models collectively failing? Fudan & 创智 team led by Prof. Qiu Xipeng proposes LIBERO-Plus, revealing the truth about VLA fragility
具身智能之心· 2025-10-29 00:03
Core Insights
- The article presents a robustness analysis of Vision-Language-Action (VLA) models, revealing significant generalization deficiencies despite high scores under ideal conditions [2][4][6].
- The LIBERO-Plus framework is introduced to systematically evaluate VLA models across perturbation dimensions, exposing the gap between surface performance and actual generalization [4][6][33].

Group 1: Motivation and Contributions
- VLA models post impressive success rates on benchmarks such as LIBERO, but existing evaluation methods fail to assess stability and reliability under real-world variation [4][6].
- LIBERO-Plus evaluates models along seven perturbation dimensions: object placement, camera angle, robot initial pose, language instructions, lighting conditions, background textures, and sensor noise [4][6].
- The framework provides a fine-grained analysis of VLA generalization through systematic perturbation [4][6].

Group 2: Performance Analysis
- The analysis shows that VLA models are broadly vulnerable to perturbation, with performance declining along every dimension [13][32].
- Models are most sensitive to changes in camera viewpoint and robot initial state, indicating the need for stronger spatial and proprioceptive understanding [13][32].
- Language perturbations cause the smallest average performance drop (-25.3%), a surprising level of robustness that warrants further investigation [15][17].

Group 3: Findings on Model Behavior
- Some models maintain performance even with empty language inputs, suggesting they ignore the language modality and behave more like vision-action (VA) models [16][19].
- VLA models struggle with cross-object instruction following, relying on fixed visual-action mappings rather than fully exploiting language signals [19][20].
- The models adapt remarkably well to background changes while showing limited sensitivity to lighting variations, raising questions about the representations they learn [20][27].

Group 4: Combination Generalization
- The article introduces a "combination generalization gap": negative interactions between combined perturbations exceed the independent effects of the single perturbations [29][32].
- The analysis indicates that current VLA models, with their entangled representations, cannot effectively handle complex multi-dimensional perturbations [32].

Group 5: LIBERO-Plus Benchmark
- The LIBERO-Plus benchmark consists of 10,030 tasks, constructed with perturbation-augmentation strategies, for evaluating model performance under varied perturbations [33][36].
- The benchmark offers comprehensive coverage of the seven perturbation dimensions and fine-grained difficulty levels [36].
- Models trained on the augmented data achieved a 79.6% average success rate on LIBERO-Plus, significantly outperforming baseline models [38].
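The per-dimension evaluation in Group 1 boils down to: fix a base task set, apply one perturbation dimension at a time, and record the success-rate drop relative to the unperturbed rate. A harness sketch with assumed names and a toy task type (LIBERO-Plus's real API is not described in the summary):

```python
def robustness_report(evaluate, base_tasks, perturbations):
    """For each named perturbation, perturb every base task, re-evaluate,
    and record the success rate and its drop from the unperturbed rate.
    `evaluate` maps a task list to a success rate in [0, 1]."""
    base = evaluate(base_tasks)
    report = {}
    for name, perturb in perturbations.items():
        rate = evaluate([perturb(task) for task in base_tasks])
        report[name] = {"success": rate, "drop": base - rate}
    return report

# toy stand-in: "tasks" are ints, "success" means the task id is even,
# and the perturbation shifts every id by one (so every success flips)
even_rate = lambda tasks: sum(t % 2 == 0 for t in tasks) / len(tasks)
report = robustness_report(even_rate, [0, 2, 4], {"shift": lambda t: t + 1})
```

Measuring combined perturbations (the "combination generalization gap") would compose several `perturb` functions on the same task set and compare the joint drop with the sum of the single-dimension drops.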