The FVLMoE Module

ForceVLA: Enhancing VLA Models with Force-Aware MoE for Contact-Rich Manipulation
具身智能之心· 2025-06-18 10:41
**Research Background and Problem Statement**
- The article discusses the limitations of existing Vision-Language-Action (VLA) models in robot manipulation, particularly in tasks requiring fine force control under visual occlusion or dynamic uncertainty. These models rely heavily on visual and language cues while neglecting force sensing, which is crucial for precise physical interaction [4].

**Core Innovations**
- **ForceVLA Framework**: A novel end-to-end manipulation framework that treats external force sensing as a first-class modality within the VLA system. It introduces the Force-aware Mixture-of-Experts (FVLMoE) module, which dynamically integrates pre-trained visual-language embeddings with real-time 6-axis force feedback during action decoding, enabling robots to adapt to subtle contact dynamics [6][8].
- **FVLMoE Module**: This module performs context-aware routing across modality-specific experts, enhancing the physical interaction capabilities of VLA systems by dynamically processing and integrating force, visual, and language features [7][8].
- **ForceVLA-Data Dataset**: A new dataset created to support the training and evaluation of force-aware manipulation policies, containing synchronized visual, proprioceptive, and force-torque signals across five contact-rich tasks. The dataset will be open-sourced to promote community research [9].

**Methodology Details**
- **Overall Architecture**: Built on the π₀ framework, ForceVLA integrates visual, language, proprioceptive, and 6-axis force feedback and generates actions through a conditional flow-matching model. Visual inputs and task instructions are encoded into context embeddings, which are combined with proprioceptive and force cues to predict action trajectories [11] (see the flow-matching sketch at the end of this summary).
- **FVLMoE Module Design**: The module treats force features as an independent input injected after visual-language processing and uses a sparse mixture-of-experts layer that dynamically selects the most suitable experts for each token, strengthening the integration of multimodal features [12][14] (a minimal code sketch of this fusion pattern also appears at the end of this summary).

**Experimental Results**
- **Performance Evaluation**: Evaluation was conducted on five contact-rich tasks, with task success rate as the primary metric. ForceVLA achieved an average success rate of 60.5%, significantly outperforming the π₀-base model without force feedback, which reached 37.3% [25].
- **Ablation Studies**: The experiments showed that the adaptive fusion provided by the FVLMoE module reached an 80% success rate, validating the importance of injecting force after visual-language model encoding [23][26].
- **Multi-task Evaluation**: ForceVLA exhibited strong multi-task capability, achieving an average success rate of 67.5% across tasks and a 100% success rate on the plug-insertion task, showcasing its ability to leverage multimodal cues within a shared policy [27].
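
To make the force-aware fusion idea concrete, below is a minimal PyTorch sketch of a mixture-of-experts layer in the spirit of the FVLMoE design described above: a 6-axis force/torque reading is projected into the token space, appended to the vision-language embeddings, and every token is routed to a small set of experts by a gating network. All names and hyperparameters (`FVLMoESketch`, `num_experts`, `top_k`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a force-aware mixture-of-experts fusion layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FVLMoESketch(nn.Module):
    def __init__(self, d_model=512, force_dim=6, num_experts=4, top_k=2):
        super().__init__()
        # Project the raw 6-axis force/torque reading into the token space
        # so it can be fused with the pre-trained vision-language embeddings.
        self.force_proj = nn.Linear(force_dim, d_model)
        # A single gating network scores all experts for every token.
        self.gate = nn.Linear(d_model, num_experts)
        # Feed-forward experts; the router decides which experts handle
        # which (visual, language, or force) tokens.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, vl_tokens, force):
        # vl_tokens: (B, T, d_model) vision-language context embeddings
        # force:     (B, 6) latest force/torque measurement
        force_token = self.force_proj(force).unsqueeze(1)     # (B, 1, d)
        tokens = torch.cat([vl_tokens, force_token], dim=1)   # (B, T+1, d)

        # Context-aware routing: keep only the top-k experts per token.
        logits = self.gate(tokens)                            # (B, T+1, E)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                     # (B, T+1)
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(tokens[mask])
        return out  # fused multimodal features passed on to the action decoder
```

The key design choice this sketch illustrates is that force enters *after* the visual-language backbone, as its own token, so the router can learn when contact information should dominate the fused representation.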
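
The summary also notes that actions are generated with a conditional flow-matching model conditioned on the multimodal context. The sketch below shows one way such an action head can be wired up, assuming a π₀-style setup where a velocity network is regressed toward the straight-line flow from noise to the expert action chunk and integrated with Euler steps at inference. Class and parameter names (`ActionFlowHead`, `horizon`, `action_dim`) are hypothetical.

```python
# Hedged sketch of a conditional flow-matching action head.
import torch
import torch.nn as nn


class ActionFlowHead(nn.Module):
    def __init__(self, d_context=512, action_dim=7, horizon=16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        # Velocity network: takes (context, noisy action chunk, flow time).
        self.net = nn.Sequential(
            nn.Linear(d_context + horizon * action_dim + 1, 1024), nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def velocity(self, a_t, t, context):
        # a_t: (B, horizon*action_dim), t: (B, 1), context: (B, d_context)
        # where context is the pooled multimodal feature (e.g. FVLMoE output).
        return self.net(torch.cat([context, a_t, t], dim=-1))

    def loss(self, actions, context):
        # Flow-matching objective: regress the constant velocity of the
        # straight-line path from Gaussian noise x0 to the expert chunk x1.
        B = actions.shape[0]
        x1 = actions.reshape(B, -1)
        x0 = torch.randn_like(x1)
        t = torch.rand(B, 1, device=actions.device)
        x_t = (1 - t) * x0 + t * x1
        return ((self.velocity(x_t, t, context) - (x1 - x0)) ** 2).mean()

    @torch.no_grad()
    def sample(self, context, steps=10):
        # Integrate the learned ODE from noise to an action chunk (Euler).
        B = context.shape[0]
        x = torch.randn(B, self.horizon * self.action_dim, device=context.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((B, 1), i * dt, device=context.device)
            x = x + dt * self.velocity(x, t, context)
        return x.reshape(B, self.horizon, self.action_dim)
```

In this reading, the FVLMoE output supplies the conditioning `context`, so force feedback influences every denoising step of the predicted action trajectory.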