Vision-Language-Action (VLA)
NeurIPS 2025 | Human-Cognition-Aligned CogVLA Breaks Through the VLA Efficiency and Performance Bottleneck
具身智能之心· 2025-09-19 05:43
Core Insights
- The article discusses CogVLA, a new model that addresses the efficiency challenges and semantic degradation in Vision-Language-Action (VLA) research, building on the capabilities of pre-trained Vision-Language Models (VLMs) [5][6][10].

Group 1: Background and Challenges
- The transition from large models to embodied intelligence faces efficiency dilemmas and semantic degradation; existing VLA methods often neglect the semantic coupling between perception, language alignment, and action decoding [5].
- Key challenges include redundant perception, instruction-semantic disconnection, and action incoherence, which limit the performance of traditional VLA models [6][10].

Group 2: Proposed Solution
- CogVLA introduces a cognition-aligned three-stage design that mimics human multimodal coordination, consisting of EFA-Routing, LFP-Routing, and CAtten [12][14].
- EFA-Routing performs instruction-driven visual aggregation, LFP-Routing performs semantic pruning inside the language model, and CAtten enforces semantic consistency and action-sequence coherence [16] (a minimal conceptual sketch of this three-stage pipeline follows after this summary).

Group 3: Experimental Results
- CogVLA outperforms strong baselines such as OpenVLA-OFT and π0, achieving a state-of-the-art (SOTA) success rate of 97.4% on LIBERO while maintaining an 8× visual compression ratio [18].
- Compared to OpenVLA, inference latency is reduced 2.79×, throughput is increased 22.54×, and training cost is lowered 2.49× [20].

Group 4: Visualization and Performance
- Visual analysis shows that CogVLA focuses on task-relevant regions of the input image, demonstrating human-aligned perception even in cluttered or ambiguous scenes [21].
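To make the three-stage idea concrete, below is a minimal, hypothetical sketch of instruction-driven visual aggregation (EFA-Routing), language-side token pruning (LFP-Routing), and a coupled action-attention head (CAtten). The module names follow the article, but the shapes, the top-k scoring, the 8:1 compression wiring, and the 7-DoF action head are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CogVLA's three-stage routing; shapes and scoring are assumed.
import torch
import torch.nn as nn


class EFARouting(nn.Module):
    """Instruction-driven visual aggregation: keep 1/ratio of the visual tokens."""
    def __init__(self, dim: int, ratio: int = 8):
        super().__init__()
        self.ratio = ratio
        self.score = nn.Linear(dim, 1)  # simplified instruction-conditioned saliency score

    def forward(self, vis: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens, instr: (B, D) pooled instruction embedding
        logits = self.score(vis + instr.unsqueeze(1)).squeeze(-1)            # (B, N)
        k = max(1, vis.shape[1] // self.ratio)
        idx = logits.topk(k, dim=1).indices                                  # top-k salient tokens
        return torch.gather(vis, 1, idx.unsqueeze(-1).expand(-1, -1, vis.shape[-1]))


class LFPRouting(nn.Module):
    """Language-side pruning: drop low-salience multimodal tokens before action decoding."""
    def __init__(self, dim: int, keep: float = 0.5):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        logits = self.score(tokens).squeeze(-1)                              # (B, T)
        k = max(1, int(tokens.shape[1] * self.keep))
        idx = logits.topk(k, dim=1).indices
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))


class CAtten(nn.Module):
    """Coupled attention: learned action queries attend to the pruned multimodal context."""
    def __init__(self, dim: int, n_steps: int = 8, action_dim: int = 7, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_steps, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, action_dim)  # e.g. one 7-DoF action per chunk step (assumed)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(ctx.shape[0], -1, -1)
        out, _ = self.attn(q, ctx, ctx)
        return self.head(out)                                                # (B, n_steps, action_dim)


if __name__ == "__main__":
    B, N, T, D = 2, 256, 32, 64
    vis, instr, text = torch.randn(B, N, D), torch.randn(B, D), torch.randn(B, T, D)
    vis_small = EFARouting(D, ratio=8)(vis, instr)            # 256 -> 32 visual tokens (8x compression)
    ctx = LFPRouting(D)(torch.cat([vis_small, text], dim=1))  # prune the fused token stream
    actions = CAtten(D)(ctx)                                  # coherent action chunk
    print(actions.shape)                                      # torch.Size([2, 8, 7])
```

The point of the sketch is the ordering: visual tokens are compressed under instruction guidance before entering the language model, pruned again inside it, and only then decoded into a coherent action sequence, which is where the reported efficiency gains would come from.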
Humanoid Robot Bridges the Gap Between Visual Perception and Motion for the First Time; UC Berkeley Chinese PhD Student Gives a Live Demo on the Unitree G1
量子位· 2025-06-25 05:00
Core Viewpoint
- The article discusses the LeVERB framework, developed by teams from UC Berkeley and Carnegie Mellon University, which enables humanoid robots to understand language commands and perform complex actions in new environments without prior training [1][3].

Group 1: LeVERB Framework Overview
- LeVERB bridges the gap between visual-semantic understanding and physical movement, allowing robots to perceive their environment and execute commands the way humans do [3][12].
- The framework is a hierarchical dual system that uses a "latent action vocabulary" as the interface between high-level understanding and low-level action execution [17][20] (a minimal conceptual sketch of this interface follows after this summary).
- The high-level component, LeVERB-VL, processes visual and language inputs to generate abstract latent commands, while the low-level component, LeVERB-A, translates these commands into executable whole-body actions [23][24].

Group 2: Performance and Testing
- The framework was tested on the Unitree G1 robot, achieving an 80% zero-shot success rate on simple visual navigation tasks and an overall task success rate of 58.5%, a 7.8× improvement over traditional methods [10][36].
- LeVERB-Bench, a benchmark for humanoid whole-body control (WBC), includes over 150 tasks and aims to provide realistic training data for vision-language-action models [7][26].
- The benchmark covers diverse tasks such as navigation, reaching, and sitting, with 154 vision-language tasks and 460 language-only tasks in total, yielding extensive realistic motion-trajectory data [30][31].

Group 3: Technical Innovations
- The framework uses techniques such as ray tracing for realistic scene rendering and motion-capture data to improve the quality of the training datasets [27][30].
- Training optimizes the model through trajectory reconstruction and adversarial classification, ensuring efficient processing of visual-language information [23][24].
- Ablation studies indicate that components such as the discriminator and the kinematics encoder are crucial for maintaining model performance and improving generalization [38].
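A minimal sketch, under assumed dimensions, of what a hierarchical dual system with a latent-action-vocabulary interface could look like: a high-level vision-language module (a stand-in for LeVERB-VL) emits a compact latent command, and a low-level policy (a stand-in for LeVERB-A) maps that latent plus proprioception to joint-level targets. The layer sizes, the 32-dimensional latent, the 45-dimensional proprioception vector, and the 23 joint targets are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a hierarchical latent-command interface; all dimensions are assumed.
import torch
import torch.nn as nn


class HighLevelVL(nn.Module):
    """Stand-in for LeVERB-VL: fuse vision + language features into a latent action command."""
    def __init__(self, vis_dim: int = 512, txt_dim: int = 512, latent_dim: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([vis_feat, txt_feat], dim=-1))   # (B, latent_dim)


class LowLevelPolicy(nn.Module):
    """Stand-in for LeVERB-A: map latent command + proprioception to joint-level targets."""
    def __init__(self, latent_dim: int = 32, proprio_dim: int = 45, n_joints: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, n_joints),
        )

    def forward(self, z: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, proprio], dim=-1))            # (B, n_joints)


if __name__ == "__main__":
    B = 4
    vis, txt = torch.randn(B, 512), torch.randn(B, 512)
    proprio = torch.randn(B, 45)
    z = HighLevelVL()(vis, txt)                   # high level runs at a slow planning rate
    joint_targets = LowLevelPolicy()(z, proprio)  # low level runs at control frequency
    print(joint_targets.shape)                    # torch.Size([4, 23])
```

The design point the sketch illustrates is the decoupling: the vision-language module only needs to produce a compact latent command, so the whole-body controller can be trained and run separately at control frequency while still being steered by language and vision.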