Core Insights
- The article discusses CogVLA, a new model that addresses the efficiency challenges and semantic degradation in Vision-Language-Action (VLA) research, which is driven by the capabilities of pre-trained Vision-Language Models (VLMs) [5][6][10].

Group 1: Background and Challenges
- The transition from large models to embodied intelligence faces efficiency dilemmas and semantic degradation; existing VLA methods often neglect the semantic coupling among perception, language alignment, and action decoding [5].
- Key challenges include redundant perception, instruction-semantic disconnection, and action incoherence, which hinder the performance of traditional VLA models [6][10].

Group 2: Proposed Solution
- CogVLA introduces a cognition-aligned three-stage design that mimics human multimodal coordination, consisting of EFA-Routing, LFP-Routing, and CAtten [12][14].
- EFA-Routing performs instruction-driven visual aggregation, LFP-Routing prunes semantically redundant tokens inside the language model, and CAtten enforces semantic consistency and action-sequence coherence (an illustrative sketch follows at the end of this digest) [16].

Group 3: Experimental Results
- CogVLA outperforms advanced models such as OpenVLA-OFT and π0, achieving a state-of-the-art (SOTA) success rate of 97.4% on LIBERO while maintaining an 8× visual compression ratio [18].
- Compared with OpenVLA, it significantly improves efficiency: inference time is reduced by 2.79×, throughput is increased by 22.54×, and training cost is lowered by 2.49× [20].

Group 4: Visualization and Performance
- Visual analysis shows that CogVLA focuses on task-relevant regions of the input images, demonstrating human-aligned perception even in cluttered or ambiguous scenes [21].
NeurIPS 2025 | Human-cognition-aligned CogVLA breaks through the VLA efficiency and performance bottleneck
具身智能之心·2025-09-19 05:43
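The following is a minimal, illustrative PyTorch sketch of the three-stage pipeline summarized in Group 2: instruction-driven visual aggregation (EFA-Routing, with the reported 8× compression), token pruning inside the language model (LFP-Routing), and an action decoder coupling the pruned context to an action chunk (CAtten). The module internals shown here, including the group-wise softmax aggregation, top-k gating, learned action queries, and all hyperparameters, are assumptions made for illustration and are not the paper's exact design.

```python
# Illustrative sketch of CogVLA's three-stage routing pipeline (assumed internals).
import torch
import torch.nn as nn


class EFARouting(nn.Module):
    """Stage 1 (assumed form): instruction-driven aggregation of visual tokens,
    compressing N tokens down to N / ratio (the paper reports an 8x ratio)."""
    def __init__(self, dim: int, ratio: int = 8):
        super().__init__()
        self.ratio = ratio
        self.score = nn.Linear(dim * 2, 1)  # instruction-conditioned relevance score

    def forward(self, vis: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens; instr: (B, D) pooled instruction embedding
        B, N, D = vis.shape
        cond = instr[:, None, :].expand(B, N, D)
        w = self.score(torch.cat([vis, cond], dim=-1)).squeeze(-1)       # (B, N)
        w = w.view(B, N // self.ratio, self.ratio).softmax(dim=-1)       # group-wise weights
        grouped = vis.view(B, N // self.ratio, self.ratio, D)
        return (w.unsqueeze(-1) * grouped).sum(dim=2)                    # (B, N/ratio, D)


class LFPRouting(nn.Module):
    """Stage 2 (assumed form): prune tokens inside the language model,
    keeping only the top-k most relevant ones."""
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.gate = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        idx = self.gate(tokens).squeeze(-1).topk(k, dim=-1).indices      # (B, k)
        return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))


class CAtten(nn.Module):
    """Stage 3 (assumed form): learned action queries cross-attend to the pruned
    multimodal context so the decoded action chunk stays coherent."""
    def __init__(self, dim: int, n_heads: int = 8, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        out, _ = self.attn(q, ctx, ctx)
        return self.head(out)                                            # (B, horizon, action_dim)


if __name__ == "__main__":
    B, N, D = 2, 256, 512
    vis, instr = torch.randn(B, N, D), torch.randn(B, D)
    ctx = LFPRouting(D)(EFARouting(D, ratio=8)(vis, instr))              # 256 -> 32 -> 16 tokens
    actions = CAtten(D)(ctx)
    print(actions.shape)  # torch.Size([2, 8, 7])
```

The sketch only conveys the division of labor between the three stages: compression happens before the language model (EFA-Routing), further pruning happens inside it (LFP-Routing), and the action head reads the retained context as a whole (CAtten); the actual CogVLA architecture should be taken from the paper itself.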