Vision-Language-Action (VLA)
NeurIPS 2025 | Human-Cognition-Aligned CogVLA Breaks Through the VLA Efficiency and Performance Bottleneck
具身智能之心· 2025-09-19 05:43
Core Insights
- The article discusses CogVLA, a new model that addresses the efficiency challenges and semantic degradation in Vision-Language-Action (VLA) research, building on the capabilities of pre-trained Vision-Language Models (VLMs) [5][6][10].

Group 1: Background and Challenges
- The transition from large models to embodied intelligence faces efficiency dilemmas and semantic degradation; existing VLA methods often neglect the semantic coupling between perception, language alignment, and action decoding [5].
- Key challenges include redundant perception, instruction-semantic disconnection, and action incoherence, which limit the performance of traditional VLA models [6][10].

Group 2: Proposed Solution
- CogVLA introduces a cognition-aligned three-stage design that mimics human multimodal coordination, consisting of EFA-Routing, LFP-Routing, and CAtten [12][14].
- EFA-Routing performs instruction-driven visual aggregation, LFP-Routing performs semantic pruning inside the language model, and CAtten enforces semantic consistency and action-sequence coherence [16] (a minimal conceptual sketch of this three-stage pipeline follows after this summary).

Group 3: Experimental Results
- CogVLA outperforms strong baselines such as OpenVLA-OFT and π0, achieving a state-of-the-art (SOTA) success rate of 97.4% on LIBERO while maintaining an 8× visual compression ratio [18].
- Compared to OpenVLA, inference latency is reduced 2.79×, throughput is increased 22.54×, and training cost is lowered 2.49× [20].

Group 4: Visualization and Performance
- Visual analysis shows that CogVLA focuses on task-relevant regions of the input image, demonstrating human-aligned perception even in cluttered or ambiguous scenes [21].
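To make the three-stage idea concrete, below is a minimal, hypothetical sketch of instruction-driven visual aggregation (EFA-Routing), language-side token pruning (LFP-Routing), and a coupled action-attention head (CAtten). The module names follow the article, but the shapes, the top-k scoring, the 8:1 compression wiring, and the 7-DoF action head are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CogVLA's three-stage routing; shapes and scoring are assumed.
import torch
import torch.nn as nn


class EFARouting(nn.Module):
    """Instruction-driven visual aggregation: keep 1/ratio of the visual tokens."""
    def __init__(self, dim: int, ratio: int = 8):
        super().__init__()
        self.ratio = ratio
        self.score = nn.Linear(dim, 1)  # simplified instruction-conditioned saliency score

    def forward(self, vis: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens, instr: (B, D) pooled instruction embedding
        logits = self.score(vis + instr.unsqueeze(1)).squeeze(-1)            # (B, N)
        k = max(1, vis.shape[1] // self.ratio)
        idx = logits.topk(k, dim=1).indices                                  # top-k salient tokens
        return torch.gather(vis, 1, idx.unsqueeze(-1).expand(-1, -1, vis.shape[-1]))


class LFPRouting(nn.Module):
    """Language-side pruning: drop low-salience multimodal tokens before action decoding."""
    def __init__(self, dim: int, keep: float = 0.5):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        logits = self.score(tokens).squeeze(-1)                              # (B, T)
        k = max(1, int(tokens.shape[1] * self.keep))
        idx = logits.topk(k, dim=1).indices
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))


class CAtten(nn.Module):
    """Coupled attention: learned action queries attend to the pruned multimodal context."""
    def __init__(self, dim: int, n_steps: int = 8, action_dim: int = 7, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_steps, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, action_dim)  # e.g. one 7-DoF action per chunk step (assumed)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(ctx.shape[0], -1, -1)
        out, _ = self.attn(q, ctx, ctx)
        return self.head(out)                                                # (B, n_steps, action_dim)


if __name__ == "__main__":
    B, N, T, D = 2, 256, 32, 64
    vis, instr, text = torch.randn(B, N, D), torch.randn(B, D), torch.randn(B, T, D)
    vis_small = EFARouting(D, ratio=8)(vis, instr)            # 256 -> 32 visual tokens (8x compression)
    ctx = LFPRouting(D)(torch.cat([vis_small, text], dim=1))  # prune the fused token stream
    actions = CAtten(D)(ctx)                                  # coherent action chunk
    print(actions.shape)                                      # torch.Size([2, 8, 7])
```

The point of the sketch is the ordering: visual tokens are compressed under instruction guidance before entering the language model, pruned again inside it, and only then decoded into a coherent action sequence, which is where the reported efficiency gains would come from.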
Humanoid Robot Bridges the Gap Between Visual Perception and Motion for the First Time; UC Berkeley Chinese PhD Student Gives a Live Demo on the Unitree G1
量子位· 2025-06-25 05:00
Core Viewpoint
- The article discusses the LeVERB framework, developed by teams from UC Berkeley and Carnegie Mellon University, which enables humanoid robots to understand language commands and perform complex actions in new environments without prior training [1][3].

Group 1: LeVERB Framework Overview
- LeVERB bridges the gap between visual-semantic understanding and physical movement, allowing robots to perceive their environment and execute commands the way humans do [3][12].
- The framework is a hierarchical dual system that uses a "latent action vocabulary" as the interface between high-level understanding and low-level action execution [17][20] (a minimal conceptual sketch of this interface follows after this summary).
- The high-level component, LeVERB-VL, processes visual and language inputs to generate abstract latent commands, while the low-level component, LeVERB-A, translates these commands into executable whole-body actions [23][24].

Group 2: Performance and Testing
- The framework was tested on the Unitree G1 robot, achieving an 80% zero-shot success rate on simple visual navigation tasks and an overall task success rate of 58.5%, a 7.8× improvement over traditional methods [10][36].
- LeVERB-Bench, a benchmark for humanoid whole-body control (WBC), includes over 150 tasks and aims to provide realistic training data for vision-language-action models [7][26].
- The benchmark covers diverse tasks such as navigation, reaching, and sitting, with 154 vision-language tasks and 460 language-only tasks in total, yielding extensive realistic motion-trajectory data [30][31].

Group 3: Technical Innovations
- The framework uses techniques such as ray tracing for realistic scene rendering and motion-capture data to improve the quality of the training datasets [27][30].
- Training optimizes the model through trajectory reconstruction and adversarial classification, ensuring efficient processing of visual-language information [23][24].
- Ablation studies indicate that components such as the discriminator and the kinematics encoder are crucial for maintaining model performance and improving generalization [38].
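A minimal sketch, under assumed dimensions, of what a hierarchical dual system with a latent-action-vocabulary interface could look like: a high-level vision-language module (a stand-in for LeVERB-VL) emits a compact latent command, and a low-level policy (a stand-in for LeVERB-A) maps that latent plus proprioception to joint-level targets. The layer sizes, the 32-dimensional latent, the 45-dimensional proprioception vector, and the 23 joint targets are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a hierarchical latent-command interface; all dimensions are assumed.
import torch
import torch.nn as nn


class HighLevelVL(nn.Module):
    """Stand-in for LeVERB-VL: fuse vision + language features into a latent action command."""
    def __init__(self, vis_dim: int = 512, txt_dim: int = 512, latent_dim: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, vis_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([vis_feat, txt_feat], dim=-1))   # (B, latent_dim)


class LowLevelPolicy(nn.Module):
    """Stand-in for LeVERB-A: map latent command + proprioception to joint-level targets."""
    def __init__(self, latent_dim: int = 32, proprio_dim: int = 45, n_joints: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + proprio_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, n_joints),
        )

    def forward(self, z: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, proprio], dim=-1))            # (B, n_joints)


if __name__ == "__main__":
    B = 4
    vis, txt = torch.randn(B, 512), torch.randn(B, 512)
    proprio = torch.randn(B, 45)
    z = HighLevelVL()(vis, txt)                   # high level runs at a slow planning rate
    joint_targets = LowLevelPolicy()(z, proprio)  # low level runs at control frequency
    print(joint_targets.shape)                    # torch.Size([4, 23])
```

The design point the sketch illustrates is the decoupling: the vision-language module only needs to produce a compact latent command, so the whole-body controller can be trained and run separately at control frequency while still being steered by language and vision.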