VLA²
VLA²: Zhejiang University and Westlake University propose an agentic VLA framework with substantially improved manipulation generalization
具身智能之心· 2025-10-24 00:40
Core Insights
- The article presents VLA², a framework designed to enhance the capabilities of vision-language-action (VLA) models, particularly in handling unseen concepts in robotic manipulation tasks [1][3][12]

Method Overview
- VLA² integrates three core modules: initial information processing, cognition and memory, and task execution [3][5]
- The framework uses GLM-4V for task decomposition and MM-GroundingDINO for object detection, and incorporates web image retrieval to enhance visual memory for unfamiliar objects [4][7] (a minimal pipeline sketch is given at the end of this summary)

Experimental Validation
- VLA² was compared with state-of-the-art (SOTA) models on the LIBERO benchmark, showing competitive results and excelling in scenarios that require strong generalization [6][9]
- In hard scenarios, VLA² achieved a 44.2% improvement in success rate over simply fine-tuning OpenVLA [9][10]

Key Mechanisms
- The framework's performance is significantly influenced by three mechanisms: visual mask injection, semantic replacement, and web retrieval [7][11]
- Ablation studies confirmed that each mechanism contributes notably to the model's performance, especially on challenging tasks [11]

Conclusion and Future Directions
- VLA² expands the cognitive and operational capabilities of VLA models to unknown objects, providing a viable solution for robotic tasks in open-world settings [12]
- Future work will focus on exploring its generalization in real-world applications and expanding support for more tools and tasks [12]
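
The summary describes an agentic loop: a VLM (GLM-4V) decomposes the instruction, MM-GroundingDINO grounds the target object, web retrieval supplies reference images for unseen concepts, and a low-level VLA policy executes each subtask. The sketch below only illustrates how such a loop could be wired together; every function name and data structure here (decompose_task, retrieve_web_images, detect_objects, execute_policy, Subtask, Memory) is a hypothetical placeholder, not the authors' actual interface.

```python
"""Minimal sketch of an agentic VLA-style pipeline, under the assumptions above."""
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    instruction: str       # low-level instruction (possibly semantically rewritten)
    target_object: str     # name of the object this subtask manipulates


@dataclass
class Memory:
    # Visual memory: object name -> reference images gathered from the web.
    reference_images: Dict[str, List[bytes]] = field(default_factory=dict)


def decompose_task(high_level_instruction: str) -> List[Subtask]:
    """Hypothetical GLM-4V wrapper: split a long-horizon instruction into subtasks."""
    raise NotImplementedError


def retrieve_web_images(object_name: str, k: int = 5) -> List[bytes]:
    """Hypothetical web-retrieval helper: fetch reference images of an unseen object."""
    raise NotImplementedError


def detect_objects(rgb_frame, object_name: str, references: List[bytes]):
    """Hypothetical MM-GroundingDINO wrapper: ground the object in the current frame,
    returning boxes/masks that can be injected into the policy's visual input."""
    raise NotImplementedError


def execute_policy(rgb_frame, instruction: str, masks) -> bool:
    """Hypothetical VLA policy call (e.g. a fine-tuned OpenVLA): act until the subtask ends."""
    raise NotImplementedError


def run_episode(instruction: str, get_frame: Callable, memory: Memory) -> bool:
    """Agentic loop: decompose, build visual memory for unknown objects, ground, act."""
    for subtask in decompose_task(instruction):
        name = subtask.target_object
        # Cognition & memory: unknown objects get reference images from the web.
        if name not in memory.reference_images:
            memory.reference_images[name] = retrieve_web_images(name)
        frame = get_frame()
        masks = detect_objects(frame, name, memory.reference_images[name])
        # Task execution: masks are injected alongside the rewritten instruction.
        if not execute_policy(frame, subtask.instruction, masks):
            return False
    return True
```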