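Nearly 300 works! King's College London x The Hong Kong Polytechnic University comprehensively deconstruct VLA models: a clear, systematic roadmap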

Core Insights - The article gives a comprehensive analysis of Vision-Language-Action (VLA) models, highlights their transformative impact on robotics, and outlines five core challenges: representation, execution, generalization, safety, and data and evaluation [1][12].

Structure and Design - The review follows a natural learning path, progressing from foundational concepts to advanced topics, so it suits both newcomers and experienced researchers [2].

Core Components of VLA Models - A VLA system consists of three main modules: perception, brain, and action, all of which have advanced significantly in recent years. Key technical choices and representative models are listed in the accompanying dataset and milestone tables [3][10].

Development Milestones - VLA has evolved from passive multimodal perception toward active embodied reasoning and control. Key models, datasets, and evaluation benchmarks are organized in a timeline and tables [8][13].

Key Challenges and Solutions - The five major challenges range from foundational capabilities to practical deployment needs. Figures illustrate their hierarchical relationships and the sub-issues under each [12][24][25][26][27].

Application Scenarios and Future Directions - Major applications include household robots, which must handle unstructured environments and long-horizon tasks, and industrial or outdoor robots, which require high-precision operation and safety compliance. Performance evaluations for related application cases are reported in the dataset and benchmark tables [29][30].

Future Trends - The focus is on developing native multimodal architectures and morphology-agnostic representations, building a closed-loop evolutionary system that combines self-supervised exploration with online reinforcement learning, and shifting evaluation from binary success rates to comprehensive diagnostic tests [29].
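As a rough illustration of the perception, brain, and action decomposition described under Core Components, the following Python sketch wires three toy stages into a single control step. Every name here (`Observation`, `Plan`, `Action`, `perceive`, `reason`, `act`, `vla_step`) is hypothetical and chosen for this example only. The stage logic is a placeholder, not how any model covered by the article works; real systems would use learned vision encoders, a language-model backbone, and a learned action decoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Output of the perception stage (toy stand-in for encoded sensor input)."""
    image_features: List[float]
    instruction: str


@dataclass
class Plan:
    """Output of the 'brain' stage: a coarse subgoal derived from the instruction."""
    subgoal: str


@dataclass
class Action:
    """Output of the action stage: a low-level command for the robot."""
    joint_deltas: List[float]


def perceive(raw_pixels: List[float], instruction: str) -> Observation:
    # Placeholder encoder: summarize the image by its mean intensity.
    if not raw_pixels:
        raise ValueError("raw_pixels must not be empty")
    mean_intensity = sum(raw_pixels) / len(raw_pixels)
    return Observation(image_features=[mean_intensity], instruction=instruction)


def reason(obs: Observation) -> Plan:
    # Placeholder reasoning: use the first word of the instruction as the subgoal.
    verb = obs.instruction.split()[0].lower()
    return Plan(subgoal=verb)


def act(plan: Plan, obs: Observation) -> Action:
    # Placeholder decoder: move up for "lift", down otherwise,
    # scaled by the single perceptual feature.
    direction = 1.0 if plan.subgoal == "lift" else -1.0
    return Action(joint_deltas=[direction * obs.image_features[0]])


def vla_step(raw_pixels: List[float], instruction: str) -> Action:
    """Run one perception -> brain -> action pass."""
    obs = perceive(raw_pixels, instruction)
    plan = reason(obs)
    return act(plan, obs)


if __name__ == "__main__":
    print(vla_step([0.5, 0.5], "Lift the cup"))
```

Keeping each stage behind its own function and data type means that, in this sketch at least, a stage can be swapped out without touching the other two.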