Latest from Tsinghua and the Qwen team! A deep dive into how the VLM affects VLA performance, and how a VLM can be converted into a VLA policy with only a small number of new parameters
具身智能之心·2026-01-11 03:02

Core Insights

- The article emphasizes the transition from vision-language understanding to embodied action planning, positioning the Vision-Language-Action (VLA) model as a key technology for embodied AI [3][10][26]
- It argues that integrating pre-trained Vision-Language Models (VLMs) into VLA models is necessary to improve the adaptability and performance of embodied agents in real-world scenarios [3][10][26]

Summary by Sections

Background

- Early embodied AI relied on specialized robot models with limited generalization, prompting a shift toward building the VLA framework on pre-trained VLMs to improve action planning [3][10]
- The relationship between VLM and VLA is defined as follows: the VLM provides cognitive understanding, and the VLA translates that understanding into executable actions [3][10]

Theoretical Foundation

- VLMs and VLAs differ fundamentally in their goals, inputs, outputs, and optimization targets, marking a paradigm shift from understanding the world to acting on it [5][6]
- VLA optimization centers on action execution success rate, whereas VLM optimization emphasizes understanding accuracy [5][6]

Necessity of Building VLA on VLM

- Leveraging pre-trained VLM knowledge improves the generalization and practicality of VLA models, significantly reducing development cost and accelerating deployment [10][26]
- Experiments show that VLA models initialized from a pre-trained VLM outperform those trained from scratch, validating this approach [10][26]

Key Components

- VLA performance is shaped by three main factors: the choice of VLM backbone, auxiliary-task fine-tuning, and module training strategy [11][12]
- Across VLM backbones spanning 1B-30B parameters, introducing learnable action query tokens effectively extracts action-relevant information from the backbone (see the action-query sketch after this summary) [12][15]

Training Strategies

- Fine-tuning on auxiliary tasks does not necessarily improve action performance, indicating that the relationship between embodied skills and action performance is complex [15][20]
- Freezing the visual encoder significantly hurts VLA performance: scores drop substantially when the encoder is not fine-tuned (see the freezing sketch after this summary) [21][22]

Inference Mechanisms

- VLA action generation follows a "cross-modal understanding to action mapping" inference process, with two main paradigms: direct mapping and enhanced reasoning (see the inference sketch after this summary) [17][19]
- Direct mapping enables efficient action generation, while enhanced reasoning optimizes the action-generation module for complex scenarios [17][19]

Evaluation Framework

- VLA evaluation benchmarks have evolved from simple to complex scenarios and from single-modal to multi-modal assessment, moving closer to real-world applications [23][24]
- Core metrics include task success rate and the average number of tasks completed, with particular attention to generalization in unseen scenarios (see the metrics sketch after this summary) [25][26]

Future Directions

- The article outlines key challenges and future research directions, including optimizing visual modules, developing adaptive architectures, and building specialized evaluation systems [27][28]
- It stresses balancing general-domain data against embodied data so that VLA adaptability improves without eroding the underlying VLM's capabilities [27][28]
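To make the "small number of new parameters" conversion concrete, here is a minimal sketch, assuming a PyTorch setup, of how a VLM backbone could be turned into a VLA policy by adding only learnable action query tokens and a small action head. All names (`ActionQueryHead`, `vlm_hidden_dim`, `num_queries`, `action_dim`) and dimensions are illustrative assumptions, not the paper's actual code.

```python
# Sketch only: learnable action queries cross-attend to VLM hidden states
# to extract action-relevant information; a small MLP maps the result to
# a continuous action. The VLM itself contributes no new parameters here.
import torch
import torch.nn as nn

class ActionQueryHead(nn.Module):
    def __init__(self, vlm_hidden_dim=2048, num_queries=8, action_dim=7):
        super().__init__()
        # New parameters: K learnable query tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vlm_hidden_dim, num_heads=8,
                                                batch_first=True)
        # Lightweight head mapping pooled query outputs to an action vector.
        self.mlp = nn.Sequential(
            nn.Linear(vlm_hidden_dim, 512), nn.GELU(), nn.Linear(512, action_dim)
        )

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: (batch, seq_len, hidden), e.g. the VLM's last layer.
        b = vlm_hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend over the full multimodal sequence.
        attn_out, _ = self.cross_attn(q, vlm_hidden_states, vlm_hidden_states)
        return self.mlp(attn_out.mean(dim=1))  # (batch, action_dim)

# Usage: feed hidden states produced by the VLM for an (image, instruction) pair.
head = ActionQueryHead()
fake_vlm_states = torch.randn(2, 256, 2048)  # stand-in for real VLM output
print(head(fake_vlm_states).shape)  # torch.Size([2, 7])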
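The visual-encoder freezing result described under Training Strategies amounts to toggling `requires_grad` on one submodule and optimizing only what remains trainable. A minimal sketch of how such an ablation is typically set up, using hypothetical module names (`vision_encoder`, `language_model`, `action_head`):

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Stand-in VLA model with hypothetical submodule names."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(32, 32)   # placeholder for a ViT
        self.language_model = nn.Linear(32, 32)   # placeholder for the LLM
        self.action_head = nn.Linear(32, 7)       # newly added action parameters

def set_visual_encoder_trainable(model: nn.Module, trainable: bool) -> None:
    # Freezing means excluding the encoder's weights from gradient updates.
    for p in model.vision_encoder.parameters():
        p.requires_grad = trainable

model = TinyVLA()
set_visual_encoder_trainable(model, False)  # the "frozen encoder" ablation
# Optimize only the parameters that remain trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

The reported finding is that the frozen setting scores substantially lower, i.e. action learning appears to require adapting the visual features themselves, not just the modules on top of them.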
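A minimal sketch contrasting the two inference paradigms named under Inference Mechanisms. The modules below are stand-ins chosen purely for illustration, not the paper's implementation: direct mapping decodes an action in a single forward pass, while enhanced reasoning rolls out intermediate latent steps before the action head.

```python
import torch
import torch.nn as nn

hidden = 64
backbone = nn.Linear(hidden, hidden)                  # stand-in for the VLM
reasoner = nn.GRU(hidden, hidden, batch_first=True)   # stand-in reasoning module
action_head = nn.Linear(hidden, 7)

def direct_mapping(obs):
    # One pass: observation features -> action. Fast, suited to simple scenes.
    return action_head(backbone(obs))

def enhanced_reasoning(obs, steps=4):
    # Produce a few intermediate latent steps, then condition the action head
    # on the final state. Costlier, aimed at complex scenarios.
    feats = backbone(obs).unsqueeze(1).repeat(1, steps, 1)
    out, _ = reasoner(feats)
    return action_head(out[:, -1])

obs = torch.randn(2, hidden)
print(direct_mapping(obs).shape, enhanced_reasoning(obs).shape)
```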
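The two core metrics under Evaluation Framework reduce to simple aggregates over rollout logs. A minimal sketch, with an assumed log format:

```python
# Sketch of the two core metrics: task success rate and average number of
# tasks completed per episode. The rollout-log schema is an assumption.
from statistics import mean

rollouts = [
    {"success": True,  "tasks_completed": 3},
    {"success": False, "tasks_completed": 1},
    {"success": True,  "tasks_completed": 2},
]

success_rate = mean(1.0 if r["success"] else 0.0 for r in rollouts)
avg_tasks = mean(r["tasks_completed"] for r in rollouts)
print(f"success rate: {success_rate:.2f}, avg tasks completed: {avg_tasks:.2f}")
```

Generalization is then measured by computing the same aggregates on rollouts from scenarios held out of training.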