GrinningFace Benchmark
Microsoft & HKUST compare multiple transfer techniques: how can VLAs effectively inherit the rich visual-semantic priors of VLMs?
具身智能之心· 2025-11-15 16:03
Core Insights
- The article introduces the GrinningFace benchmark, which probes knowledge transfer from vision-language models (VLMs) to vision-language-action (VLA) models using emoji-based tasks as a controlled testing ground [1][2][4].

Group 1: Challenges in VLA Training
- VLA training relies heavily on VLM initialization but faces three main challenges: unclear transfer effects, the risk of catastrophic forgetting, and the lack of a standardized comparison across transfer techniques [2][4].
- Existing robot datasets overlap little with VLM pre-training data, making it difficult to disentangle the contribution of "robotic action skills" from that of "VLM prior knowledge" [2].

Group 2: GrinningFace Benchmark Design
- The benchmark uses emojis as a bridge to decouple action execution from semantic recognition, so that knowledge transfer can be measured precisely [4][5]; a minimal task sketch appears after this summary.
- The standardized task requires a robotic arm to place a cube on the emoji card named in a language instruction [4].

Group 3: Evaluation Metrics
- The evaluation framework rests on two core metrics: execution success rate (SR), which quantifies whether the robot can complete the placement at all, and recognition SR, which quantifies whether it lands on the instructed emoji [5][8]; see the metric sketch below.
- The study found that different fine-tuning strategies affect knowledge transfer to different degrees, with the central tension being how to retain VLM prior knowledge while adapting to the specific task [5][11].

Group 4: Key Findings on Transfer Techniques
- Co-training, latent action prediction, and diverse pre-training data prove critical for effective knowledge transfer [7][19]; a co-training sketch follows below.
- Balancing the retention of VLM prior knowledge against adaptation to robotic actions is identified as a core principle of VLA design [19].

Group 5: Future Directions
- Future work should optimize parameter-efficient fine-tuning techniques, improve knowledge transfer efficiency, and design more complex tasks that reflect real-world applications [19].
- Fusing additional modal priors, such as tactile and auditory information, could improve VLA adaptability across environments [19].
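To make the decoupling in Group 2 concrete, here is a minimal sketch of what a GrinningFace-style episode might look like. The class and field names (`EmojiPlacementTask`, `card_emojis`, `target_emoji`) are assumptions for illustration, not the benchmark's actual format: succeeding at *placing* the cube exercises action skill, while picking the *right* card exercises the VLM's semantic priors.

```python
# Hypothetical episode spec for a GrinningFace-style task; names are illustrative.
from dataclasses import dataclass

@dataclass
class EmojiPlacementTask:
    instruction: str        # language command given to the VLA policy
    card_emojis: list[str]  # emojis printed on the cards in the scene
    target_emoji: str       # the emoji the instruction refers to

task = EmojiPlacementTask(
    instruction="Place the cube on the grinning face",
    card_emojis=["😀", "😢", "😡", "😴"],
    target_emoji="😀",
)
```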
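The two metrics from Group 3 can be sketched as follows, assuming per-trial logs record whether the cube was placed on any card and whether it landed on the instructed one (the `Trial` fields are hypothetical, not the paper's schema): execution SR counts any successful placement, while recognition SR counts placements on the correct card.

```python
# Hypothetical metric computation for GrinningFace-style evaluation logs.
from dataclasses import dataclass

@dataclass
class Trial:
    reached_any_card: bool     # cube was placed on some card (action skill)
    reached_target_card: bool  # cube landed on the instructed emoji (semantics)

def grinningface_metrics(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)
    return {
        "execution_sr": sum(t.reached_any_card for t in trials) / n,
        "recognition_sr": sum(t.reached_target_card for t in trials) / n,
    }

# Example: 8/10 placements succeeded, but only 6/10 hit the instructed emoji.
trials = [Trial(True, True)] * 6 + [Trial(True, False)] * 2 + [Trial(False, False)] * 2
print(grinningface_metrics(trials))  # {'execution_sr': 0.8, 'recognition_sr': 0.6}
```

A large gap between the two numbers is what isolates the transfer question: the policy has learned to act but has lost (or never inherited) the VLM's semantic grounding.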
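Group 4 names co-training as critical; the sketch below shows the common pattern of mixing a robot action loss with a VLM loss on vision-language data in each optimization step. The `model.action_loss` and `model.vlm_loss` methods and the 0.5 mixing weight are assumptions for illustration, not the paper's actual recipe.

```python
import torch

def co_training_step(model: torch.nn.Module,
                     robot_batch, vlm_batch,
                     optimizer: torch.optim.Optimizer,
                     vlm_weight: float = 0.5) -> float:
    """One co-training step: robot action loss plus a weighted VLM loss.

    Keeping a VLM-data term in the objective pulls the backbone toward its
    pre-trained behavior, mitigating catastrophic forgetting of
    visual-semantic priors while the model adapts to actions.
    """
    action_loss = model.action_loss(robot_batch)  # hypothetical: e.g. action-token cross-entropy
    vlm_loss = model.vlm_loss(vlm_batch)          # hypothetical: e.g. captioning / VQA loss
    loss = action_loss + vlm_weight * vlm_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```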