Interleave-VLA: the first VLA framework to support interleaved image-text instructions, with a 2-3x gain in cross-domain generalization
具身智能之心· 2025-08-05 00:03
Core Viewpoint
- The article introduces Interleave-VLA, a framework that improves robot manipulation by following interleaved image-text instructions and demonstrates significant performance gains over existing models [2][3][7]

Group 1: Interleave-VLA Framework
- Interleave-VLA is the first framework able to understand interleaved image-text instructions and generate continuous action sequences in the physical world [2] (a minimal interface sketch follows this summary)
- The framework is model-agnostic and requires only minimal modifications to current state-of-the-art VLA models, while providing strong zero-shot generalization [2][3]

Group 2: Dataset Development
- A major obstacle to implementing Interleave-VLA was the lack of a large-scale interleaved embodied dataset. To address this, an automated pipeline was built to convert the pure-text instructions in the Open X-Embodiment dataset into interleaved image-text instructions [2] (a hedged sketch of one plausible conversion step also appears after this summary)
- The resulting dataset contains 210,000 interaction samples and 13 million image frames, making it the first large-scale real-world interleaved embodied dataset [2]

Group 3: Performance Evaluation
- Comprehensive evaluation on simulation benchmarks and real-robot experiments shows that Interleave-VLA improves cross-domain generalization by 2-3x over state-of-the-art baselines [3]
- The framework offers a flexible task interface and handles diverse user-provided image instructions, such as hand-drawn sketches, in a zero-shot manner [3]

Group 4: Advantages of Interleaved Instructions
- The interleaved instruction paradigm makes effective use of heterogeneous datasets and diverse instruction images, including images sourced from the internet, indicating substantial potential for scaling [3][7]
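To make the "interleaved image-text instruction" idea concrete, below is a minimal sketch of how such an instruction could be represented and handed to a VLA policy. All names (`InterleavedInstruction`, `InterleaveVLAPolicy`, `backbone.predict`) are hypothetical illustrations, not the authors' code; the only point carried over from the article is that the instruction mixes text spans with reference images and that the underlying VLA backbone needs little modification.

```python
# Hypothetical sketch of an interleaved image-text instruction interface.
# Not the authors' implementation; names and signatures are assumptions.
from dataclasses import dataclass
from typing import List, Tuple, Union

import numpy as np

# An instruction is an ordered mix of text spans and reference images, e.g.
# ["pick up", <image of the red mug>, "and place it on", <image of the tray>].
Segment = Union[str, np.ndarray]  # str = text span, ndarray = HxWx3 image


@dataclass
class InterleavedInstruction:
    segments: List[Segment]


class InterleaveVLAPolicy:
    """Thin wrapper over a standard VLA backbone (hypothetical).

    The instruction encoder consumes image tokens in-line with text tokens,
    so the backbone itself stays largely unchanged, which is consistent with
    the article's claim that the framework is model-agnostic.
    """

    def __init__(self, backbone):
        self.backbone = backbone  # any VLA model exposing a predict() method (assumed)

    def act(self, observation: np.ndarray,
            instruction: InterleavedInstruction) -> np.ndarray:
        # Flatten the instruction into a tagged token stream.
        tokens: List[Tuple[str, Segment]] = []
        for seg in instruction.segments:
            kind = "text" if isinstance(seg, str) else "image"
            tokens.append((kind, seg))
        # The backbone sees the current camera observation plus the interleaved
        # instruction and returns a continuous action (e.g. a 7-DoF delta pose).
        return self.backbone.predict(observation, tokens)
```

A usage example would build `InterleavedInstruction(["pick up", mug_crop, "and place it on the tray"])`, where `mug_crop` could equally be a photo, an internet image, or a hand-drawn sketch, which is what gives the paradigm its flexible task interface.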
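The article states only that an automated pipeline converts pure-text Open X-Embodiment instructions into interleaved ones; it does not describe the mechanism. The sketch below shows one plausible conversion step under the assumption that the pipeline grounds the object phrase in a video frame and replaces it with an image crop. `detect_object` and the phrase handling are hypothetical stand-ins, not the authors' tooling.

```python
# Hedged sketch of one plausible text-to-interleaved conversion step.
# Assumption: the object phrase is grounded in a frame and swapped for a crop.
from typing import List, Tuple, Union

import numpy as np

Segment = Union[str, np.ndarray]


def detect_object(frame: np.ndarray, phrase: str) -> Tuple[int, int, int, int]:
    """Hypothetical open-vocabulary detector returning an (x0, y0, x1, y1) box."""
    raise NotImplementedError


def to_interleaved(instruction: str, frame: np.ndarray,
                   object_phrase: str) -> List[Segment]:
    """Replace the object phrase in a text instruction with an image crop."""
    before, _, after = instruction.partition(object_phrase)
    x0, y0, x1, y1 = detect_object(frame, object_phrase)
    crop = frame[y0:y1, x0:x1]

    segments: List[Segment] = []
    if before.strip():
        segments.append(before.strip())
    segments.append(crop)  # the named object becomes a reference image
    if after.strip():
        segments.append(after.strip())
    return segments


# e.g. "pick up the red mug and place it on the tray" ->
#      ["pick up", <crop of the red mug>, "and place it on the tray"]
```

Run over every episode in the source dataset, a step like this would yield the kind of large-scale interleaved corpus the article describes (210,000 interaction samples, 13 million frames), though the actual filtering and quality-control stages are not specified in the source.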