Vision-Language-Action Model

Long-VLA: Westlake University and Alibaba DAMO Academy jointly release the world's first end-to-end VLA model supporting long-horizon manipulation
具身智能之心 · 2025-08-29 04:00
Core Viewpoint
- Long-VLA is the first end-to-end VLA model designed specifically for long-horizon robot manipulation. It addresses the skill-chaining problem by introducing phase-aware input masks that dynamically adjust the visual modalities used in different task phases [2][4][14].

Technical Introduction
- Existing approaches to long-horizon tasks fall into three categories: end-to-end unified models, task-decomposition methods, and input-adaptive modular methods. Each has limitations when handling long, complex tasks [3][4].
- Long-VLA combines the advantages of task decomposition within a single unified architecture, dynamically adjusting perception modalities through input-level masking and thereby addressing the skill-chaining issue [4][6].

Model Description
- Long-VLA's core design comprises three components: task phase division, an input-level adaptation strategy, and unified end-to-end training. Tasks are divided into "movement phases" and "interaction phases," supported by the newly annotated L-CALVIN dataset; a rough phase-labeling sketch appears after this summary [6][8].
- The input adaptation strategy uses a binary masking mechanism to dynamically adjust the inputs visible to attention, improving continuity across sub-tasks and mitigating the distribution shift between phases; see the masking sketch after this summary [6][8].

Experimental Results
- In the extended CALVIN environment, Long-VLA significantly outperformed baseline models on long-horizon tasks, remaining stable across ten consecutive sub-tasks [8][10].
- In real-world sorting and cleaning tasks, Long-VLA performed best under varying conditions, confirming its robustness and generalization [10][12].
- Long-VLA improved the average completed task length over baseline methods, with notable gains across performance metrics [13].

Conclusion
- This work strikes a balance between end-to-end training and long-horizon adaptability, laying the groundwork for further research on long-horizon robot task execution [14].
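To make the phase-division idea concrete, here is a minimal sketch that labels each timestep of a trajectory as either a movement phase or an interaction phase. The distance heuristic, the `contact_radius` value, and the function name `segment_phases` are illustrative assumptions; the actual L-CALVIN annotation procedure is not described in this summary.

```python
import numpy as np

def segment_phases(ee_positions: np.ndarray,
                   target_position: np.ndarray,
                   contact_radius: float = 0.05) -> np.ndarray:
    """Label each timestep 0 (movement phase) or 1 (interaction phase).

    ee_positions: (T, 3) end-effector positions over one sub-task.
    target_position: (3,) position of the object being manipulated.
    contact_radius: distance threshold (meters) below which we assume
        contact-rich interaction has begun; the value is illustrative.
    """
    dists = np.linalg.norm(ee_positions - target_position, axis=-1)
    labels = (dists < contact_radius).astype(np.int64)
    # Once interaction starts, keep the label until the sub-task ends,
    # so brief retreats do not flip the phase back to "movement".
    if labels.any():
        labels[np.argmax(labels):] = 1
    return labels
```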
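And a minimal sketch of the input-level binary masking idea, assuming a PyTorch attention stack and two camera streams. The stream names, the phase-to-mask table, and `build_attention_mask` are hypothetical stand-ins, not the paper's exact configuration.

```python
import torch

# Phase-dependent 0/1 table deciding which visual streams the attention
# layers may attend to. Assumed layout: a static third-person camera and
# a wrist camera, with the wrist view only needed during interaction.
PHASE_MASKS = {
    0: {"static_cam": 1, "wrist_cam": 0},  # movement: global view only
    1: {"static_cam": 1, "wrist_cam": 1},  # interaction: add wrist view
}

def build_attention_mask(phase: int,
                         num_tokens: dict[str, int]) -> torch.Tensor:
    """Concatenate per-stream binary masks into one key-padding mask.

    Returns a (total_tokens,) bool tensor where True marks tokens that
    attention should ignore (PyTorch key_padding_mask convention).
    """
    keep = []
    for stream, n in num_tokens.items():
        keep.append(torch.full((n,), bool(PHASE_MASKS[phase][stream])))
    return ~torch.cat(keep)  # True = masked out

# Usage: during the movement phase, wrist-camera tokens are masked out.
# Expanded with a batch dimension, the result can be passed as the
# key_padding_mask of torch.nn.MultiheadAttention.
mask = build_attention_mask(phase=0,
                            num_tokens={"static_cam": 196, "wrist_cam": 196})
```

Because the mask acts purely at the input level, the same unified network serves every phase: switching masks changes what the policy attends to without swapping model weights, which matches the summary's stated mechanism for preserving end-to-end training while gaining decomposition-like behavior.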