The First Long-Horizon "VLA-World Model" Unified Model! ManualVLA Unlocks Long-Horizon Fine-Grained Manipulation Tasks
具身智能之心 (Heart of Embodied Intelligence) · 2025-12-23 03:34

Core Viewpoint
- The article introduces ManualVLA, a unified vision-language-action (VLA) model that integrates planning and action generation into a single framework, addressing long-horizon manipulation tasks that require precisely defined final states [2][5][10].

Group 1: Research Background and Challenges
- Recent VLA models have markedly improved robotic scene understanding and generalization, yet coordinating high-level planning with precise operations remains difficult in long-horizon tasks such as LEGO assembly and object rearrangement [7].
- Two main challenges are identified: executing operations precisely enough to match a predefined final configuration, and integrating long-horizon planning with fine-grained control while preserving generalization across diverse real-world environments [7][9].

Group 2: ManualVLA Method Description
- ManualVLA generates its own instruction manual and executes actions against it, decomposing a complex long-horizon task into controllable, interpretable short phases [12][19].
- The model adopts a Mixture-of-Transformers (MoT) architecture that couples a planning expert, which generates multimodal operation manuals, with an action expert that executes actions conditioned on those manuals [5][15].
- The ManualCoT reasoning mechanism combines explicit and implicit paths to steer action generation, keeping manual generation and action execution tightly coordinated [16][20].

Group 3: Experimental Results
- In real-world tasks, ManualVLA raised average success rates by approximately 32% over the latest baseline methods [28].
- The quality of generated intermediate target images was validated with PSNR (e.g., 29.01 on 2D LEGO assembly) and MAE (e.g., 3.23 on 2D LEGO assembly), indicating high fidelity in predicting target object positions [23][27].
- In simulation, ManualVLA outperformed state-of-the-art methods with a 70% average success rate, surpassing the previous best of 63% [31].

Group 4: Ablation and Generalization Experiments
- Ablation studies confirmed that every modality in the instruction manual (text, images, UV coordinates) and the implicit CoT reasoning are essential for long-horizon, goal-specific manipulation tasks [33].
- ManualVLA generalized robustly across varying backgrounds, object shapes, and lighting conditions, maintaining high success rates in unseen scenarios [36].
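The Mixture-of-Transformers idea described in Group 2 can be illustrated with a minimal, hypothetical sketch (not the authors' code): all tokens share one self-attention pass, while each token's feed-forward block is routed to either a planning expert or an action expert depending on its modality. The layer and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # single-head self-attention with identity projections, kept minimal
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

class MoTLayer:
    """One toy Mixture-of-Transformers layer: shared attention over all
    tokens, with per-token FFNs routed to a planning or action expert."""
    def __init__(self, d, d_ff):
        self.plan_ffn = (rng.normal(size=(d, d_ff)) * 0.02,
                         rng.normal(size=(d_ff, d)) * 0.02)
        self.act_ffn = (rng.normal(size=(d, d_ff)) * 0.02,
                        rng.normal(size=(d_ff, d)) * 0.02)

    def __call__(self, x, is_action):
        h = x + attention(x)  # planning and action tokens attend jointly
        out = np.empty_like(h)
        for i, tok in enumerate(h):
            w1, w2 = self.act_ffn if is_action[i] else self.plan_ffn
            out[i] = tok + np.maximum(tok @ w1, 0) @ w2  # expert FFN + residual
        return out

# toy sequence: 4 "manual" (planning) tokens followed by 2 action tokens
x = rng.normal(size=(6, 8))
is_action = np.array([False] * 4 + [True] * 2)
y = MoTLayer(8, 16)(x, is_action)
print(y.shape)  # → (6, 8)
```

The design point the sketch captures is that the two experts specialize (manual generation vs. action execution) while the shared attention lets action tokens condition on the generated manual.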
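The PSNR and MAE numbers quoted in Group 3 are standard image-fidelity metrics for the generated intermediate target images; a minimal sketch of how such metrics are typically computed, assuming 8-bit images (this is the standard definition, not code from the paper):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer images."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def mae(pred, target):
    """Mean absolute pixel error; lower means closer images."""
    return np.mean(np.abs(pred.astype(np.float64) - target.astype(np.float64)))

# toy 8-bit images differing by 10 at every pixel
a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 110, dtype=np.uint8)
print(round(psnr(a, b), 2), mae(a, b))  # → 28.13 10.0
```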
