MiVLA Model
Surpassing π0.5: MiVLA breaks through VLA models' generalization and data bottlenecks via human-robot mutual imitation pre-training
具身智能之心· 2025-12-22 01:22
Core Insights

- The article discusses the MiVLA model, which tackles the twin challenges of "data scarcity" and "weak generalization" in robot vision-language-action (VLA) models through a novel "human-robot mutual imitation pre-training" approach, enabling effective training without any real robot data [2][19]
- MiVLA combines simulated robot data with human video data to achieve superior generalization, offering a low-cost, scalable path toward general robot policy learning [2][19]

Summary by Sections

Need for Reconstructing the VLA Pre-training Paradigm

- Current VLA training faces a dual challenge: reliance on real robot data is constrained by high collection costs and narrow scene coverage, while single-modality approaches suffer from "modal gaps" [3]
- Effective VLA pre-training therefore requires a unified approach that balances data scale, behavioral fidelity, and cross-modal adaptation [3]

MiVLA's Design and Features

- MiVLA's core design aligns the human and robot action spaces through mutual imitation pre-training, merging the diversity of simulated robot data with the fidelity of human video data [5]
- Key features include:
  - Bidirectional human-robot action space mapping to overcome morphological differences [7] (see the mapping sketch after this summary)
  - Mutual imitation pre-training that exploits the complementary strengths of the two data sources [8]
  - A diffusion transformer architecture that supports continuous robot control [8] (see the pre-training sketch after this summary)
  - Lightweight, efficient training for scalable deployment [8]

Experimental Validation and Results

- MiVLA was evaluated in both simulated and real robot environments, showing significant performance gains over baseline models [9][11]
- Across 20 representative simulated tasks, MiVLA outperformed the baselines, reaching average success rates of 69% in easy mode and 66% in hard mode [10]
- On real robot tasks, MiVLA matched the performance of models pre-trained on large-scale real data while using only a medium-scale mixed dataset [11]

Generalization Capability

- MiVLA adapted robustly across different scenes, objects, and positions, achieving an average generalization success rate of 54% from only 20 demonstrations [17][18]
- Its ability to handle unseen robot embodiments and complex tasks was validated across multiple experimental setups [11][14]

Conclusion and Future Directions

- MiVLA shows that human-robot mutual imitation is the key to breaking the data bottleneck, making it possible to build a more general VLA model without real robot data [18]
- Future work will focus on improving performance in extreme out-of-distribution scenarios, integrating richer multimodal information, and expanding data coverage [18]
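To make the "bidirectional human-robot action space mapping" above concrete, here is a minimal sketch of one plausible realization, assuming a cycle-consistency formulation: two small MLPs map human motion into the robot action space and back. The article does not specify MiVLA's actual mapping, so every name, dimension, and loss below (HUMAN_DIM, ROBOT_DIM, human_to_robot, robot_to_human, cycle_loss) is a hypothetical stand-in for illustration.

```python
# Minimal sketch of a bidirectional human-robot action-space mapping.
# Hypothetical throughout: the summary does not describe MiVLA's mapping,
# so this uses a generic cycle-consistent pair of MLPs for illustration.
import torch
import torch.nn as nn

HUMAN_DIM = 48   # hypothetical, e.g. flattened hand keypoints
ROBOT_DIM = 7    # hypothetical, e.g. 6-DoF end-effector pose + gripper

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.GELU(),
        nn.Linear(hidden, out_dim),
    )

human_to_robot = mlp(HUMAN_DIM, ROBOT_DIM)  # human motion -> robot actions
robot_to_human = mlp(ROBOT_DIM, HUMAN_DIM)  # robot actions -> human motion

def cycle_loss(human_batch, robot_batch):
    """Cycle-consistency objective: each action space should round-trip
    through the other with minimal distortion, aligning the two spaces
    without requiring paired human-robot trajectories."""
    h_cycle = robot_to_human(human_to_robot(human_batch))
    r_cycle = human_to_robot(robot_to_human(robot_batch))
    return (nn.functional.mse_loss(h_cycle, human_batch)
            + nn.functional.mse_loss(r_cycle, robot_batch))

# Usage with random stand-in data:
h = torch.randn(32, HUMAN_DIM)
r = torch.randn(32, ROBOT_DIM)
loss = cycle_loss(h, r)
loss.backward()
```

A cycle term of this kind is one common way to align heterogeneous action spaces without paired trajectories; whether MiVLA uses it, or a retargeting-based mapping instead, is not stated in the summary.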
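The remaining technical points, mutual imitation over dual-source data and a diffusion transformer head for continuous control, can likewise be sketched as a DDPM-style denoiser over action chunks trained on mixed batches. Again, every shape, the toy noise schedule, and the pooled vision-language context below are assumptions for illustration, not MiVLA's published design.

```python
# Minimal sketch of mutual-imitation pre-training with a diffusion-transformer
# action head. All shapes, the schedule, and the conditioning interface are
# hypothetical; the summary only states that MiVLA mixes simulated robot data
# with human video data and uses a diffusion transformer for continuous control.
import torch
import torch.nn as nn

ACTION_DIM, CHUNK, CTX_DIM, STEPS = 7, 16, 512, 100

class DiffusionActionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(ACTION_DIM, CTX_DIM)
        self.t_embed = nn.Embedding(STEPS, CTX_DIM)
        layer = nn.TransformerEncoderLayer(CTX_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(CTX_DIM, ACTION_DIM)

    def forward(self, noisy_actions, t, context):
        # context: (B, 1, CTX_DIM) pooled vision-language features (stand-in)
        x = self.embed(noisy_actions) + self.t_embed(t)[:, None, :]
        x = self.backbone(torch.cat([context, x], dim=1))
        return self.out(x[:, 1:])  # predict the noise on the action tokens

def denoising_loss(model, actions, context):
    """DDPM-style epsilon-prediction loss on continuous action chunks."""
    t = torch.randint(0, STEPS, (actions.size(0),))
    noise = torch.randn_like(actions)
    alpha = 1.0 - t.float() / STEPS  # toy linear schedule, not the real one
    noisy = alpha[:, None, None] * actions + (1 - alpha)[:, None, None] * noise
    return nn.functional.mse_loss(model(noisy, t, context), noise)

model = DiffusionActionHead()
# A mixed batch: simulated robot actions plus human motion mapped into the
# robot action space (e.g. via a mapping like human_to_robot in the sketch
# above); random tensors stand in for both sources here.
sim_actions = torch.randn(16, CHUNK, ACTION_DIM)
mapped_human_actions = torch.randn(16, CHUNK, ACTION_DIM)
actions = torch.cat([sim_actions, mapped_human_actions], dim=0)
context = torch.randn(32, 1, CTX_DIM)
loss = denoising_loss(model, actions, context)
loss.backward()
```

Training on such mixed batches is one straightforward reading of "mutual imitation pre-training" with dual-source data; the article's actual loss weighting and conditioning details are not given in the summary.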