从机械臂到人形，跨构型VLA如何破局?

Core Insights - The article discusses two significant advancements in the field of embodied intelligence and VLA (Vision-Language Action) models, highlighting their potential to overcome existing challenges in the domain [3][7]. Group 1: VLA-Adapter - VLA-Adapter aims to improve the direct mapping from VLM (Vision-Language Model) features to action space without heavily relying on robotic data. The research team found that increasing the parameter count and introducing pre-trained robotic data did not significantly enhance model performance on general benchmarks [3]. - The new mapping scheme proposed by the team allows the model to achieve superior performance even at a 0.5 billion parameter scale, reducing training costs and lowering the entry barrier for VLA models [3]. Group 2: TrajBooster - TrajBooster is the first full-body humanoid operation VLA solution that addresses data scarcity issues for training VLA models in bipedal humanoid tasks. The scarcity arises from the high cost of remote operation data and the challenges of using existing heterogeneous robot data for training [7]. - By focusing on trajectory-centered methods, TrajBooster efficiently utilizes cross-body data, achieving full-body operation in bipedal robots with just 10 minutes of real machine remote operation data for fine-tuning [7]. Group 3: Contributors - Wang Yihao, a fourth-year PhD student at Beijing University of Posts and Telecommunications, is involved in the VLA-Adapter project and has contributed significantly to the field of embodied intelligence and VLA models [13]. - Liu Jiacheng, a second-year PhD student at Zhejiang University and West Lake University, leads the TrajBooster project, which is the only fully open-source work covering humanoid data collection, cross-body data enhancement, VLA model training, and hardware deployment [13].