VLANeXt
Want to get into VLA but don't know where to start? NTU & 中大 open-source the "ultimate recipe": from the base model to frequency-domain modeling, every step backed by experiments
量子位 · 2026-03-02 16:00
Core Insights
- The article presents VLANeXt, a model built by systematically analyzing the design space of Vision-Language-Action (VLA) models across 12 key dimensions, yielding a comprehensive "recipe" for effective model design [1][5][20]
- VLANeXt significantly outperforms a range of state-of-the-art (SOTA) methods, including 7-billion-parameter models, achieving roughly a 10% higher success rate under previously unseen conditions such as changed lighting and camera angles [1][23]

Group 1: Background and Motivation
- The rise of large foundation models has highlighted the potential of VLA models, which leverage rich visual and language understanding for scalable robot learning [5]
- The current VLA research landscape is fragmented: various models claim superior performance but lack a unified evaluation framework, motivating a return to fundamental design principles [5]

Group 2: Model Development Process
- The research team began with a baseline similar to RT-2, using LLaMA as the backbone and modeling actions through a deliberately simple architecture [7]
- Key enhancements included an independent policy module, deeper policy modeling, and action chunking to improve inference speed and model performance [9][11]

Group 3: Foundational Components
- The team found that decoupling the language and action spaces via an independent policy head significantly improved performance over reusing text tokens for action classification [9]
- The policy was deepened to 29 layers to better capture action distributions, matching the depth of the vision-language model (VLM) backbone [9]

Group 4: Perception Essentials
- Redundant historical visual information did not improve performance, so only the current frame's image is used [14]
- Multi-view inputs, combining third-person and wrist-camera perspectives, provide complementary geometric cues and improve action accuracy [14]

Group 5: Action Modeling Perspectives
- The team explored world models for action learning but rejected them due to the added training time, focusing instead on efficient modeling techniques [16]
- They introduced frequency-domain modeling via the discrete cosine transform (DCT) to enhance action prediction without significant additional training cost [16]

Group 6: Experimental Results
- VLANeXt demonstrated superior performance across benchmarks, including an average score of 99.0 on LIBERO and strong results on LIBERO-plus [21][22]
- Its robustness carried over to real-world tasks, showing strong adaptability in both single-arm and bimanual scenarios without specialized pre-training [25]
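The "decoupling" finding in Group 3 can be made concrete with a minimal sketch. It contrasts the RT-2-style route, where each action dimension is discretized into bins and predicted as extra "text" tokens, with an independent continuous policy head on the same backbone features. All dimensions, weights, and function names below are hypothetical, not VLANeXt's actual interface:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: backbone feature width and action dimension (e.g. 7-DoF).
FEAT_DIM, ACT_DIM = 64, 7

# Route A (RT-2 style): actions are quantized into N_BINS bins per dimension
# and predicted as token logits, so precision is capped by the bin count.
N_BINS = 256
W_tok = rng.normal(size=(FEAT_DIM, ACT_DIM * N_BINS)) * 0.02

def action_as_tokens(feat):
    logits = (feat @ W_tok).reshape(ACT_DIM, N_BINS)
    bins = logits.argmax(axis=1)
    return bins / (N_BINS - 1) * 2.0 - 1.0  # de-quantize back to [-1, 1]

# Route B (independent policy head): a small regression MLP on the same
# features emits continuous actions directly, decoupled from the text vocabulary.
W1 = rng.normal(size=(FEAT_DIM, 128)) * 0.02
W2 = rng.normal(size=(128, ACT_DIM)) * 0.02

def action_from_policy_head(feat):
    return np.tanh(feat @ W1) @ W2  # continuous output, no quantization error

feat = rng.normal(size=FEAT_DIM)
print(action_as_tokens(feat).shape, action_from_policy_head(feat).shape)
```

Route B is what the article credits with the performance gain: the action head is trained in its own continuous space rather than competing with language tokens for the same output vocabulary.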
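Action chunking, mentioned in Group 2 as an inference-speed enhancement, can be sketched as a thin wrapper: each expensive forward pass emits a chunk of actions that are executed open-loop before the model is queried again. The class and horizon below are illustrative assumptions, not VLANeXt's real API:

```python
from collections import deque

class ChunkedPolicy:
    """Wraps a policy so one (expensive) model call yields `horizon` actions,
    executed open-loop before re-querying. Illustrative sketch only."""

    def __init__(self, policy_fn, horizon=8):
        self.policy_fn = policy_fn   # obs -> list of `horizon` actions
        self.horizon = horizon
        self._buffer = deque()

    def act(self, obs):
        if not self._buffer:                          # chunk exhausted:
            self._buffer.extend(self.policy_fn(obs))  # one model call per chunk
        return self._buffer.popleft()

# Dummy backbone call that counts invocations to show the speedup mechanism.
calls = 0
def dummy_policy(obs):
    global calls
    calls += 1
    return [obs] * 8

policy = ChunkedPolicy(dummy_policy, horizon=8)
actions = [policy.act(step) for step in range(16)]
print(calls)  # → 2 (16 control steps, but only 2 model calls)
```

With a horizon of 8, control frequency is decoupled from model latency: the robot acts every step while the VLA backbone runs only once per chunk.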
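The frequency-domain idea in Group 5 rests on a standard property of the DCT: smooth action trajectories concentrate their energy in a few low-frequency coefficients, so those coefficients are a compact prediction target. A minimal sketch with an orthonormal DCT-II built from scratch (the trajectory and cutoff below are illustrative, not the paper's settings):

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix C (rows = frequencies), so C @ C.T = I."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    c = np.cos(np.pi / n * (t + 0.5) * k)
    c[0] *= np.sqrt(1.0 / n)
    c[1:] *= np.sqrt(2.0 / n)
    return c

# A smooth 16-step trajectory for one action dimension (illustrative data).
H = 16
t = np.linspace(0.0, 1.0, H)
traj = 0.5 * np.sin(2 * np.pi * t) + 0.1 * t

C = dct_basis(H)
coeffs = C @ traj        # forward DCT: trajectory -> frequency coefficients
low = coeffs.copy()
low[5:] = 0.0            # keep only the 5 lowest frequencies
recon = C.T @ low        # inverse DCT back to the time domain

# The truncated reconstruction stays close to the original trajectory,
# so a model can predict 5 coefficients instead of 16 raw actions.
err = np.max(np.abs(recon - traj))
print(err)
```

This also explains the "without significant additional training cost" claim: the DCT is a fixed linear transform, so it adds no learnable parameters, only a change of prediction target.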