World Model VLA! DriveVLA-W0: 70 Million Frames of Data Unlock Autonomous Driving VLA Scaling (CAS & Yinwang)

Core Insights
- The Chinese Academy of Sciences and Huawei introduce DriveVLA-W0, a training paradigm that addresses the "supervision deficit" in Vision-Language-Action (VLA) models for autonomous driving [2][5][30]
- The method adds world modeling tasks that generate dense self-supervised signals, so the model no longer learns from sparse action signals alone, and its performance improves as the training dataset scales [4][30][31]

Summary by Sections

Background
- Scaling laws offer an attractive path toward more generalizable driving intelligence, with the expectation that petabyte-scale driving data can be used to train robust foundation models [5]
- The current challenge is a mismatch between the large capacity of VLA models and their sparse supervision signals. This "supervision deficit" limits the model's ability to learn rich world representations [5][30]

DriveVLA-W0 Paradigm
- DriveVLA-W0 introduces world modeling as a strong self-supervised objective that supplements the sparse action signals, allowing the model to learn the underlying dynamics of driving environments [5][30]
- The method has been validated on two mainstream VLA architectures, where it shows significant improvements over the baseline models [4][6]

Experimental Validation
- Extensive experiments on several datasets, including a large internal dataset of 70 million frames, confirm that world modeling amplifies data scaling laws and leads to better model performance [11][30]
- A lightweight action expert based on a mixture-of-experts (MoE) architecture reduces inference latency to 63.1% of the baseline model's while maintaining strong performance [11][20]

Key Contributions
- The article identifies the "supervision deficit" as a critical bottleneck in VLA scaling and proposes the DriveVLA-W0 paradigm to address it [11][30]
- The findings reveal that as data scales up, the performance trend of action decoders reverses: simpler autoregressive models outperform more complex flow-matching models on large datasets [30][31]

Conclusion
- The research emphasizes that adopting predictive world modeling is crucial for unlocking the potential of large-scale data and achieving more generalizable driving intelligence [30][31]
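
The idea of supplementing sparse action supervision with a dense world-modeling signal can be illustrated with a minimal sketch. It assumes that the world-modeling task is next-frame token prediction and that the two losses are combined as a weighted sum. Neither detail is stated in the summary above, and the function names and the `wm_weight` parameter are hypothetical, not taken from DriveVLA-W0.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one categorical prediction, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def action_loss(action_logits, action_target):
    """Sparse signal: one action token is the only target for the frame."""
    return cross_entropy(action_logits, action_target)


def world_model_loss(frame_logits, frame_targets):
    """Dense signal: every visual token of the next frame is a target.

    Assumption: the next frame is represented as discrete tokens.
    """
    losses = [cross_entropy(l, t) for l, t in zip(frame_logits, frame_targets)]
    return sum(losses) / len(losses)


def joint_loss(action_logits, action_target, frame_logits, frame_targets,
               wm_weight=1.0):
    """Weighted sum of both objectives (wm_weight is a hypothetical knob)."""
    return (action_loss(action_logits, action_target)
            + wm_weight * world_model_loss(frame_logits, frame_targets))


# Toy inputs: 4 possible actions, 3 visual tokens with a vocabulary of 2.
toy_action_logits = [0.0, 0.0, 0.0, 0.0]
toy_frame_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
total = joint_loss(toy_action_logits, 1, toy_frame_logits, [0, 1, 0],
                   wm_weight=0.5)
print(total)
```

In this toy setup each frame contributes one action target but three visual-token targets. That is the sense in which the world-modeling term is the denser source of supervision.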