World Modeling
KAIST Team: Enhancing VLA Models with a Dual-Stream Diffusion World Model
具身智能之心· 2025-11-05 00:02
Group 1
- The core issue addressed in the article is the limitation of Vision-Language-Action models (VLAs) in modeling the impact of actions on the environment, which affects their generalization and robustness [3][4][8]
- The proposed solution is the Dual-Stream Diffusion Framework (DUST), which aims to maintain modality specificity while enabling cross-modal knowledge sharing to resolve the modal conflict in joint predictions [5][10]

Group 2
- DUST builds on diffusion-based VLA designs, focusing on semantic feature extraction, action diffusion modeling, and a reasoning process that avoids pixel-level modeling costs [9][12]
- The architecture of DUST uses a multi-modal diffusion Transformer (MMDiT) that separates the processing of the action and visual streams while allowing information exchange through cross-modal attention layers (a sketch of such a block follows below) [16][33]

Group 3
- Experimental results demonstrate that DUST outperforms state-of-the-art models in both simulated and real-world scenarios, with average success rates 18% higher than GR00T-N1.5 and 5% higher than FLARE in simulated environments with 100 demonstrations [20][25]
- DUST's ability to use unannotated video data for pre-training significantly reduces the reliance on costly robot demonstration data, achieving a 13% higher average success rate than GR00T-N1.5 in transfer-learning tasks [25][26]

Group 4
- The article highlights the importance of DUST's asynchronous joint sampling strategy, which allows prediction accuracy and inference speed to be balanced flexibly by adjusting the number of denoising steps for each modality (see the sampling sketch below) [18][28]
- The necessity of DUST's core components is validated through ablation studies, confirming that the combination of the dual-stream architecture and decoupled training is essential for optimal performance [29][30]
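To make the dual-stream design in Group 2 concrete, here is a minimal PyTorch sketch of one MMDiT-style block: each stream keeps its own normalization, QKV, and feed-forward parameters, and the two streams interact only inside a joint attention step over the concatenated tokens. All class names, dimensions, and wiring are illustrative assumptions, not code from the DUST paper.

```python
# Minimal dual-stream (MMDiT-style) block: separate per-modality parameters,
# with joint attention as the only point of cross-modal information exchange.
# Names and shapes are illustrative assumptions, not the DUST implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualStreamBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        # Modality-specific parameters (action stream vs. vision stream).
        self.norm_a, self.norm_v = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_a, self.qkv_v = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.proj_a, self.proj_v = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.ffn_a = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                   nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_v = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                   nn.GELU(), nn.Linear(4 * dim, dim))

    def _split(self, qkv: torch.Tensor):
        # (B, L, 3*dim) -> three tensors of shape (B, heads, L, head_dim)
        b, l, _ = qkv.shape
        qkv = qkv.view(b, l, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        return qkv[0], qkv[1], qkv[2]

    def forward(self, act_tok: torch.Tensor, vis_tok: torch.Tensor):
        la = act_tok.shape[1]
        qa, ka, va = self._split(self.qkv_a(self.norm_a(act_tok)))
        qv, kv, vv = self._split(self.qkv_v(self.norm_v(vis_tok)))
        # Cross-modal exchange: joint attention over both token sets.
        q, k, v = (torch.cat(pair, dim=2) for pair in ((qa, qv), (ka, kv), (va, vv)))
        out = F.scaled_dot_product_attention(q, k, v)   # (B, H, La+Lv, head_dim)
        out = out.transpose(1, 2).flatten(2)            # (B, La+Lv, dim)
        act_tok = act_tok + self.proj_a(out[:, :la])
        vis_tok = vis_tok + self.proj_v(out[:, la:])
        # Back to modality-specific processing.
        return act_tok + self.ffn_a(act_tok), vis_tok + self.ffn_v(vis_tok)


# Example: 16 action tokens and 64 latent visual tokens sharing one block.
block = DualStreamBlock()
a, v = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```

This arrangement keeps modality-specific statistics separate while still letting action tokens attend to visual tokens and vice versa, which is the role the cross-modal attention layers play in the summary above.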
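The asynchronous joint sampling in Group 4 can be pictured as running the two reverse-diffusion chains with different step counts. The sketch below is a hypothetical illustration under simplifying assumptions (a shared linear noise schedule and a toy denoiser with one linear head per stream); it is not the paper's actual sampler.

```python
# Hypothetical illustration of asynchronous joint sampling: the action chain is
# denoised at every step, while the visual chain is refreshed on a sparser
# schedule. The toy denoiser is a stand-in, not a real DUST module.
import torch
import torch.nn as nn


class ToyDualStreamDenoiser(nn.Module):
    """Stand-in denoiser with one linear head per stream (illustration only)."""

    def __init__(self, act_dim: int = 16, vis_dim: int = 64):
        super().__init__()
        self.act_head = nn.Linear(act_dim + vis_dim + 1, act_dim)
        self.vis_head = nn.Linear(act_dim + vis_dim + 1, vis_dim)

    def denoise_action(self, act, vis, t):
        t_col = torch.full(act[..., :1].shape, t)
        return act + self.act_head(torch.cat([act, vis, t_col], dim=-1))

    def denoise_vision(self, vis, act, t):
        t_col = torch.full(vis[..., :1].shape, t)
        return vis + self.vis_head(torch.cat([act, vis, t_col], dim=-1))


def async_joint_sampling(model, act_shape, vis_shape,
                         action_steps: int = 10, vision_steps: int = 2):
    """Denoise both latents jointly, but with a different step count per modality."""
    act, vis = torch.randn(act_shape), torch.randn(vis_shape)
    refresh_every = max(1, action_steps // vision_steps)
    for i in range(action_steps):
        t = 1.0 - i / action_steps                    # shared (linear) noise level
        if i % refresh_every == 0:                    # coarse schedule for vision
            vis = model.denoise_vision(vis, act, t)
        act = model.denoise_action(act, vis, t)       # fine schedule for actions
    return act, vis


with torch.no_grad():
    actions, latents = async_joint_sampling(ToyDualStreamDenoiser(), (2, 16), (2, 64))
```

Lowering `vision_steps` relative to `action_steps` trades visual-prediction fidelity for inference speed, which is the accuracy/latency balance the summary refers to.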
World-Model VLA! DriveVLA-W0: 70 Million Frames Unlock Autonomous-Driving VLA Scaling (Chinese Academy of Sciences & Yinwang)
自动驾驶之心· 2025-10-17 00:03
Core Insights
- The article discusses the DriveVLA-W0 training paradigm introduced by the Chinese Academy of Sciences and Huawei, which addresses the "supervision deficit" issue in VLA models for autonomous driving [2][5][30]
- The proposed method improves learning from sparse action signals by adding world-modeling tasks that generate dense self-supervised signals, so that model performance keeps improving as the training dataset scales [4][30][31]

Summary by Sections

Background
- Scaling laws present an attractive path toward more generalizable driving intelligence, with the expectation of using PB-level driving data to train robust foundation models [5]
- The current challenge is the mismatch between the large scale of VLA models and their sparse supervision signals, leading to a "supervision deficit" that limits the model's ability to learn rich world representations [5][30]

DriveVLA-W0 Paradigm
- The DriveVLA-W0 paradigm introduces world modeling as a strong self-supervised signal that supplements sparse action labels, allowing the model to learn the underlying dynamics of driving environments (a sketch of the combined objective follows below) [5][30]
- The method has been validated on two mainstream VLA architectures, demonstrating significant improvements over baseline models [4][6]

Experimental Validation
- Extensive experiments on several datasets, including a large internal dataset of 70 million frames, confirm that world modeling amplifies data scaling laws and improves model performance [11][30]
- A lightweight action expert based on a mixture-of-experts (MoE) architecture reduces inference latency to 63.1% of the baseline model while maintaining strong performance (see the MoE sketch below) [11][20]

Key Contributions
- The article identifies the "supervision deficit" as a critical bottleneck for VLA scaling and proposes the DriveVLA-W0 paradigm to address it [11][30]
- The findings reveal that as data scales up, the performance ranking of action decoders reverses, with simpler autoregressive models outperforming more complex flow-matching models on large datasets [30][31]

Conclusion
- The research emphasizes that adopting predictive world modeling is crucial for unlocking the potential of large-scale data and achieving more generalizable driving intelligence [30][31]
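As a rough picture of how a dense world-modeling signal can supplement sparse action supervision (the "supervision deficit" discussed above), here is a hedged sketch of a combined training objective. The token heads, loss forms, and weighting are assumptions for illustration, not the DriveVLA-W0 implementation.

```python
# Illustrative combined objective: a sparse action loss plus a dense
# world-modeling loss over next-frame visual tokens. Names, heads, and the
# weighting are assumptions, not the DriveVLA-W0 code.
import torch
import torch.nn.functional as F


def world_model_augmented_loss(action_logits, action_targets,
                               frame_logits, next_frame_tokens,
                               world_weight: float = 1.0):
    """Sparse action supervision plus a dense self-supervised world-modeling term."""
    # Action head: a handful of trajectory/waypoint tokens per sample (sparse).
    action_loss = F.cross_entropy(action_logits.flatten(0, 1),
                                  action_targets.flatten())
    # World-modeling head: predict every discretized token of the next frame (dense).
    world_loss = F.cross_entropy(frame_logits.flatten(0, 1),
                                 next_frame_tokens.flatten())
    return action_loss + world_weight * world_loss


# Example shapes: 8 action tokens vs. 1024 visual tokens per frame.
B, vocab = 4, 8192
loss = world_model_augmented_loss(
    torch.randn(B, 8, vocab, requires_grad=True),
    torch.randint(0, vocab, (B, 8)),
    torch.randn(B, 1024, vocab, requires_grad=True),
    torch.randint(0, vocab, (B, 1024)))
loss.backward()
```

The action term supervises only a few tokens per sample, while the world-modeling term supervises every next-frame token, which is where the dense self-supervised signal comes from.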
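The lightweight MoE-based action expert is described only at a high level, so the following is a generic top-1-routing mixture-of-experts layer that illustrates the mechanism: each token activates a single small expert MLP, keeping per-token compute (and hence latency) low. Sizes and routing choices are assumptions, not details from the paper.

```python
# Generic top-1-routed mixture-of-experts layer, sketched to show how a small
# routed expert keeps per-token compute low. Sizes and routing are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 512, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Each token is routed to its single best expert,
        # so only one small MLP runs per token regardless of total expert count.
        scores = F.softmax(self.gate(x), dim=-1)
        weight, idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(32, 256)   # e.g. action-query tokens entering the expert
routed = TinyMoE()(tokens)      # same shape: (32, 256)
```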