Nanyang Technological University Proposes NORA-1.5: A VLA Model Based on a World Model and Action Rewards
具身智能之心·2025-11-21 00:04

Core Insights
- The article introduces NORA-1.5, a Vision-Language-Action (VLA) model that integrates a flow-matching action expert with reward-driven Direct Preference Optimization (DPO) to address the generalization and reliability issues of existing VLA models [1][3].

Architecture and Key Issues
- The architecture centers on the joint optimization of the flow-matching expert and the VLA backbone, targeting the reliability and generalization pain points of real-world deployment [3].
- The core solution adds a flow-matching action expert on top of the pre-trained NORA backbone, together with a dual-component reward model and DPO post-training [3].

Flow-Matching Action Expert
- The standalone action expert directly regresses action sequences conditioned on the visual-language key-value pairs encoded by the VLA backbone, minimizing the difference between predicted and target velocities [5] (a minimal sketch of this objective follows at the end of this summary).
- The dual-component reward mechanism balances goal orientation and stability; the core rewards are a world-model-guided goal reward and a ground-truth action deviation reward [6][9] (see the reward-combination sketch below).

Training Process
- Training consists of two phases: joint training of the action expert and DPO post-training [7] (a generic DPO sketch is given below).
- The model builds on the Qwen-2.5-VL-3B vision-language model and is pre-trained on the Open X-Embodiment dataset, using the FAST+ action tokenizer to efficiently discretize heterogeneous action sequences [8].

Experimental Findings and Performance
- On the SimplerEnv benchmark, the model outperforms existing state-of-the-art (SOTA) models, reaching success rates of 56.0% on picking up a Coke can and 60.0% on the move-near task, with an overall average improvement of 4.9% after DPO [11].
- On the LIBERO benchmark, the model improves long-horizon task success rates by 1.0%, reaching an average of 95.0% and surpassing other SOTA models [11].

Key Differences and Real-World Evaluation
- Flow matching performs better in large-data regimes, whereas smaller-data settings benefit from more joint training [14].
- In real-robot evaluation, NORA-1.5 improves success rates by 13%-46% across nine pick-and-place tasks, with especially large gains on unseen objects and unseen instructions [15].

Reward Optimization
- Combining the WM (subgoal) reward with the GTA (ground-truth action) reward is the most stable configuration in real-world scenarios, avoiding the noise or bias of either reward alone [17].
- The subgoal reward outperforms the endgoal reward by 1.7% on average, particularly in complex environments [19].
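The following is a minimal sketch of the flow-matching objective used by an action expert: the expert predicts the velocity that moves a noisy action sample toward the ground-truth action chunk, conditioned on visual-language features from the VLA backbone. The module layout, dimensions, and linear interpolation path are illustrative assumptions, not the NORA-1.5 implementation.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Hypothetical action expert conditioned on pooled VLA backbone features."""
    def __init__(self, ctx_dim: int = 2048, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + horizon * action_dim + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def forward(self, backbone_ctx, noisy_actions, t):
        # backbone_ctx: pooled visual-language features, shape (B, ctx_dim)
        # noisy_actions: interpolated action chunk x_t, shape (B, horizon, action_dim)
        # t: flow time in [0, 1], shape (B, 1)
        x = torch.cat([backbone_ctx, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(expert, backbone_ctx, actions):
    """Regress the target velocity (actions - noise) along a linear noise-to-data path."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    # Linear interpolation between noise (t=0) and the ground-truth actions (t=1).
    x_t = (1 - t).unsqueeze(-1) * noise + t.unsqueeze(-1) * actions
    target_velocity = actions - noise
    pred_velocity = expert(backbone_ctx, x_t, t)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```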
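Below is a sketch of how the two reward signals described above could be combined: a world-model-guided (sub)goal reward and a ground-truth action deviation reward. The weighting, the cosine-similarity scoring, and the world-model interface are assumptions for illustration; the article only states that the two terms are combined to balance goal orientation and stability.

```python
import torch

def world_model_goal_reward(world_model, obs, pred_actions, subgoal_emb):
    """Score how close the world model's imagined outcome is to a (sub)goal embedding."""
    with torch.no_grad():
        future_emb = world_model(obs, pred_actions)  # hypothetical world-model rollout
    return torch.cosine_similarity(future_emb, subgoal_emb, dim=-1)

def action_deviation_reward(pred_actions, expert_actions):
    """Penalize deviation from the ground-truth (demonstration) action chunk."""
    return -torch.mean((pred_actions - expert_actions) ** 2, dim=(-2, -1))

def dual_reward(world_model, obs, pred_actions, expert_actions, subgoal_emb,
                w_goal: float = 0.5, w_dev: float = 0.5):
    """Weighted combination used to rank candidate action chunks (weights are assumed)."""
    return (w_goal * world_model_goal_reward(world_model, obs, pred_actions, subgoal_emb)
            + w_dev * action_deviation_reward(pred_actions, expert_actions))
```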
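Finally, a generic sketch of the DPO post-training step: candidate rollouts are ranked by a reward such as the dual reward above, the best and worst candidates become the (chosen, rejected) preference pair, and the standard DPO loss is applied to the policy's and a frozen reference policy's log-likelihoods. How NORA-1.5 scores action chunks is not detailed here, so sequence-level log-probabilities of tokenized action sequences are assumed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO loss on sequence-level log-probabilities of action sequences."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def build_preference_pair(candidates, rewards):
    """Pick the highest- and lowest-reward candidates as the (chosen, rejected) pair."""
    best = torch.argmax(rewards)
    worst = torch.argmin(rewards)
    return candidates[best], candidates[worst]
```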
