Workflow
Chain-of-Thought (CoT) Reasoning
HUST & Tsinghua's latest DeepThinkVLA: how to make a model that can both think and act
具身智能之心 · 2025-11-24 10:02
Core Insights
- The article presents DeepThinkVLA, a new model that addresses key challenges in the vision-language-action (VLA) domain by combining a mixed-attention decoder with a two-stage training pipeline, achieving a 97.0% average task success rate on the LIBERO benchmark and setting a new performance standard for VLA models [2][14].

Group 1: Model Architecture
- DeepThinkVLA resolves the "modal conflict" between reasoning and action by employing a mixed-attention mechanism that processes both modalities efficiently within a single decoder [4][10].
- The model dynamically switches between causal attention for reasoning generation and bidirectional attention for action generation, significantly reducing inference latency while improving performance (a minimal mask sketch follows this summary) [4][10].

Group 2: Training Methodology
- Training follows a two-stage pipeline combining supervised fine-tuning (SFT) with reinforcement learning (RL), which strengthens the model's reasoning capabilities while ensuring effective action execution [6][8].
- The SFT phase builds foundational reasoning skills through a carefully designed data-augmentation pipeline, yielding a dataset of 273,465 annotated frames [10][12].

Group 3: Innovations and Mechanisms
- Two key innovations are highlighted: a probabilistic decomposition of reasoning and action, and an error-recovery mechanism that lets the model self-correct during execution (see the decomposition sketch below) [10][11].
- The reward design combines task-success rewards with format-regularization rewards, focusing on the final success of the task while minimizing interference from intermediate reasoning semantics (see the reward sketch below) [11][12].

Group 4: Performance Evaluation
- DeepThinkVLA outperforms existing models across a range of tasks, achieving an average success rate of 97.0%, including 99.0% on Object tasks and 96.4% on Goal tasks [14][15].
- The model shows stronger robustness than the best autoregressive baselines, demonstrating its effectiveness in complex robotic manipulation [15][16].

Group 5: Future Directions
- Future work may integrate additional sensory modalities, extend to more complex collaborative tasks, optimize efficiency, and build larger datasets to improve generalization [23][24].
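To make the "dynamic switching" in Group 1 concrete, here is a minimal sketch (not the authors' code) of a mixed attention mask: reasoning tokens are decoded under a causal mask, while the action chunk at the end of the sequence attends bidirectionally. The token layout and the position of the split point are assumptions for illustration only.

```python
# Minimal sketch of a mixed attention mask, assuming reasoning tokens come
# first and an action chunk occupies the tail of the sequence.
import torch

def mixed_attention_mask(seq_len: int, action_start: int) -> torch.Tensor:
    """Return a boolean mask (True = may attend) of shape [seq_len, seq_len].

    Positions [0, action_start) are prompt/reasoning tokens -> causal attention.
    Positions [action_start, seq_len) are action tokens -> they attend to the
    full reasoning prefix and to every other action token (bidirectional).
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Action tokens may attend to the whole prefix and to each other.
    mask[action_start:, :] = True
    return mask

if __name__ == "__main__":
    m = mixed_attention_mask(seq_len=8, action_start=5)
    print(m.int())  # last 3 rows are all ones: bidirectional action block
```

One mask per forward pass is enough here, which is why a single decoder can serve both modalities without running two separate passes.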
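The "probabilistic decomposition of reasoning and action" in Group 3 plausibly corresponds to factoring the action policy through an intermediate chain of thought. A generic form of such a factorization is written below; the symbols (o for the visual observation plus instruction, c for the reasoning chain, a for the action chunk) are notation assumed for illustration, not taken from the paper.

```latex
% Assumed notation: o = observation + instruction, c = chain-of-thought, a = action chunk.
p_\theta(a, c \mid o) \;=\; p_\theta(c \mid o)\, p_\theta(a \mid c, o)
```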
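The reward description in Group 3 (task-success reward plus format regularization, with intermediate reasoning semantics left unscored) can be sketched as below. This is an assumption-laden illustration, not the paper's implementation: the tag names and the 0.1 weight are hypothetical.

```python
# Sketch of an outcome-based RL reward: sparse task success plus a small
# format-regularization term that only checks the reasoning is well-formed
# (e.g. wrapped once in <think>...</think>), never its semantic content.
import re

def rl_reward(task_succeeded: bool, generated_text: str,
              format_weight: float = 0.1) -> float:
    # Task-success reward: granted only if the episode ends in success.
    success_reward = 1.0 if task_succeeded else 0.0
    # Format reward: exactly one well-formed reasoning block must be present.
    well_formed = len(re.findall(r"<think>.*?</think>", generated_text, re.DOTALL)) == 1
    format_reward = 1.0 if well_formed else 0.0
    return success_reward + format_weight * format_reward

if __name__ == "__main__":
    print(rl_reward(True, "<think>plan: grasp the mug</think> ACTION_TOKENS"))  # 1.1
    print(rl_reward(False, "no reasoning tags emitted"))                        # 0.0
```

Keeping the format term small relative to the success term matches the stated goal of rewarding final task success while avoiding pressure on the content of the intermediate reasoning.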