Core Insights

- The article introduces ThinkAct, a dual-system framework that strengthens the reasoning capabilities of multi-modal large language models (MLLMs) in physical environments by connecting high-level reasoning with low-level action execution [4][9][12]
- ThinkAct targets a limitation of existing vision-language-action (VLA) models, which struggle with long-horizon planning and adaptation to complex tasks, by introducing reinforced visual latent planning [4][6][9]

Group 1: Framework and Methodology

- ThinkAct formulates VLA reasoning as a structured task: the model receives visual observations and textual instructions, reasons about the task, and then predicts actions, linking abstract planning to low-level control (a minimal sketch of this dual-system loop follows the summary) [12][21]
- The framework applies reinforcement learning to strengthen the MLLM's reasoning, encouraging it to reason through the task before generating low-level actions [13][19]
- A novel action-aligned visual feedback mechanism captures long-term goals and encourages visually grounded plans during the planning process (sketched below) [14][18]

Group 2: Performance Evaluation

- ThinkAct delivers strong results across robot manipulation tasks, reaching a top success rate of 84.4% on the LIBERO benchmark and outperforming models such as DiT-Policy and CoT-VLA [25][26]
- On SimplerEnv, ThinkAct beats baseline action models by clear margins, with overall scores of 71.5%, 65.1%, and 43.8% across the three evaluation settings [25]
- The framework also excels at embodied reasoning tasks, showing advantages in long-horizon and multi-step planning on the EgoPlan-Bench2 and RoboVQA benchmarks [26][27]

Group 3: Qualitative Insights

- Qualitative examples illustrate ThinkAct's reasoning process and execution: the model decomposes instructions into meaningful sub-goals and visualizes its planned trajectories [30][31]
- Reinforcement-learning fine-tuning markedly improves the model's reasoning over the cold-start (supervised-only) model, yielding a better understanding of tasks and environments [31][32]

Group 4: Adaptability and Error Correction

- ThinkAct adapts effectively in the few-shot regime, generalizing to unseen environments and new skills from only a handful of demonstrations (a fine-tuning sketch appears below) [35][37]
- The framework can detect execution errors and perform self-correction, using structured reasoning to reconsider the task and generate a corrective plan after a failure (see the closed-loop sketch below) [37][38]
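To make Group 1's dual-system split concrete, here is a minimal sketch of the reason-then-act loop: a slow reasoning module compresses the observation and instruction into a visual latent plan, and a fast action model conditions on that latent to emit a chunk of low-level actions. All module names, feature dimensions, and the 8-step action chunk are illustrative assumptions, not ThinkAct's actual architecture.

```python
import torch
import torch.nn as nn


class ReasoningMLLM(nn.Module):
    """Stand-in for the reasoning MLLM: fuses observation and instruction
    features into a compact visual latent plan."""

    def __init__(self, obs_dim=512, text_dim=512, plan_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, plan_dim),
        )

    def forward(self, obs_feat, text_feat):
        return self.fuse(torch.cat([obs_feat, text_feat], dim=-1))


class ActionModel(nn.Module):
    """Stand-in for the low-level policy: conditions on the observation
    and the latent plan to predict a short chunk of actions."""

    def __init__(self, obs_dim=512, plan_dim=64, act_dim=7, horizon=8):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.head = nn.Sequential(
            nn.Linear(obs_dim + plan_dim, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim * horizon),
        )

    def forward(self, obs_feat, plan):
        out = self.head(torch.cat([obs_feat, plan], dim=-1))
        return out.view(-1, self.horizon, self.act_dim)


reasoner, actor = ReasoningMLLM(), ActionModel()
obs_feat = torch.randn(1, 512)   # e.g. frozen vision-encoder features
text_feat = torch.randn(1, 512)  # e.g. instruction embedding
plan = reasoner(obs_feat, text_feat)   # slow path: reason into a latent plan
actions = actor(obs_feat, plan)        # fast path: decode executable actions
print(actions.shape)                   # torch.Size([1, 8, 7])
```

Decoupling the two paths this way lets the reasoning module run at a low frequency while the action model executes at control rate, which is the usual motivation for dual-system designs.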
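The action-aligned visual feedback of Group 1 can be read as a reward over a predicted 2D visual trajectory: one term scores whether the plan's endpoints match a demonstration (the long-term goal), another scores the whole path. The goal/trajectory decomposition follows the article's description; the specific distances, the exponential shaping, and the equal weights below are assumptions.

```python
import numpy as np


def goal_reward(pred, demo):
    """Reward for matching the start and end points of the demo path."""
    d = np.linalg.norm(pred[0] - demo[0]) + np.linalg.norm(pred[-1] - demo[-1])
    return float(np.exp(-d))  # 1.0 when the endpoints coincide


def trajectory_reward(pred, demo):
    """Reward for matching the full path, via mean pointwise distance
    (assumes both paths are resampled to the same length)."""
    d = np.linalg.norm(pred - demo, axis=-1).mean()
    return float(np.exp(-d))


def visual_plan_reward(pred, demo, w_goal=0.5, w_traj=0.5):
    """Combined action-aligned reward over a predicted 2D trajectory."""
    return w_goal * goal_reward(pred, demo) + w_traj * trajectory_reward(pred, demo)


demo = np.linspace([0.0, 0.0], [1.0, 1.0], num=16)          # demonstration path
pred = demo + np.random.normal(scale=0.05, size=demo.shape)  # noisy prediction
print(round(visual_plan_reward(pred, demo), 3))
```

A dense reward like this gives the RL stage a learning signal for plan quality without requiring rollouts in the environment for every update.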
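For the few-shot adaptation highlighted in Group 4, a common recipe, assumed here rather than confirmed by the article, is to freeze the reasoning module and fine-tune only the lightweight action model on the handful of available demonstrations:

```python
import torch
import torch.nn as nn

# Frozen "reasoner" and trainable "actor" as plain stand-in modules.
reasoner = nn.Linear(1024, 64)        # obs+instruction features -> latent plan
actor = nn.Linear(512 + 64, 7 * 8)    # obs features + plan -> action chunk
for p in reasoner.parameters():
    p.requires_grad_(False)           # keep the reasoning module fixed

opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
demos = [(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 7 * 8))
         for _ in range(10)]          # ten (obs, instruction, action) demos

for epoch in range(5):
    for obs, text, expert in demos:
        plan = reasoner(torch.cat([obs, text], dim=-1))
        pred = actor(torch.cat([obs, plan], dim=-1))
        loss = nn.functional.mse_loss(pred, expert)  # behavior cloning loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```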
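Finally, the error detection and self-correction behavior in Group 4 amounts to a closed loop: execute the current plan, and on failure re-invoke the reasoning step on the failed state to produce a corrective plan. The toy 1-D environment, failure check, and planner below are hypothetical stand-ins that only illustrate the control flow.

```python
import random

GOAL = 10


class ToyEnv:
    """Toy 1-D world: reach position GOAL; actions occasionally misfire."""

    def __init__(self):
        self.pos = 0

    def step(self, action):
        self.pos += action if random.random() > 0.2 else -action
        done = self.pos == GOAL
        failed = self.pos < 0 or self.pos > GOAL  # out of bounds
        return self.pos, done, failed


def plan(state):
    """Stand-in 'reasoner': decompose the remaining distance into unit steps."""
    return [1] * (GOAL - state) if state < GOAL else [-1] * (state - GOAL)


def run_with_self_correction(env, max_replans=5):
    state = env.pos
    for _ in range(max_replans + 1):
        for action in plan(state):          # slow path: replan from here
            state, done, failed = env.step(action)
            if done:
                return True                 # task succeeded
            if failed:
                break                       # abort this chunk, replan below
    return False


random.seed(0)
print(run_with_self_correction(ToyEnv()))  # True
```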
NVIDIA's latest! ThinkAct: few-shot adaptation and long-horizon planning for complex embodied tasks
具身智能之心·2025-07-24 09:53