ManualVLA
Real-robot RL is on a tear: a robot teaches itself to a perfect score in 20 minutes, and the digital twin ascends to god tier
36Kr · 2026-02-13 07:32
[Editor's note] TwinRL scans a scene once with a phone to build a digital twin, lets the robot explore boldly and run precise trial-and-error inside the twin, then returns to the real machine and reaches a 100% success rate across the entire tabletop in 20 minutes: 30% faster than existing methods, with human interventions cut by more than half.

What happens at the moment a robot truly "walks out of its demonstration data"?

You spend two weeks teleoperating a robot arm hand over hand, teaching it to pick up a banana and place it on a plate. On the left half of the table it performs convincingly, succeeding nine times out of ten.

Then you move the banana 15 centimeters to the right. The arm freezes. It is not that it "didn't learn well"; it has simply never seen that position. To the arm, the right half of the table is another universe. This is not a joke; it is the real predicament of almost every VLA model in the real world in 2025.

Over the past two years, Vision-Language-Action (VLA) models have swept robotics. From "look at the picture + follow the instruction + move the hand" to generalized execution across tasks and scenes, VLA made robots look, for the first time, like agents that "understand the world". Papers routinely report success rates above 90%, and the demo videos look gorgeous.

But anyone who has actually run real-robot experiments knows there is a question everyone privately understands and almost no one answers head-on: without a human demonstrating over and over, can the robot still learn by itself? The answer: almost not at all.

The cruel part of reality: ... But none of that is the most fatal issue. The most fatal is this: RL's exploration space is boxed in by the SFT demonstration data ...
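The editor's note above outlines a two-stage pipeline: reconstruct the scene as a digital twin, run cheap RL exploration inside it, then fine-tune briefly on the physical robot. As a hedged illustration only, the Python sketch below shows the shape of such a loop using a gymnasium-style step()/reset() interface; Env, Policy, train_in_twin, and finetune_on_real are hypothetical stand-ins, not TwinRL's actual API.

```python
# Hypothetical sketch of a "train in the twin, then fine-tune on the real
# robot" loop, mirroring the pipeline described in the excerpt above.
# Env and Policy are stand-ins, NOT TwinRL's actual interfaces.
from typing import Protocol, Tuple

class Env(Protocol):
    def reset(self) -> Tuple[object, dict]: ...
    def step(self, action) -> Tuple[object, float, bool, bool, dict]: ...

class Policy(Protocol):
    def act(self, obs, explore: bool): ...
    def update(self, obs, action, reward: float, next_obs) -> None: ...

def train_in_twin(policy: Policy, twin_env: Env, steps: int = 100_000) -> None:
    """Stage 1: bold exploration in the reconstructed scene; failures are free."""
    obs, _ = twin_env.reset()
    for _ in range(steps):
        action = policy.act(obs, explore=True)
        next_obs, reward, terminated, truncated, _ = twin_env.step(action)
        policy.update(obs, action, reward, next_obs)
        obs = next_obs
        if terminated or truncated:
            obs, _ = twin_env.reset()

def finetune_on_real(policy: Policy, real_env: Env, episodes: int = 30) -> None:
    """Stage 2: short, conservative rollouts on the physical arm to close
    the sim-to-real gap (the article's "20 minutes on the real robot")."""
    for _ in range(episodes):
        obs, _ = real_env.reset()
        done = False
        while not done:
            action = policy.act(obs, explore=False)
            next_obs, reward, terminated, truncated, _ = real_env.step(action)
            policy.update(obs, action, reward, next_obs)
            obs = next_obs
            done = terminated or truncated
```

The design point the article stresses is the asymmetry between the two stages: exploration (and its failures) is pushed into the twin where it costs nothing, so the real-robot stage only needs a short, low-intervention polish.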
The first unified long-horizon "VLA-World Model"! ManualVLA unlocks long-horizon fine-grained manipulation tasks
具身智能之心 · 2025-12-23 03:34
Core Viewpoint
- The article introduces ManualVLA, a unified VLA model that improves robotic manipulation and task execution by integrating planning and action generation into a single framework, targeting long-horizon tasks that require precisely defined final states [2][5][10].

Group 1: Research Background and Challenges
- Recent VLA models have significantly improved robotic scene understanding and generalization, yet coordinating high-level planning with precise operation remains difficult for long-horizon tasks such as LEGO assembly and object rearrangement [7].
- Two main challenges are identified: executing precise operations that match a predefined final configuration, and integrating long-term planning with fine-grained control while preserving generalization in diverse real-world environments [7][9].

Group 2: ManualVLA Method Description
- ManualVLA lets the model generate its own instruction manual and execute actions against it, decomposing complex long-horizon tasks into controllable, interpretable short phases [12][19].
- The model employs a Mixture-of-Transformers (MoT) architecture, pairing a planning expert that generates multimodal operation manuals with an action expert that executes actions conditioned on those manuals [5][15].
- The ManualCoT reasoning mechanism combines explicit and implicit paths to shape action generation, keeping manual generation and action execution tightly coordinated [16][20].

Group 3: Experimental Results
- In real-world tasks, ManualVLA raised the average success rate by roughly 32% over the latest baseline methods [28].
- The quality of generated intermediate target images was validated with PSNR (e.g., 29.01 on 2D LEGO assembly) and MAE (e.g., 3.23 on 2D LEGO assembly), indicating high fidelity and accurate prediction of target object positions [23][27]; a worked sketch of both metrics follows this summary.
- In simulation, ManualVLA outperformed state-of-the-art methods with a 70% average success rate, surpassing the previous best of 63% [31].

Group 4: Ablation and Generalization Experiments
- Ablation studies confirmed that every modality in the instruction manual (text, images, UV coordinates) and the implicit CoT reasoning are all necessary for long-horizon, goal-specified manipulation tasks [33].
- ManualVLA generalized robustly across varying backgrounds, object shapes, and lighting conditions, maintaining high task success rates even in unseen scenarios [36].
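PSNR and MAE are standard image- and regression-quality measures rather than anything specific to ManualVLA. As a point of reference only (these are the textbook definitions, not the paper's evaluation script, and the summary does not state the units of the reported MAE), they can be computed as follows:

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).
    Higher is better; ~29 dB (as reported for 2D LEGO assembly)
    indicates a close match between generated and target images."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error; for position prediction, lower means the
    predicted target object positions sit closer to ground truth."""
    return float(np.mean(np.abs(pred.astype(np.float64) - target.astype(np.float64))))
```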
Peking University releases ManualVLA: the first unified long-horizon "generation-understanding-action" model, autonomously generating an instruction manual from the final state and completing the manipulation
机器之心 · 2025-12-18 09:08
Core Insights
- The article discusses the limitations of existing VLA models on long-horizon tasks that require precisely defined final states, such as LEGO assembly and object rearrangement, highlighting the need for a more integrated approach [2][9].
- A new model called ManualVLA is introduced, combining planning and action generation into a unified framework and improving the efficiency and effectiveness of robotic manipulation tasks [3][5].

Group 1: Research Background and Challenges
- Recent advancements in VLA models have significantly contributed to the development of general embodied intelligence, but coordinating high-level planning with precise control for long-horizon tasks remains unsolved [9].
- Existing hierarchical methods generalize poorly to unseen final states and often rely on manually crafted instructions or human demonstration videos, which inflates system complexity and deployment cost while limiting generalization [9].

Group 2: ManualVLA Methodology
- ManualVLA generates its own instructions and executes actions based on them, breaking complex long-horizon tasks down into manageable steps [10][12].
- The model employs a Mixture-of-Transformers (MoT) architecture, integrating a planning expert that generates multimodal operation manuals with an action expert that executes the task conditioned on them [5][14]; a minimal sketch of the MoT idea follows this summary.

Group 3: Experimental Results
- ManualVLA improved real-world success rates by roughly 32% on average over the latest baseline methods [7][28].
- In experiments on 2D LEGO assembly, 3D LEGO assembly, and object rearrangement, the model produced high-quality intermediate images and kept the mean absolute error (MAE) of predicted target object positions low [24][27].

Group 4: Training Phases
- Training proceeds in three phases: pre-training on a large dataset of robotic trajectories, generating manual data with a digital-twin tool for 3D reconstruction, and fine-tuning on real-world expert demonstration trajectories [19][20][21].

Group 5: Generalization and Robustness
- ManualVLA exhibits robust generalization, maintaining high success rates under varying backgrounds, object shapes, and lighting conditions, and outperforming baseline models in these scenarios [33][37].
- Ablation studies confirm that both the explicit and implicit reasoning paths are essential for optimal performance on long-horizon tasks [33].
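Both summaries describe the MoT design only at a high level: a planning expert and an action expert with their own parameters, coordinated inside one model. The PyTorch sketch below illustrates the general Mixture-of-Transformers idea (expert-specific weights, joint attention over both experts' tokens); the layer sizes, names, and single-block structure are assumptions for illustration, not ManualVLA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Private (expert-specific) parameters: norms, QKV/output projections, FFN."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

class MoTBlock(nn.Module):
    """Two experts keep separate weights, but self-attention runs jointly over
    the concatenated sequence, so the action expert can read the planner's
    manual tokens (and vice versa) without sharing parameters."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.dim, self.heads = dim, heads
        self.plan, self.act = Expert(dim), Expert(dim)

    def _heads(self, expert: Expert, x: torch.Tensor):
        b, n, _ = x.shape
        q, k, v = expert.qkv(expert.norm1(x)).chunk(3, dim=-1)
        split = lambda t: t.view(b, n, self.heads, self.dim // self.heads).transpose(1, 2)
        return split(q), split(k), split(v)

    def forward(self, plan_tok: torch.Tensor, act_tok: torch.Tensor):
        n_plan = plan_tok.shape[1]
        qp, kp, vp = self._heads(self.plan, plan_tok)
        qa, ka, va = self._heads(self.act, act_tok)
        # Joint attention: every token attends over both experts' tokens.
        o = F.scaled_dot_product_attention(
            torch.cat([qp, qa], dim=2),
            torch.cat([kp, ka], dim=2),
            torch.cat([vp, va], dim=2),
        )
        o = o.transpose(1, 2).reshape(plan_tok.shape[0], -1, self.dim)
        plan_tok = plan_tok + self.plan.out(o[:, :n_plan])
        act_tok = act_tok + self.act.out(o[:, n_plan:])
        plan_tok = plan_tok + self.plan.ffn(self.plan.norm2(plan_tok))
        act_tok = act_tok + self.act.ffn(self.act.norm2(act_tok))
        return plan_tok, act_tok

# Example: 10 hypothetical "manual" tokens and 6 action tokens in one block.
block = MoTBlock()
plan = torch.randn(2, 10, 256)
act = torch.randn(2, 6, 256)
plan, act = block(plan, act)  # shapes preserved: (2, 10, 256), (2, 6, 256)
```

The design choice worth noting is that, unlike a mixture-of-experts router, an MoT assigns tokens to experts by role (planning vs. action) rather than learning the assignment, which matches the summaries' description of a planner that writes the manual and an actor that executes it.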