Vision-Language-Action (VLA) Models
a16z's Latest Insights: Five Gaps Embodied Intelligence Must Cross to Go from Demo to Deployment
36Kr · 2026-01-16 14:02
Core Insights
- The article discusses the challenges the robotics industry faces in moving from research to practical deployment, arguing that the real bottleneck lies in the production system rather than in the strength of the models themselves [2][10].

Group 1: Current State of Robotics
- The robotics industry has advanced significantly over the last decade, particularly with the emergence of Vision-Language-Action (VLA) models, which integrate semantic understanding with robotic control [5].
- Despite the research progress, real-world deployment of these technologies remains limited; most industrial robots still perform highly deterministic tasks [10][11].
- The gap between research and deployment is characterized by a lack of integration between research labs and industrial systems, producing a disconnect in capabilities [12][13].

Group 2: Factors Limiting Deployment
- Five key barriers to the widespread adoption of embodied intelligence are identified: distribution shift causing performance drops, reliability thresholds, computation and latency constraints, system integration issues, and maintenance complexity [10][14][17][21][24].
- Performance metrics from research settings do not translate to production environments, where variations in conditions can drastically reduce success rates [15].
- Production systems demand high reliability, whereas research optimizes for peak performance, creating a fundamental divide [18].

Group 3: Solutions and Future Directions
- To bridge the gap between research and deployment, the industry needs infrastructure akin to DevOps in software, focused on data collection and operational reliability [28].
- Robotics is likely to evolve in an ecosystem manner, with general capabilities refined for specific tasks, expanding application boundaries over time [31].
- The competition between the U.S. and China in robotics is framed as a race to solve deployment challenges, where the ability to convert technological advantage into economic value is crucial for future success [32].
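The reliability-threshold barrier can be made concrete with a small back-of-the-envelope calculation. This is our own illustration of the general point, not a figure from the article: per-step success rates that look strong on a benchmark compound away when a deployed task chains many steps.

```python
# Illustrative only: how per-step success compounds over a multi-step task.
def task_success_rate(per_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps succeed."""
    return per_step ** n_steps

# A 95% per-step rate looks good in a benchmark table...
print(task_success_rate(0.95, 20))   # ...but falls below 36% over 20 steps.
# Production-grade end-to-end reliability requires far higher per-step rates:
print(task_success_rate(0.999, 20))
```

This is the arithmetic behind the research/production divide the article describes: research maximizes per-episode scores, while deployment needs the compounded end-to-end figure to stay high.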
The First Unified Long-Horizon "VLA-World Model"! ManualVLA Unlocks Long-Horizon Fine-Grained Manipulation Tasks
具身智能之心 · 2025-12-23 03:34
Core Viewpoint
- The article introduces ManualVLA, a unified VLA model designed to enhance robotic manipulation and task execution by integrating planning and action generation into a single framework, addressing long-horizon tasks that require precise final-state definitions [2][5][10].

Group 1: Research Background and Challenges
- Recent advancements in VLA models have significantly improved robotic scene understanding and generalization, yet coordinating high-level planning with precise operations remains difficult for long-horizon tasks such as LEGO assembly and object rearrangement [7].
- Two main challenges are identified: operations must align precisely with predefined final configurations, and long-term planning must be integrated with fine-grained control while preserving generalization in diverse real-world environments [7][9].

Group 2: ManualVLA Method Description
- ManualVLA lets the model generate its own instruction manual and execute actions based on it, breaking complex long-horizon tasks into controllable, interpretable short phases [12][19].
- The model employs a Mixture-of-Transformers (MoT) architecture, integrating a planning expert that generates multimodal operation manuals with an action expert that executes actions based on those manuals [5][15].
- The ManualCoT reasoning mechanism combines explicit and implicit paths to condition action generation, keeping manual generation and action execution tightly coordinated [16][20].

Group 3: Experimental Results
- In real-world tasks, ManualVLA demonstrated a significant improvement in success rates, with an average increase of approximately 32% over the latest baseline methods [28].
- The model's intermediate target images were validated with metrics such as PSNR (e.g., 29.01 on 2D LEGO assembly) and MAE (e.g., 3.23 on 2D LEGO assembly), indicating high fidelity and accurate prediction of target object positions [23][27].
- ManualVLA outperformed state-of-the-art methods in simulation tasks, achieving a 70% average success rate versus the previous best of 63% [31].

Group 4: Ablation and Generalization Experiments
- Ablation studies confirmed that all modalities of information in the instruction manual (text, images, UV coordinates) and the implicit CoT reasoning are essential for long-horizon, goal-specific operational tasks [33].
- ManualVLA exhibited robust generalization under varying backgrounds, object shapes, and lighting conditions, maintaining high task success rates even in unseen scenarios [36].
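The PSNR and MAE figures quoted above follow the standard image-metric definitions, which can be reproduced against one's own data in a few lines of NumPy. This is a generic sketch of those definitions, not the paper's evaluation code, and the function names are ours.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the target image."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error; lower means smaller average per-pixel deviation."""
    return float(np.mean(np.abs(pred.astype(np.float64) - target.astype(np.float64))))
```

On 8-bit images, a PSNR around 29 dB (as reported for 2D LEGO assembly) corresponds to a root-mean-square pixel error of roughly 9 gray levels out of 255.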
Peking University Releases ManualVLA: The First Unified Long-Horizon "Generation-Understanding-Action" Model, Autonomously Generating a Manual from the Final State and Completing Manipulation
机器之心 · 2025-12-18 09:08
Core Insights
- The article discusses the limitations of existing VLA models on long-horizon tasks that require precise final-state definitions, such as LEGO assembly and object rearrangement, highlighting the need for a more integrated approach [2][9].
- A new model, ManualVLA, is introduced, combining planning and action generation into a unified framework and improving the efficiency and effectiveness of robotic manipulation tasks [3][5].

Group 1: Research Background and Challenges
- Recent advancements in VLA models have contributed significantly to general embodied intelligence, but coordinating high-level planning with precise control for long-horizon tasks remains challenging [9].
- Existing hierarchical methods struggle to generalize to unseen final states and often rely on manually crafted instructions or human demonstration videos, limiting them in system complexity, deployment cost, and generalization capability [9].

Group 2: ManualVLA Methodology
- ManualVLA lets the model generate its own instructions and execute actions based on them, breaking complex long-horizon tasks into manageable steps [10][12].
- The model employs a Mixture-of-Transformers (MoT) architecture, integrating a planning expert that generates multimodal operation manuals with an action expert that executes tasks based on those manuals [5][14].

Group 3: Experimental Results
- ManualVLA demonstrated a significant improvement in success rates for real-world tasks, with an average increase of approximately 32% over the latest baseline methods [7][28].
- In experiments covering 2D LEGO assembly, 3D LEGO assembly, and object rearrangement, the model produced high-quality intermediate images and maintained a low mean absolute error (MAE) in predicting target object positions [24][27].

Group 4: Training Phases
- Training proceeds in three phases: pre-training on a large dataset of robotic trajectories, using a digital-twin tool for 3D reconstruction and manual data generation, and fine-tuning on real-world expert demonstration trajectories [19][20][21].

Group 5: Generalization and Robustness
- ManualVLA exhibits robust generalization, maintaining high success rates under varying backgrounds, object shapes, and lighting conditions, and outperforming baseline models in these scenarios [33][37].
- Ablation studies confirm that both explicit and implicit reasoning paths are essential for optimal performance on long-horizon tasks [33].
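Based on the descriptions above, ManualVLA's generate-manual-then-execute control flow can be sketched schematically. Everything here (names, types, the stub logic) is hypothetical scaffolding to show the decomposition into short phases; it is not the actual ManualVLA implementation, in which both roles are transformer experts inside one MoT model.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ManualStep:
    """One phase of the self-generated manual: text + goal image + UV coordinate."""
    text: str
    goal_image: str            # placeholder for a generated intermediate goal image
    uv: Tuple[float, float]    # predicted UV coordinate of the target object

def generate_manual(final_state_image: str, n_steps: int) -> List[ManualStep]:
    # Stand-in for the planning expert: the real model generates these
    # multimodal steps conditioned on the desired final state.
    return [ManualStep(f"step {i}: place next part", f"goal_{i}.png", (0.5, 0.5))
            for i in range(n_steps)]

def execute_step(step: ManualStep) -> bool:
    # Stand-in for the action expert, conditioned on a single manual step.
    return True

def run_task(final_state_image: str, n_steps: int = 3) -> bool:
    """Decompose a long-horizon task into short phases and execute them in order."""
    manual = generate_manual(final_state_image, n_steps)
    return all(execute_step(step) for step in manual)
```

The point of the structure is the one the articles emphasize: each phase is short, interpretable, and carries all three modalities (text, goal image, UV coordinates) that the ablations found essential.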