PhysicalAgent: Toward a Foundation World-Model Framework for General-Purpose Cognitive Robots
具身智能之心 · 2025-09-22 00:03
Core Viewpoint
- The article discusses PhysicalAgent, a robotic control framework designed to overcome key limitations in current robot manipulation, specifically the limited robustness and generalizability of vision-language-action (VLA) models and of world-model-based methods [2][3].

Group 1: Key Bottlenecks and Solutions
- Current VLA models require task-specific fine-tuning, leading to a significant drop in robustness when switching robots or environments [2].
- World-model-based methods depend on specially trained predictive models, which limits their generalizability because they need carefully curated training data [2].
- PhysicalAgent integrates iterative reasoning, diffusion video generation, and closed-loop execution to achieve cross-modal, cross-task general manipulation [2].

Group 2: Framework Design Principles
- Perception and reasoning modules remain independent of any specific robot embodiment; only lightweight skeletal-detection models are needed per robot [3].
- Video generation models, pre-trained on vast multimodal datasets, can be integrated quickly without local training [5].
- The framework mirrors human-like reasoning, generating visual representations of actions from textual instructions alone [5].
- The architecture demonstrates cross-modal adaptability, generating manipulation behaviors for different robot embodiments without retraining [5].

Group 3: VLM as the Cognitive Core
- A vision-language model (VLM) serves as the cognitive core, driving a multi-step process of instruction interpretation, environment interaction, and execution [6].
- Action generation is redefined as conditional video synthesis rather than direct control-policy learning [6].
- The robot adaptation layer, which converts generated action videos into motor commands, is the only component requiring robot-specific tuning [6].

Group 4: Experimental Validation
- Two types of experiments validated the framework's cross-modal generalization and its robustness under iterative execution [8].
- The first experiment compared the framework against task-specific baselines and tested its ability to generalize across robot embodiments [9].
- The second assessed iterative execution on physical robots, demonstrating the effectiveness of the "Perceive→Plan→Reason→Act" pipeline [12] (see the code sketch after this summary).

Group 5: Key Results
- The framework achieved an 80% final success rate across tasks on both the bimanual UR3 and the humanoid G1 [13][16].
- First-attempt success rates were 30% for UR3 and 20% for G1, with an average of 2.25 and 2.75 iterations to success, respectively [16].
- Iterative correction substantially improved task completion, with the share of unfinished tasks dropping sharply within the first few iterations [16].
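To make the closed-loop structure concrete, below is a minimal Python sketch of the "Perceive→Plan→Reason→Act" loop as the summary describes it. This is an illustration under stated assumptions, not the PhysicalAgent authors' implementation: every component name here (capture_observation, vlm_reason, generate_action_video, adaptation_layer, vlm_verify) is a hypothetical stub standing in for the real modules.

```python
# Hypothetical sketch of the iterative "Perceive -> Plan -> Reason -> Act"
# loop described in the summary above. All components are placeholder
# stubs, not the PhysicalAgent authors' actual API; only the control
# flow (re-plan and re-execute until the VLM verifies success) is the
# point being illustrated.

from dataclasses import dataclass


@dataclass
class Verdict:
    success: bool
    feedback: str  # VLM critique, fed into the next iteration


# --- Hypothetical component stubs ------------------------------------
def capture_observation() -> dict:
    # Perceive: camera frames plus robot state.
    return {"rgb": None, "skeleton": None}


def vlm_reason(instruction: str, obs: dict, feedback: str) -> str:
    # Plan/Reason: the VLM refines a textual action plan, folding in
    # feedback from earlier failed attempts.
    return f"{instruction} | adjusted for: {feedback or 'first attempt'}"


def generate_action_video(plan: str, obs: dict) -> list:
    # A pretrained diffusion video model synthesizes the desired
    # manipulation as video, conditioned on the plan and the scene
    # (action generation as conditional video synthesis).
    return ["frame0", "frame1"]


def adaptation_layer(video: list) -> list:
    # The only embodiment-specific module: maps the generated video
    # (e.g., via lightweight skeletal detection) to motor commands.
    return ["cmd0", "cmd1"]


def execute(commands: list) -> None:
    # Act: send motor commands to the robot.
    pass


def vlm_verify(instruction: str, obs: dict) -> Verdict:
    # Verify: the VLM judges task completion from a fresh observation.
    return Verdict(success=True, feedback="")


# --- Closed-loop execution --------------------------------------------
def run_task(instruction: str, max_iterations: int = 5) -> bool:
    feedback = ""
    for _ in range(max_iterations):
        obs = capture_observation()                    # Perceive
        plan = vlm_reason(instruction, obs, feedback)  # Plan / Reason
        video = generate_action_video(plan, obs)       # video synthesis
        execute(adaptation_layer(video))               # Act
        verdict = vlm_verify(instruction, capture_observation())
        if verdict.success:
            return True
        feedback = verdict.feedback                    # iterate on critique
    return False
```

The design point the summary emphasizes is visible in the sketch: only adaptation_layer would need robot-specific tuning, while perception, reasoning, and video generation are shared across embodiments.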
Unitree's Wang Xingxing Drops a "Hot Take": What Lessons for Intelligent Driving?
36Kr · 2025-08-11 23:58
"VLA模型是相对傻瓜式的架构。" 2025年8月9日,在北京举办的2025世界机器人大会上,宇树科技的创始人、CEO兼CTO王兴兴在演讲中这样说道。 尽管他是针对具身智能大模型发表这一看法的,但对于当前智能驾驶最热门模型方向,不得不让人有些错愕。 包括极佳视界的CEO黄冠也在吐槽他的观点"太业余"。 大会上,他从核心瓶颈、新兴技术引擎及未来技术重心三个方面,对具身智能机器人的发展态势进行梳理与分析。我们不妨看看,这位大红人的观点,有 什么启发。 从技术层面而言,人形机器人的硬件,诸如灵巧手和整机等,已足够满足基本需求,尽管在工程实施上仍存在诸多挑战,但已能够支撑基础功能的实现。 他认为,限制其大规模应用的核心瓶颈,在于具身智能的 AI 大模型尚未成熟。 王兴兴认为,目前的机器人大模型(具身智能)发展阶段,类似ChatGPT 发布前的1~3年,即业界已明确方向和技术路线,但尚未突破关键临界点。 在王兴兴看来,之所以没达到关键临界点,主要是由于行业对"数据" 的关注度过高,却忽视了模型本身的问题。 核心瓶颈:模型不够好 谈及机器人未大规模应用的原因,很多人误认为是硬件性能不足或成本过高。但王兴兴指出,当前机器人 ...