The PEVA Model

LeCun Releases His Latest World Model: First to Achieve 16-Second Coherent Scene Prediction, Embodied Intelligence Masters the First-Person View! And, in an About-Face, It Uses a VAE
量子位· 2025-06-30 06:38
Core Viewpoint
- Yann LeCun, a prominent figure in AI and deep learning, is behind a new model called PEVA, which aims to enhance embodied agents' predictive capabilities so that they can anticipate the outcomes of actions much as humans do [2][10].

Group 1: PEVA Model Development
- The PEVA model enables embodied agents to learn predictive abilities, achieving coherent scene prediction for up to 16 seconds [2][6].
- The model couples a structured action representation, built from 48-dimensional kinematic data of human joints, with a conditional diffusion Transformer (see the sketch after this summary) [3][20].
- PEVA takes first-person video and full-body pose trajectories as inputs, moving away from abstract control signals [4][12].

Group 2: Technical Innovations
- The model addresses the computational cost and latency of long-sequence action prediction through random time skips and attention across historical frames [5][24].
- PEVA captures both overall body movement and fine joint motion through high-dimensional structured data, which traditional models fail to represent accurately [16][18].
- The architecture encodes motion along a hierarchical kinematic tree, ensuring translation and rotation invariance [25].

Group 3: Performance Metrics
- PEVA outperforms baseline models across tasks, with lower LPIPS and FID values indicating higher perceptual similarity and better generation quality (see the evaluation sketch after this summary) [33][35].
- In single-step prediction, PEVA reaches an LPIPS of 0.303 and an FID of 62.29, demonstrating its effectiveness against the CDiT baseline [33][35].
- The model's ability to predict visual changes within 2 seconds and to generate coherent video for up to 16 seconds marks a significant advance in embodied AI [40].

Group 4: Practical Applications
- PEVA can plan intelligently by evaluating multiple candidate action sequences and selecting the most appropriate one, mimicking human trial-and-error planning [42].
- These capabilities could enable more capable robotic systems, such as vacuum cleaners that anticipate obstacles and navigate more effectively [51].
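To make the 48-dimensional action representation concrete, here is a minimal sketch of how such a vector could be assembled from two consecutive pose frames, assuming the layout quoted above (3 dimensions of global pelvis displacement plus 3 Euler-angle deltas for each of 15 upper-body joints). The function name, frame format, and joint ordering are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

NUM_JOINTS = 15  # upper-body joints, per the article

def build_action_vector(pelvis_t, pelvis_t1, joints_t, joints_t1):
    """Pack one PEVA-style action as a 48-D vector (hypothetical layout).

    pelvis_t, pelvis_t1: (3,) root positions at frames t and t+1
    joints_t, joints_t1: (15, 3) per-joint Euler angles in radians
    """
    pelvis_delta = pelvis_t1 - pelvis_t        # global body motion
    joint_delta = joints_t1 - joints_t         # local joint rotation changes
    # wrap angle differences into [-pi, pi) so deltas stay well-behaved
    joint_delta = (joint_delta + np.pi) % (2 * np.pi) - np.pi
    action = np.concatenate([pelvis_delta, joint_delta.reshape(-1)])
    assert action.shape == (3 + NUM_JOINTS * 3,)  # 3 + 45 = 48 dims
    return action

# usage with two random "frames"
rng = np.random.default_rng(0)
action = build_action_vector(
    rng.normal(size=3), rng.normal(size=3),
    rng.uniform(-np.pi, np.pi, (NUM_JOINTS, 3)),
    rng.uniform(-np.pi, np.pi, (NUM_JOINTS, 3)),
)
print(action.shape)  # (48,)
```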
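The LPIPS and FID numbers in Group 3 can be reproduced in spirit with standard open-source metrics. Below is a minimal evaluation sketch using the `lpips` and `torchmetrics` packages; it is generic measurement code with placeholder tensors, not the paper's evaluation pipeline, and a real run would aggregate over many predicted/ground-truth frame pairs.

```python
import torch
import lpips                                                  # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance   # pip install torchmetrics

# Placeholder batches: 4 predicted and 4 ground-truth frames in [-1, 1].
pred = torch.rand(4, 3, 256, 256) * 2 - 1
real = torch.rand(4, 3, 256, 256) * 2 - 1

# LPIPS: lower = perceptually closer to ground truth (expects [-1, 1] inputs).
lpips_fn = lpips.LPIPS(net="alex")
lpips_score = lpips_fn(pred, real).mean().item()

# FID: lower = generated frame distribution closer to the real one.
# torchmetrics expects uint8 images in [0, 255] by default; a toy batch of 4
# gives a statistically meaningless value but exercises the API.
fid = FrechetInceptionDistance(feature=2048)
fid.update((real * 127.5 + 127.5).to(torch.uint8), real=True)
fid.update((pred * 127.5 + 127.5).to(torch.uint8), real=False)
fid_score = fid.compute().item()

print(f"LPIPS: {lpips_score:.3f}  FID: {fid_score:.2f}")
```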
UCLA Proposes PEVA: The World-Model Era for Embodied Agents
具身智能之心· 2025-06-30 03:47
Author: Yutong Bai et al. | Editor: 具身智能之心

Background and Motivation

This paper tackles a fundamental challenge for embodied agents: understanding the relationship between physical action and visual perception. Humans actively reshape their first-person visual input through whole-body movements (such as turning or reaching), and this coupling is critical for environment interaction and long-horizon planning. Existing world models (for example, navigation models driven by velocity control) have significant limitations that hinder agents' ability to interact physically in real-world scenes. This study proposes the PEVA model, the first to use full-body 3D pose as a conditioning signal for predicting first-person video, giving embodied intelligence a more physically grounded simulation environment.

Core Innovations

1. Structured whole-body action representation
Key breakthrough: an action is defined as a 48-dimensional vector fusing global body motion (pelvis displacement) with local joint rotations (Euler-angle changes of 15 upper-body joints), with a kinematic tree preserving the hierarchical relations between joints (a minimal sketch of this idea follows below).
1. Oversimplified action representation: most models adopt low ...
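To illustrate why encoding rotations along a kinematic tree preserves hierarchy and yields translation/rotation invariance, here is a minimal sketch: each joint stores its rotation relative to its parent, so moving or rotating the whole body at the root changes none of the local values. The skeleton and joint names below are illustrative placeholders, not the paper's actual joint set.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Illustrative upper-body kinematic tree: joint -> parent (None = root).
PARENT = {
    "spine": None, "chest": "spine", "neck": "chest", "head": "neck",
    "l_shoulder": "chest", "l_elbow": "l_shoulder", "l_wrist": "l_elbow",
    "r_shoulder": "chest", "r_elbow": "r_shoulder", "r_wrist": "r_elbow",
}

def local_to_global(local_euler):
    """Accumulate per-joint local Euler rotations down the tree.

    Each joint's angles are stored relative to its parent, so translating
    or rotating the whole body at the root changes no local value: the
    representation is invariant to global translation and rotation.
    """
    global_rot = {}
    for joint, parent in PARENT.items():  # parents are listed before children
        rot = R.from_euler("xyz", local_euler[joint])
        global_rot[joint] = rot if parent is None else global_rot[parent] * rot
    return global_rot

# usage: a small uniform local rotation on every joint
local = {j: np.deg2rad([10.0, 0.0, 5.0]) for j in PARENT}
print(local_to_global(local)["l_wrist"].as_euler("xyz", degrees=True))
```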