LeCun发布最新世界模型：首次实现16秒连贯场景预测，具身智能掌握第一视角！还打脸用了VAE

Core Viewpoint - Yann LeCun, a prominent figure in AI and deep learning, is focusing on developing a new model called PEVA, which aims to enhance embodied agents' predictive capabilities, allowing them to anticipate actions similarly to humans [2][10]. Group 1: PEVA Model Development - The PEVA model enables embodied agents to learn predictive abilities, achieving coherent scene predictions for up to 16 seconds [2][6]. - The model integrates structured action representation with 48-dimensional kinematic data of human joints and a conditional diffusion Transformer [3][20]. - PEVA utilizes first-person perspective video and full-body pose trajectories as inputs, moving away from abstract control signals [4][12]. Group 2: Technical Innovations - The model addresses computational efficiency and delay issues in long-sequence action prediction through random time jumps and cross-historical frame attention [5][24]. - PEVA captures both "overall movement" and "fine joint movements" using high-dimensional structured data, which traditional models fail to represent accurately [16][18]. - The architecture employs a hierarchical tree structure for motion encoding, ensuring translation and rotation invariance [25]. Group 3: Performance Metrics - PEVA outperforms baseline models in various tasks, showing lower LPIPS and FID values, indicating higher visual similarity and better generation quality [33][35]. - In single-step predictions, PEVA's LPIPS value is 0.303, and FID is 62.29, demonstrating its effectiveness compared to the CDiT baseline [33][35]. - The model's ability to predict visual changes within 2 seconds and generate coherent videos for up to 16 seconds marks a significant advancement in embodied AI [40]. Group 4: Practical Applications - PEVA can intelligently plan actions by evaluating multiple options and selecting the most appropriate sequence, mimicking human trial-and-error planning [42]. - The model's capabilities could lead to more efficient robotic systems, such as vacuum cleaners that can anticipate obstacles and navigate more effectively [51].