EgoTwin

First-ever synchronized generation of first-person video and human motion! New framework breaks through two major technical barriers in viewpoint-motion alignment
量子位· 2025-10-01 01:12
The generated video can be lifted into a 3D scene with 3D Gaussian Splatting, using camera poses derived from the generated human motion (a hedged sketch of this derivation follows this excerpt).

闻乐, reporting from 凹非寺
量子位 | 公众号 QbitAI

AI has become adept at generating third-person video, but first-person generation is still far from mature. To address this, the National University of Singapore, Nanyang Technological University, the Hong Kong University of Science and Technology, and the Shanghai AI Laboratory have jointly released EgoTwin, the first framework to achieve joint generation of first-person video and human motion. In one stroke it overcomes the two bottlenecks of viewpoint-motion alignment and causal coupling, opening a new entry point for wearable computing, AR, and embodied intelligence.

EgoTwin is a diffusion-based framework that jointly generates first-person video and human motion in a viewpoint-consistent and causally coherent manner. The details follow.

Synchronized generation of first-person video and human motion

Core challenge: the "dilemma" of first-person generation

First-person video is, in essence, a visual record driven by human motion: head movement determines the camera's position and orientation, while full-body motion shapes body pose and the changes in the surrounding scene. The two are intrinsically coupled and cannot be separated. Conventional video generation methods struggle to accommodate this property and face two main difficulties:

1. Viewpoint alignment. The camera trajectory in the generated video must precisely match the head trajectory derived from the human motion, but existing methods mostly rely on preset camera parameters to generate ...
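As an illustration of the two points above, both the camera pose used to lift frames into a 3D Gaussian Splatting scene and the head trajectory against which viewpoint alignment is checked can be read off the head joint of the generated motion. The following is a minimal sketch, assuming the motion provides per-frame head-joint rotations and positions in world coordinates; the fixed head-to-camera offset `T_HEAD_TO_CAM` and the function names are illustrative assumptions, not part of the EgoTwin release.

```python
import numpy as np

# Hypothetical fixed transform from the head joint to the egocentric camera
# (e.g. glasses mounted slightly in front of and above the head joint).
T_HEAD_TO_CAM = np.eye(4)
T_HEAD_TO_CAM[:3, 3] = [0.0, 0.08, 0.10]  # illustrative offset in metres

def head_pose_to_extrinsics(head_rot, head_pos):
    """Convert a world-frame head pose into a world-to-camera extrinsic matrix.

    head_rot: (3, 3) rotation of the head joint in world coordinates
    head_pos: (3,)   position of the head joint in world coordinates
    """
    head_to_world = np.eye(4)
    head_to_world[:3, :3] = head_rot
    head_to_world[:3, 3] = head_pos
    cam_to_world = head_to_world @ T_HEAD_TO_CAM
    return np.linalg.inv(cam_to_world)  # world-to-camera extrinsics

def motion_to_camera_trajectory(head_rots, head_positions):
    """Per-frame extrinsics for a whole clip: (T, 3, 3), (T, 3) -> (T, 4, 4)."""
    return np.stack([
        head_pose_to_extrinsics(R, p) for R, p in zip(head_rots, head_positions)
    ])
```

A trajectory computed this way could be handed to an off-the-shelf 3D Gaussian Splatting pipeline as the camera path, or compared against the camera trajectory estimated from the generated video to quantify viewpoint misalignment.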
Core Viewpoint
- The article discusses EgoTwin, a framework that jointly generates first-person videos and human motion in a synchronized manner, overcoming significant challenges in viewpoint-motion alignment and causal coupling.

Group 1: Challenges in First-Person Video Generation
- First-person video is, in essence, a visual record driven by human motion: head movement determines camera position and orientation, while full-body motion affects body pose and changes in the surrounding scene [9].
- Two main challenges are identified:
  1. Viewpoint alignment: the camera trajectory in the generated video must precisely match the head trajectory derived from the human motion [10].
  2. Causal interaction: each visual frame provides spatial context for subsequent human motion, and newly generated motion in turn alters the visual frames that follow [10].

Group 2: Innovations of EgoTwin
- EgoTwin uses a diffusion Transformer architecture to build a "text-video-action" tri-modal joint generation framework, addressing the above challenges through three key innovations [12].
- The first innovation is a three-branch architecture in which the action branch interacts with only the lower half of the text and video branches, ensuring effective cross-modal interaction [13].
- The second innovation is a head-centric action representation that anchors the motion directly to the head joint, aligning it precisely with the first-person observations [20] (a hedged sketch of one such re-parameterization appears below).
- The third innovation is an asynchronous diffusion training scheme that accommodates the different sampling rates of the video and action modalities, balancing efficiency and generation quality [22] (a hedged sketch of per-modality timestep sampling appears below).

Group 3: Performance Evaluation
- EgoTwin was validated on the Nymeria dataset, which contains 170,000 five-second "text-video-action" triplets captured with first-person Aria glasses [32].
- The evaluation metrics combine standard video- and action-quality indicators with newly proposed cross-modal consistency metrics [31].
- Results show that EgoTwin significantly outperforms baseline models across nine metrics, with notable improvements in viewpoint alignment error and hand-visibility consistency [32][33].

Group 4: Applications and Implications
- Beyond reducing cross-modal errors, EgoTwin provides a foundational generation platform for wearable interaction, AR content creation, and embodied-agent simulation [34].
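The summary mentions a head-centric action representation that anchors motion to the head joint [20]. The exact parameterization used by EgoTwin is not spelled out here, so the following is only a minimal sketch of the general idea, assuming the motion is available as world-space joint positions plus per-frame head-joint rotations; the skeleton layout and `head_index=15` are illustrative assumptions, not EgoTwin's convention.

```python
import numpy as np

def to_head_centric(joint_positions, head_rots, head_index=15):
    """Re-express world-space joints in each frame's head coordinate frame.

    joint_positions: (T, J, 3) world-space joint positions
    head_rots:       (T, 3, 3) world-space head-joint rotations
    head_index:      index of the head joint in the skeleton (assumed here)
    Returns:         (T, J, 3) head-centric joint positions
    """
    head_pos = joint_positions[:, head_index:head_index + 1, :]  # (T, 1, 3)
    offsets = joint_positions - head_pos                          # translate to head origin
    # Rotate offsets into the head frame: x_head = R_head^T @ x_world
    return np.einsum('tji,tkj->tki', head_rots, offsets)
```

Representing motion this way ties every pose directly to the wearer's viewpoint, which is what makes the alignment with first-person observations straightforward to enforce.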
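The summary also mentions asynchronous diffusion training that adapts to the different sampling rates of the video and action modalities [22]. One generic way to read this is to sample independent noise timesteps per modality during training; the sketch below shows that interpretation only, with `model` as a placeholder joint denoiser and the shared noise schedule as my assumption, not EgoTwin's published recipe.

```python
import torch
import torch.nn.functional as F

def async_diffusion_step(model, video_latent, action_latent, text_emb, alphas_cumprod):
    """One training step with independent diffusion timesteps per modality.

    video_latent:   (B, ...) latent for the video branch (e.g. low-fps latents)
    action_latent:  (B, ...) latent for the action branch (e.g. higher-fps motion)
    alphas_cumprod: (num_steps,) cumulative noise schedule shared by both branches
    """
    B = video_latent.shape[0]
    num_steps = alphas_cumprod.shape[0]

    # Independent timesteps per modality: the two branches are noised asynchronously.
    t_video = torch.randint(0, num_steps, (B,), device=video_latent.device)
    t_action = torch.randint(0, num_steps, (B,), device=action_latent.device)

    def add_noise(x, t):
        a = alphas_cumprod[t].view(-1, *([1] * (x.dim() - 1)))
        noise = torch.randn_like(x)
        return a.sqrt() * x + (1 - a).sqrt() * noise, noise

    noisy_video, eps_video = add_noise(video_latent, t_video)
    noisy_action, eps_action = add_noise(action_latent, t_action)

    # Placeholder joint denoiser: predicts each branch's noise, conditioned on both
    # noisy branches, the text embedding, and both timesteps.
    pred_video, pred_action = model(noisy_video, noisy_action, text_emb, t_video, t_action)
    return F.mse_loss(pred_video, eps_video) + F.mse_loss(pred_action, eps_action)
```

Decoupling the timesteps lets each modality keep its own temporal resolution and noise level during training instead of forcing both onto one schedule, which is one plausible reading of how the scheme trades efficiency against generation quality.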