Asynchronous Diffusion Mechanism

EgoTwin: A World Model That, for the First Time, Generates Embodied Video and Action in the Same Frame, Precisely Aligned in Time and Space
具身智能之心 · 2025-08-28 01:20
Core Viewpoint
- The article discusses the EgoTwin framework, which enables the simultaneous generation of first-person (egocentric) videos and human actions with precise alignment in both time and space, and which has attracted significant attention in the AR/VR, embodied intelligence, and wearable device sectors [2][5].

Summary by Sections

Introduction
- The EgoTwin framework was developed collaboratively by institutions including the National University of Singapore, Nanyang Technological University, the Hong Kong University of Science and Technology, and the Shanghai Artificial Intelligence Laboratory [2].

Key Highlights
- EgoTwin combines first-person video generation with human action generation, addressing challenges such as aligning the camera trajectory with head movement and establishing a causal loop between observation and action [8].
- It is trained on a large dataset of 170,000 clips of first-person, multimodal real-world scenes, yielding significant performance improvements, including a 48% reduction in trajectory error and a 125% increase in hand-visibility F-score [8].

Method Innovations
- EgoTwin introduces three core technologies:
1. A head-centric action representation that directly provides the head's 6D pose, reducing alignment errors (see the pose-token sketch below) [12].
2. A bidirectional causal attention mechanism that enables causal interactions between action tokens and video tokens (see the attention-mask sketch below) [12].
3. An asynchronous diffusion mechanism that keeps the modalities synchronized while allowing independent noising and denoising on separate timelines (see the training-step sketch below) [12].

Technical Implementation
- The model employs a three-channel diffusion architecture, improving computational efficiency by reusing only the layers the action branch actually needs [13].
- Training proceeds in three phases: the action VAE is trained first, followed by alignment training, and finally joint fine-tuning across all three modalities (text, video, and action) [21].

Data and Evaluation
- EgoTwin is trained and tested on the Nymeria dataset, which comprises 170,000 five-second video clips covering a wide range of daily actions [17].
- A comprehensive evaluation suite measures generation quality and cross-modal consistency across the three modalities, using metrics such as I-FID, FVD, and CLIP-SIM [17].

Quantitative Experiments
- EgoTwin outperforms baseline methods on all nine evaluation metrics, with significant gains in trajectory alignment and hand score, while also improving the fidelity and consistency of the generated videos and actions [18][19].

Generation Modes
- The framework supports three generation modes: T2VM (text to video and motion), TM2V (text and motion to video), and TV2M (text and video to motion), showcasing its versatility (see the mode-selection sketch below) [24].
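To make the head-centric action representation concrete, here is a minimal sketch of a head-pose token built from the widely used continuous 6D rotation encoding plus a 3D translation, from which a camera extrinsic can be read off directly. The article does not specify EgoTwin's exact parameterization; the function names and the choice of the 6D encoding are assumptions for illustration.

```python
# Sketch of a head-centric pose token: 6D rotation encoding + translation.
# The 6D encoding (first two rows of the rotation matrix) is an assumption;
# EgoTwin's actual action representation may differ.
import torch

def rotmat_to_6d(R: torch.Tensor) -> torch.Tensor:
    """Keep the first two rows of a 3x3 rotation matrix (6 numbers)."""
    return R[..., :2, :].reshape(*R.shape[:-2], 6)

def sixd_to_rotmat(d6: torch.Tensor) -> torch.Tensor:
    """Gram-Schmidt the two stored rows back into a full rotation matrix."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

def head_token(R_head: torch.Tensor, t_head: torch.Tensor) -> torch.Tensor:
    """Concatenate 6D rotation and 3D translation into a 9-D head pose token."""
    return torch.cat([rotmat_to_6d(R_head), t_head], dim=-1)

# Usage: an identity head pose becomes a 9-D token.
tok = head_token(torch.eye(3), torch.zeros(3))
```

Because the head pose (and hence the egocentric camera pose) is stored explicitly in the token, the video branch can be conditioned on it directly rather than inferring it from body joints, which is the alignment benefit the article highlights.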
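The bidirectional causal attention can be illustrated with a joint mask over video and action tokens in which each token may attend to tokens of either modality at the same or earlier positions on a shared timeline, approximating the observation-action causal loop. This is a sketch of the general idea; the actual mask used by EgoTwin (token granularity, chunking, strictness of causality) is not given in the article.

```python
# Illustrative joint attention mask over [video tokens | action tokens].
# True marks an allowed (query, key) pair. EgoTwin's actual mask may differ.
import torch

def causal_cross_mask(n_video: int, n_action: int) -> torch.Tensor:
    """Boolean mask of shape (n_video + n_action, n_video + n_action)."""
    # Place both modalities on one shared timeline by frame index.
    pos = torch.cat([torch.arange(n_video), torch.arange(n_action)])
    # Query i may attend to key j iff pos[j] <= pos[i], in either modality.
    return pos.unsqueeze(1) >= pos.unsqueeze(0)

mask = causal_cross_mask(n_video=8, n_action=8)
# A boolean mask like this can be passed as attn_mask to
# torch.nn.functional.scaled_dot_product_attention (True = may attend).
```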
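The asynchronous diffusion mechanism named in the title can be sketched as a joint training step in which the video and action latents receive independently sampled diffusion timesteps but are denoised by one shared model, so cross-modal attention keeps the two streams consistent. The noise schedule, step count, and `denoiser` interface below are illustrative assumptions, not EgoTwin's actual implementation.

```python
# Minimal sketch of an asynchronous diffusion training step: video and
# action latents are noised with independently sampled timesteps while
# one shared denoiser predicts both noises. Schedule and interface are
# assumptions for illustration.
import torch

T = 1000  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Standard DDPM forward process q(x_t | x_0) for a batch of latents."""
    a = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

def async_training_step(video_latent, action_latent, denoiser):
    """Sample a separate timestep per modality, noise each independently,
    then predict both noises jointly so cross-modal attention stays intact."""
    b = video_latent.shape[0]
    t_video = torch.randint(0, T, (b,))
    t_action = torch.randint(0, T, (b,))  # independent of t_video
    noisy_v, eps_v = add_noise(video_latent, t_video)
    noisy_a, eps_a = add_noise(action_latent, t_action)
    pred_v, pred_a = denoiser(noisy_v, noisy_a, t_video, t_action)
    return (torch.nn.functional.mse_loss(pred_v, eps_v)
            + torch.nn.functional.mse_loss(pred_a, eps_a))
```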
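One plausible way a single jointly trained model can serve all three modes is to start the modality to be generated from pure noise while clamping the conditioning modality to its clean latent throughout sampling. The article lists the modes but not their mechanism, so the mode-selection helper below is a hypothetical illustration.

```python
# Hypothetical mode-selection helper: decide which latents start from
# noise (generation targets) and which stay clamped to clean conditioning
# latents during sampling. Mode names follow the article; the clamping
# strategy itself is an assumption.
import torch

def select_inputs(mode: str, video_latent: torch.Tensor, action_latent: torch.Tensor):
    """mode: 'T2VM' (text -> video + motion),
             'TM2V' (text + motion -> video),
             'TV2M' (text + video -> motion)."""
    gen_video = mode in ("T2VM", "TM2V")   # video is a generation target
    gen_action = mode in ("T2VM", "TV2M")  # action is a generation target
    v0 = torch.randn_like(video_latent) if gen_video else video_latent
    a0 = torch.randn_like(action_latent) if gen_action else action_latent
    return v0, a0, gen_video, gen_action
```

During sampling, only the modalities flagged as generation targets would be updated by the denoiser, while the clamped modality is re-injected at every step, a common conditioning strategy in multimodal diffusion.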