An upper-body 3D avatar from a single image in 0.1 seconds! New Tsinghua-IDEA framework accepted to ICCV 2025

Core Viewpoint - The article introduces GUAVA, a framework developed by researchers from Tsinghua University and IDEA that creates an upper-body 3D avatar from a single image in about 0.1 seconds, with no need for multi-view videos or per-person training [1][5][36].

Summary by Sections

Introduction - GUAVA creates realistic and expressive upper-body avatars, which are valuable in fields such as film, gaming, and virtual meetings [4].

Challenges and Innovations - Creating an avatar from a single image has been a significant challenge, particularly when both real-time rendering and ease of creation are required. GUAVA addresses this by reconstructing the avatar in a single inference pass that takes seconds or less and by supporting real-time animation [5][6].

Methodology - GUAVA introduces the Expressive Human Model (EHM) to improve facial expression capture, overcoming limitations of existing models [12][36].
- The framework uses a two-branch model for avatar reconstruction, combining a "template Gaussian" and a "UV Gaussian" to preserve geometric structure while capturing detailed textures [14][15].
- Real-time animation is achieved by deforming the Ubody Gaussian according to new pose parameters, after which a neural refiner enhances the rendering quality [16][17].

Experimental Results - The experimental dataset contained over 620,000 frames of upper-body video, and evaluation covered ID consistency, efficiency, and viewpoint control [18][20].
- GUAVA outperformed existing 2D and 3D methods in rendering quality and efficiency, reaching approximately 50 FPS with a reconstruction time of around 0.1 seconds [22][23].
- In self-reenactment scenarios, GUAVA outperformed 2D methods on all metrics, and it also maintained ID consistency in cross-reenactment scenarios [22][25].
Conclusion - GUAVA represents a significant advancement in the field of 3D avatar reconstruction, demonstrating improved rendering quality and efficiency over existing methods, with a reconstruction time of approximately 0.1 seconds and support for real-time animation [36][37].
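
To make the Methodology summary more concrete, here is a minimal sketch of the two ideas it names: merging the two branches into one set of Gaussians, and deforming that set under new pose parameters. It is not GUAVA's implementation. The function names `merge_gaussians` and `deform` are hypothetical, only Gaussian centers are modeled (no color, opacity, or covariance), and linear blend skinning is assumed as the deformation rule because the article does not say which rule GUAVA uses.

```python
import numpy as np


def merge_gaussians(template_xyz, uv_xyz):
    """Concatenate template-branch and UV-branch centers into one (N, 3) array.

    Assumption: the combined "Ubody Gaussian" set is modeled here as a
    plain concatenation of the two branches' Gaussian centers.
    """
    return np.concatenate([template_xyz, uv_xyz], axis=0)


def deform(xyz, weights, transforms):
    """Move Gaussian centers with linear blend skinning (an assumed rule).

    xyz:        (N, 3) Gaussian centers in the rest pose
    weights:    (N, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) per-joint rigid transforms for the new pose
    """
    ones = np.ones((xyz.shape[0], 1))
    homo = np.concatenate([xyz, ones], axis=1)                # (N, 4)
    # Blend the joint transforms per Gaussian, then apply them.
    blended = np.einsum("nj,jab->nab", weights, transforms)   # (N, 4, 4)
    moved = np.einsum("nab,nb->na", blended, homo)            # (N, 4)
    return moved[:, :3]


if __name__ == "__main__":
    template = np.zeros((2, 3))   # toy template-branch centers
    uv = np.ones((3, 3))          # toy UV-branch centers
    ubody = merge_gaussians(template, uv)

    # Two joints; every Gaussian follows joint 0 only.
    weights = np.zeros((ubody.shape[0], 2))
    weights[:, 0] = 1.0
    transforms = np.stack([np.eye(4), np.eye(4)])
    transforms[0, :3, 3] = [0.0, 0.1, 0.0]  # shift joint 0 upward

    print(deform(ubody, weights, transforms))
```

In a full pipeline the deformed Gaussians would then be rasterized and passed through the neural refiner; both steps are omitted here because the summary gives no detail about them.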