Multi-View Image Generation

Shanghai Qi Zhi Institute & Tsinghua: BEV-VAE, the first self-supervised BEV-view VAE, a leap from image generation to scene generation
自动驾驶之心 · 2025-07-08 12:45
Core Viewpoint
- The article presents BEV-VAE, a method that enables precise generation and manipulation of multi-view images in autonomous driving, emphasizing the importance of a structured representation for understanding three-dimensional scenes [2][4][28].

Group 1: Methodology
- BEV-VAE employs a variational autoencoder (VAE) to learn a compact, unified bird's-eye-view (BEV) latent space, followed by a Diffusion Transformer that generates spatially consistent multi-view images (minimal sketches of both stages appear after the conclusion) [2][7].
- The model supports generating images under arbitrary camera configurations while accepting three-dimensional layout information as a control signal [2][11].
- The architecture consists of an encoder, a decoder, and a StyleGAN discriminator, which together enforce spatial consistency among images from different views [7][8].

Group 2: Advantages
- BEV-VAE provides a structured representation that captures the complete semantics and spatial structure of multi-view images, simplifying the construction of world models [28].
- The model decouples spatial modeling from generative modeling, making the learning process more efficient [28].
- It is compatible with various camera configurations, demonstrating cross-platform applicability [28].

Group 3: Experimental Results
- Experiments on the nuScenes and Argoverse 2 (AV2) datasets show that BEV-VAE outperforms existing models on multi-view image reconstruction and generation tasks [21][22].
- Reconstruction quality improves with higher latent dimensionality, reaching a PSNR of 26.32 and an SSIM of 0.7455 at a latent shape of 32 × 32 × 32 (see the metrics snippet at the end of this article) [22].
- BEV-VAE allows fine-grained editing of objects in a scene, indicating that it has learned the three-dimensional structure and complete semantics of the environment [18][19].

Group 4: Conclusion
- BEV-VAE significantly lowers the barrier to applying generative models in autonomous driving, enabling researchers to build and extend world models at lower cost and with higher efficiency [28].
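To make the two-stage design concrete, below is a minimal sketch of the first stage: a VAE that encodes N camera views into one shared BEV latent grid and decodes that grid back into N views. Everything here is an illustrative assumption rather than the paper's actual implementation: the class name, layer shapes, and especially the simple 1x1-conv fusion, which stands in for the real model's lifting of image features into the BEV grid via camera intrinsics and extrinsics. The StyleGAN discriminator used during training is also omitted.

```python
import torch
import torch.nn as nn

class BEVVAESketch(nn.Module):
    """Hypothetical sketch of the BEV-VAE idea: N views -> one BEV latent
    grid -> N reconstructed views. Shapes are illustrative, not the paper's."""

    def __init__(self, n_views=6, latent_ch=32, feat_ch=128):
        super().__init__()
        self.n_views = n_views
        self.feat_ch = feat_ch
        # Shared per-view image encoder: 256x256 -> 32x32 feature maps.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, feat_ch, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.SiLU(),
        )
        # Stand-in for view lifting/fusion: the real model places features in
        # the BEV grid with camera geometry; here a 1x1 conv fuses stacked views.
        self.fuse = nn.Conv2d(n_views * feat_ch, 2 * latent_ch, 1)  # mean, logvar
        self.unfuse = nn.Conv2d(latent_ch, n_views * feat_ch, 1)
        # Shared per-view decoder: 32x32 latent features -> 256x256 image.
        self.img_decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(feat_ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def encode(self, views):                       # views: (B, N, 3, 256, 256)
        b, n, c, h, w = views.shape
        f = self.img_encoder(views.flatten(0, 1))  # (B*N, feat_ch, 32, 32)
        f = f.view(b, n * self.feat_ch, *f.shape[-2:])
        mean, logvar = self.fuse(f).chunk(2, dim=1)
        return mean, logvar                        # each (B, latent_ch, 32, 32)

    def decode(self, z):                           # z: (B, latent_ch, 32, 32)
        b = z.shape[0]
        f = self.unfuse(z).view(b * self.n_views, self.feat_ch, 32, 32)
        return self.img_decoder(f).view(b, self.n_views, 3, 256, 256)

    def forward(self, views):
        mean, logvar = self.encode(views)
        # Reparameterization trick: sample z from N(mean, exp(logvar)).
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
        recon = self.decode(z)
        # KL term against a standard normal prior, as in a standard VAE.
        kl = -0.5 * (1 + logvar - mean.pow(2) - logvar.exp()).mean()
        return recon, kl

model = BEVVAESketch()
recon, kl = model(torch.randn(1, 6, 3, 256, 256))
print(recon.shape)  # torch.Size([1, 6, 3, 256, 256])
```

Note that with these default arguments the latent z has shape 32 × 32 × 32 per sample, matching the best-performing latent configuration reported above; the default of six cameras follows the nuScenes surround-view setup.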
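The second stage trains a Diffusion Transformer in this BEV latent space, conditioned on the 3D object layout. The sketch below shows only the sampling side: a generic DDPM-style reverse loop with classifier-free guidance. The noise schedule, step count, and guidance scale are placeholder choices, and `denoiser` and `layout_emb` stand in for the trained DiT and a layout encoder that are not shown here.

```python
import torch

@torch.no_grad()
def sample_bev_latent(denoiser, layout_emb, steps=50,
                      shape=(1, 32, 32, 32), guidance=2.0):
    """Hypothetical DDPM-style sampler over the BEV latent. `denoiser` is any
    callable predicting noise from (z_t, t, cond); passing cond=None requests
    the unconditional prediction. Settings are illustrative, not the paper's."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)
    for t in reversed(range(steps)):
        # Classifier-free guidance: blend conditional and unconditional noise.
        eps_c = denoiser(z, t, layout_emb)
        eps_u = denoiser(z, t, None)
        eps = eps_u + guidance * (eps_c - eps_u)
        # DDPM posterior mean for p(z_{t-1} | z_t).
        z = (z - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return z  # decode with the BEV-VAE decoder afterwards
```

Under this design, the fine-grained object editing described in Group 3 amounts to modifying the layout condition (moving, removing, or inserting a box), re-sampling the BEV latent, and decoding; spatial consistency across all camera views comes for free from the shared latent.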
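The PSNR and SSIM figures quoted in Group 3 are standard full-reference reconstruction metrics. For readers who want to run that style of evaluation on their own reconstructions, the snippet below computes both with scikit-image; the [0, 1] float-image convention is an assumption about how the images are stored.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_metrics(original, reconstructed):
    """PSNR and SSIM between two (H, W, 3) float images in [0, 1]."""
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=1.0)
    ssim = structural_similarity(original, reconstructed,
                                 data_range=1.0, channel_axis=-1)
    return psnr, ssim

# Toy usage: a clean image versus a slightly noisy copy of it.
img = np.random.rand(256, 256, 3)
noisy = np.clip(img + np.random.normal(0, 0.02, img.shape), 0.0, 1.0)
print(reconstruction_metrics(img, noisy))
```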