ICCV2025 | A New Paradigm for Multi-View Generation: Exploring Multi-View Generation with Autoregressive Models
机器之心· 2025-07-12 02:11
机器之心 · 2025-07-12 02:11
Core Viewpoint - The article introduces and develops a self-regressive generative multi-view image method called MVAR, aimed at enhancing consistency across multiple views by effectively extracting guiding information from previously generated views [2][3]. Background and Motivation - Generating multi-view images based on artificial instructions is crucial for 3D content creation, with challenges in maintaining consistency and effectively synthesizing shapes and textures across different views [6][7]. - Previous works primarily utilized the multi-view consistency prior inherent in diffusion models, which have inherent limitations, such as difficulty in handling multiple modalities and reduced effectiveness when generating images from distant views [8][10]. MVAR Model - MVAR bridges the gap between pure autoregressive methods and state-of-the-art diffusion-based multi-view image generation methods, becoming capable of handling simultaneous multi-modal conditions [3][15]. - The model leverages autoregressive generation, allowing it to utilize information from previously generated views to enhance the generation of the current view [12][13]. Challenges in Multi-View Generation - The autoregressive model faces challenges such as multi-modal condition control and limited high-quality training data, which hinder its application in multi-view image tasks [24][25]. Solutions Provided by MVAR - MVAR proposes specific solutions to address the challenges, including a multi-modal condition embedding network architecture that incorporates text, camera poses, images, and geometry [26][27]. - The model employs a Shuffle View (ShufV) data augmentation strategy to enhance the limited high-quality data by using different orders of camera paths during training [34][36]. Experimental Results - MVAR narrows the gap between autoregressive multi-view generation models and existing diffusion models, demonstrating stronger instruction adherence and multi-view consistency [41]. 
- In numerical comparisons with advanced diffusion-based methods, MVAR achieved the highest PSNR (22.99) and the second-best SSIM (0.907), while trailing slightly on the LPIPS perceptual metric [42][44].

Future Work
- Future efforts will focus on enhancing performance through continuous causal 3D VAE tokenization of multi-view images, and on unifying multi-view generation and understanding tasks, particularly in scenarios with limited high-precision 3D data [47].
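For reference, the PSNR figure cited in the results is the standard peak signal-to-noise ratio, 10 · log10(MAX² / MSE), where higher is better. A minimal sketch over flat pixel lists (the helper name is an assumption for illustration):

```python
import math

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-size images,
    flattened to pixel lists here for simplicity.
    PSNR = 10 * log10(max_val^2 / MSE); identical inputs give inf."""
    assert len(ref) == len(test)
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

PSNR measures pixel-level fidelity, while SSIM captures structural similarity and LPIPS measures perceptual distance (lower is better), which is why the three metrics in the comparison can rank methods differently.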