A vision model that both understands semantics and restores details: NTU & SenseTime propose the Prism Hypothesis
机器之心·2026-01-13 10:04

Core Insights
- The article introduces the Prism Hypothesis and Unified Autoencoding (UAE), which aim to harmonize semantic and pixel representations by resolving the conflict between semantic understanding and detail reconstruction [2][5][10].

Background
- Visual foundation models struggle to achieve both semantic understanding and detail restoration; many systems are forced to stitch together two separate representations, which lowers training efficiency and introduces interference [3][4].

Key Concepts
- The Prism Hypothesis posits that a representation of visual information must both carry shared semantics and retain fine-grained details [4][5].
- Semantic encoders (e.g., DINOv2, CLIP) excel at abstract information, while pixel encoders (e.g., the SD-series VAEs) are better at reconstructing details such as textures and edges [5][10].

Methodology
- Unified Autoencoding (UAE) synthesizes both representations by structuring the learning of multi-frequency latent variables, separating the roles of semantics and details [11][13].
- The method involves four components (a minimal sketch of components 2-4 appears at the end of this summary):
  1. Unified Encoder: initialized from a semantic model to transition into a unified latent space [14].
  2. Residual Split Flow: uses FFT-based frequency-band projection and iterative residual splitting to decompose the latent variables into multiple frequency bands [15].
  3. Frequency Band Modulator: perturbs only the high-frequency details and integrates them for the decoder [16].
  4. Semantic-wise Loss: applies the semantic constraint only to the lowest frequency band, leaving higher bands free to learn details [17].

Experimental Results
- UAE demonstrates superior reconstruction quality on ImageNet and MS-COCO, achieving PSNR = 33.08, SSIM = 0.94, and rFID = 0.16 on ImageNet, and PSNR = 32.84, SSIM = 0.94, and rFID = 0.17 on MS-COCO [19][20].
- Compared to the RAE baseline, UAE reaches higher PSNR/SSIM and reduces rFID by over 90% [20].
- In conditional generation on ImageNet, UAE achieves gFID = 1.68 and IS = 301.6 [25].
- For semantic understanding, UAE reaches 83.0% Top-1 accuracy on ImageNet-1K, matching RAE's performance [26][27].
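To make the multi-frequency idea concrete, here is a minimal PyTorch sketch of what components 2-4 describe: an iterative FFT-based residual split of a latent feature map into frequency bands, a semantic constraint applied only to the lowest band, and reconstruction from the recombined bands. This is an illustration under assumptions, not the authors' implementation: the cutoff schedule, the Gaussian perturbation standing in for the Frequency Band Modulator, the `decoder` callable, and the assumption that the frozen teacher features (`teacher_feats`, e.g., from DINOv2) have already been projected to the lowest band's shape are all invented for the example.

```python
import torch
import torch.nn.functional as F

def lowpass(z: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Keep spatial frequencies of a (B, C, H, W) latent below `cutoff` (fraction of Nyquist)."""
    _, _, H, W = z.shape
    fy = torch.fft.fftfreq(H, device=z.device).abs()   # cycles/sample; Nyquist = 0.5
    fx = torch.fft.fftfreq(W, device=z.device).abs()
    mask = ((fy[:, None] <= cutoff * 0.5) & (fx[None, :] <= cutoff * 0.5)).float()
    Z = torch.fft.fft2(z)                               # project the latent into the frequency domain
    return torch.fft.ifft2(Z * mask).real               # filtered latent back in the spatial domain

def residual_split(z: torch.Tensor, cutoffs=(0.125, 0.25, 0.5, 1.0)):
    """Iterative residual split: lowest band first; the final cutoff of 1.0 absorbs
    whatever remains, so the bands sum back to z."""
    bands, residual = [], z
    for c in cutoffs:
        band = lowpass(residual, c)
        bands.append(band)
        residual = residual - band
    return bands

def uae_style_losses(z, x, teacher_feats, decoder, noise_std=0.05):
    """Semantic constraint on the lowest band only; reconstruction uses all bands,
    with the high-frequency part lightly perturbed (stand-in for the band modulator)."""
    bands = residual_split(z)
    z_low, z_high = bands[0], sum(bands[1:])
    # Semantic-wise loss: align only the lowest band with frozen teacher features.
    sem_loss = 1 - F.cosine_similarity(
        z_low.flatten(1), teacher_feats.flatten(1), dim=1).mean()
    # Assumed form of band modulation: perturb only the high-frequency detail bands.
    z_high = z_high + noise_std * torch.randn_like(z_high)
    # Pixel reconstruction from the recombined latent.
    rec_loss = F.l1_loss(decoder(z_low + z_high), x)
    return sem_loss, rec_loss
```

The point of the split, as described above, is that the semantic supervision only ever touches the lowest band, so the gradients shaping textures and edges in the higher bands do not compete with the alignment objective.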