SigLIP
Oxford VGG, HKU, and SJTU release ELIP: enhanced vision-language pre-training for multimodal image retrieval that surpasses CLIP and other models
机器之心 · 2025-10-29 11:02
Core Insights
- The article discusses the significance of multimodal image retrieval in computer vision and multimodal machine learning, highlighting the use of large-scale pre-trained models such as CLIP and SigLIP for their strong zero-shot capabilities [2]
- A new method called ELIP (Enhance Language-Image Pre-training) is proposed to improve the text-to-image retrieval performance of vision-language models; it was nominated for best paper at the IEEE International Conference on Content-Based Multimedia Indexing [2]

Method Overview
- ELIP first ranks images with the unmodified CLIP/SigLIP model, then re-ranks the top-k candidates using a simple MLP mapping network that injects text features into the image encoder (a minimal sketch of this re-ranking step follows this summary) [5]
- ELIP can be applied to several large models, including CLIP, SigLIP, SigLIP-2, and BLIP-2, yielding the variants ELIP-C, ELIP-S, ELIP-S-2, and ELIP-B respectively [5]

Challenges in Academic Research
- Pre-training vision-language models is typically an industrial-scale undertaking, but the proposed method can be trained with limited resources, such as two GPUs [8]

Innovations in Model Architecture
- The architectural innovation is to freeze the weights of the large image and text encoders and train only the MLP mapping network, which consists of three linear layers with GeLU activations [9]
- Training maps text features into the visual feature space to guide image encoding, using an InfoNCE loss for the CLIP variant and a sigmoid loss for the SigLIP variant (see the loss sketch below) [9]

Innovations in Training Data
- To cope with limited GPU resources, ELIP builds hard-sample training batches from CLIP feature similarities, which strengthens the model's discriminative ability (a sketch of such batch construction follows) [13]
- The article gives examples of how samples with similar features are grouped to form hard training batches [15]

New Evaluation Datasets
- In addition to standard benchmarks such as COCO and Flickr, two out-of-distribution (OOD) datasets, Occluded COCO and ImageNet-R, are introduced to evaluate the model under distribution shift [18]

Experimental Results
- The results show significant gains in image retrieval for models equipped with ELIP; for example, ELIP-S reaches a recall@1 of 61.03 on COCO, compared with 54.21 for SigLIP [21]
- ELIP-B applied to BLIP-2 also improves performance, surpassing the latest Q-Pert method [20]

Attention Mechanism Observations
- The authors observe that when the text query is relevant, ELIP sharpens the CLS token's attention on the image regions related to the query, improving information extraction [23]
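The mechanics of the re-ranking step can be made concrete with a small sketch. The following PyTorch code is only an illustration of the described idea, not the authors' released implementation: the encoder interfaces (`image_encoder` accepting extra prompt tokens, both encoders producing features in a shared embedding space), the number of prompt tokens, and the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELIPMappingNetwork(nn.Module):
    """Maps a global text feature to a few visual prompt vectors.

    Three linear layers with GELU activations, following the description
    of the MLP mapping network; the exact sizes are illustrative.
    """
    def __init__(self, text_dim=768, vis_dim=1024, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.vis_dim = vis_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, num_prompts * vis_dim),
        )

    def forward(self, text_feat):              # (B, text_dim)
        prompts = self.mlp(text_feat)          # (B, num_prompts * vis_dim)
        return prompts.view(-1, self.num_prompts, self.vis_dim)

@torch.no_grad()
def rerank_top_k(query_feat, candidate_images, image_encoder, mapper):
    """Re-rank the top-k candidate images for one text query.

    `image_encoder` is assumed to accept extra prompt tokens prepended to
    its patch tokens and to return a feature in the same joint embedding
    space as `query_feat`; both encoders stay frozen, only `mapper` is trained.
    """
    prompts = mapper(query_feat.unsqueeze(0))             # (1, P, vis_dim)
    scores = []
    for img in candidate_images:
        img_feat = image_encoder(img.unsqueeze(0), extra_tokens=prompts)
        scores.append(F.cosine_similarity(img_feat, query_feat.unsqueeze(0)))
    scores = torch.cat(scores)
    return torch.argsort(scores, descending=True)          # new candidate order
```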
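The summary names two training objectives: InfoNCE for the CLIP-based variant and a sigmoid loss for the SigLIP-based variant. Below is a compact sketch of both over a batch of matched image/text embeddings; the temperature and bias values are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) over a batch of matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every image/text pair is an independent binary label."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * temperature + bias      # (B, B)
    # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # normalised by batch size, following the SigLIP convention
    return -F.logsigmoid(labels * logits).sum() / img_emb.size(0)
```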
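How hard-sample batches might be assembled from CLIP feature similarity is sketched below; the nearest-neighbour grouping is an assumption about the general recipe the article describes, not the authors' exact sampling code.

```python
import torch

def build_hard_batches(clip_feats, batch_size=32):
    """Group the most similar training samples into the same batch.

    `clip_feats` is a (N, D) tensor of pre-computed, L2-normalised CLIP
    features for the training images (or captions). Each batch is seeded
    by one unused sample and filled with its nearest unused neighbours,
    so the in-batch negatives are hard to tell apart.
    """
    sims = clip_feats @ clip_feats.t()                  # (N, N) cosine similarity
    unused = torch.ones(clip_feats.size(0), dtype=torch.bool)
    batches = []
    while unused.sum() >= batch_size:
        seed = torch.nonzero(unused)[0].item()
        cand_sims = sims[seed].clone()
        cand_sims[~unused] = float("-inf")              # never reuse a sample
        batch = torch.topk(cand_sims, batch_size).indices
        unused[batch] = False
        batches.append(batch)
    return batches
```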
New work from Saining Xie: VAE retires, RAE takes its place
量子位 · 2025-10-14 08:16
Core Viewpoint
- The era of Variational Autoencoders (VAE) is coming to an end, with Representation Autoencoders (RAE) set to take over in the field of diffusion models [1][3]

Summary by Sections

RAE Introduction
- RAE is a new type of autoencoder for training diffusion Transformers (DiT): it pairs a pre-trained representation encoder (such as DINO, SigLIP, or MAE) with a lightweight decoder, replacing the traditional VAE (a minimal sketch follows this summary) [3][9]

Advantages of RAE
- RAE delivers high-quality reconstructions and a semantically rich latent space, supports scalable transformer-based architectures, and converges faster without any additional representation-alignment loss [4][10]

Performance Metrics
- At 256×256 resolution, the FID without guidance is 1.51; with guidance it is 1.13 at both 256×256 and 512×512 [6]

Limitations of VAE
- VAE's backbone networks are outdated and overly complex, requiring about 450 GFLOPs where a simple ViT-B encoder needs only about 22 GFLOPs [7]
- VAE's heavily compressed latent space (only 4 channels) severely limits how much information the latents can carry [7]
- VAE's representations are weak because it is trained purely for reconstruction, which yields low feature quality, slows convergence, and hurts generation quality [7]

RAE's Design and Training
- RAE combines a frozen pre-trained representation encoder with a trained decoder; it needs no extra training or alignment stage and introduces no auxiliary loss functions [9]
- Despite its simplicity, RAE outperforms SD-VAE in reconstruction quality [10]

Model Comparisons
- RAE variants built on DINOv2-B, SigLIP2-B, and MAE-B all show clear improvements in rFID and Top-1 accuracy over SD-VAE [11]

Adjustments for Diffusion Models
- Working in RAE's high-dimensional latent space requires only simple adjustments: a wide DiT design, an adapted noise schedule, and noise injection during decoder training (a sketch of the noise-injection step follows) [13][17]
- A DiT-XL trained on RAE surpasses REPA without any auxiliary losses or extra training stages, converging up to 16 times faster than SD-VAE-based REPA [18][19]

Scalability and Efficiency
- The new architecture improves DiT's scalability in both training compute and model size, outperforming standard RAE-based DiT as well as traditional VAE-based methods [24]
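To make the RAE recipe concrete, here is a minimal PyTorch sketch of the idea as summarized above: a frozen pre-trained representation encoder (e.g., a DINOv2-style ViT that returns patch tokens) paired with a lightweight trainable transformer decoder, trained with a plain reconstruction objective and no KL term or auxiliary alignment loss. The module sizes, patch size, encoder interface, and loss choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepresentationAutoencoder(nn.Module):
    """Frozen representation encoder + lightweight trainable decoder."""

    def __init__(self, encoder, latent_dim=768, patch=16, img_size=256,
                 dec_dim=512, dec_depth=6):
        super().__init__()
        self.encoder = encoder.eval()                # pre-trained, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)

        self.patch, self.img_size = patch, img_size
        num_tokens = (img_size // patch) ** 2
        self.proj_in = nn.Linear(latent_dim, dec_dim)
        layer = nn.TransformerEncoderLayer(dec_dim, nhead=8,
                                           dim_feedforward=4 * dec_dim,
                                           batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=dec_depth)
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dec_dim))
        self.to_pixels = nn.Linear(dec_dim, patch * patch * 3)

    def encode(self, x):
        # assumed encoder interface: returns (B, N, latent_dim) patch tokens
        with torch.no_grad():
            return self.encoder(x)

    def decode(self, z):
        h = self.decoder(self.proj_in(z) + self.pos)
        pix = self.to_pixels(h)                       # (B, N, patch*patch*3)
        B, N, _ = pix.shape
        side = self.img_size // self.patch
        pix = pix.view(B, side, side, self.patch, self.patch, 3)
        return pix.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, self.img_size, self.img_size)

    def forward(self, x):
        recon = self.decode(self.encode(x))
        return F.l1_loss(recon, x)                    # simple reconstruction loss; no KL term
```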
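One of the listed adjustments, injecting noise into the latents while training the decoder, can be sketched as a small change to the training step. This assumes the `RepresentationAutoencoder` class sketched above and an optimizer built over the decoder parameters only; the noise scale is a placeholder, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def decoder_training_step(rae, images, optimizer, noise_std=0.2):
    """Train only the decoder, on latents perturbed with Gaussian noise.

    Perturbing the frozen-encoder latents makes the decoder tolerant of
    the imperfect latents a diffusion model will later produce;
    `noise_std` is an illustrative value.
    """
    z = rae.encode(images)                         # frozen encoder, no gradients
    z_noisy = z + noise_std * torch.randn_like(z)  # noise injection
    recon_loss = F.l1_loss(rae.decode(z_noisy), images)
    optimizer.zero_grad()
    recon_loss.backward()
    optimizer.step()
    return recon_loss.item()
```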