Multimodal Image Retrieval
Oxford VGG, HKU, and SJTU release ELIP: enhanced vision-language large-model pre-training for multimodal image retrieval that surpasses CLIP and others
机器之心· 2025-10-29 11:02
Core Insights
- The article discusses the significance of multimodal image retrieval in computer vision and multimodal machine learning, highlighting how large-scale pre-trained models such as CLIP and SigLIP provide strong zero-shot capabilities [2]
- A new method, ELIP (Enhance Language-Image Pre-training), is proposed to improve the text-to-image retrieval performance of vision-language models; the work was nominated for best paper at the IEEE International Conference on Content-Based Multimedia Indexing [2]

Method Overview
- ELIP first ranks images with a standard CLIP/SigLIP pass, then re-ranks the top-k candidates by feeding the text features, mapped through a simple MLP network, into the image encoder to guide image encoding (a minimal sketch of this re-ranking step is given after the summary) [5]
- ELIP can be applied to a range of large models, including CLIP, SigLIP, SigLIP-2, and BLIP-2; the resulting variants are referred to as ELIP-C, ELIP-S, ELIP-S-2, and ELIP-B respectively [5]

Challenges in Academic Research
- Pre-training vision-language models is usually an industrial-scale undertaking, but the proposed method can be trained with limited resources, such as two GPUs [8]

Innovations in Model Architecture
- The architectural innovation is to freeze the weights of the large image and text encoders and train only the MLP mapping network, which consists of three linear layers with GeLU activations [9]
- Training maps text features into the visual feature space to guide image encoding, using an InfoNCE loss for the CLIP variant and a sigmoid loss for the SigLIP variant (see the loss sketch below) [9]

Innovations in Training Data
- To cope with limited GPU resources, ELIP constructs hard-sample training batches from CLIP feature similarities, strengthening the model's discriminative ability (see the batching sketch below) [13]
- The article gives examples of how samples with similar features are grouped to form hard training batches [15]

New Evaluation Datasets
- In addition to standard benchmarks such as COCO and Flickr, two out-of-distribution (OOD) datasets, Occluded COCO and ImageNet-R, are introduced to evaluate the model under distribution shift [18]

Experimental Results
- Models equipped with ELIP show significant gains in image retrieval: ELIP-S reaches a recall@1 of 61.03 on COCO, compared with 54.21 for SigLIP [21]
- ELIP-B applied to BLIP-2 also improves performance, surpassing the latest Q-Pert method [20]

Attention Mechanism Observations
- The authors observe that when the text query is relevant to an image, ELIP sharpens the attention of the CLS token on the image regions related to the query, improving information extraction [23]
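The following is a minimal sketch of the re-ranking step described under "Method Overview": a frozen text encoder produces a query feature, a small trainable MLP maps it into the visual space as prompt tokens, and the frozen image encoder re-scores the top-k candidates from the first-stage CLIP/SigLIP ranking. The class and function names (MappingMLP, rerank_top_k, encode_image_with_prompts) and the exact prompting mechanism are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class MappingMLP(nn.Module):
    """Three linear layers with GeLU activations, mapping one text feature
    to a small set of visual prompt vectors (the only trainable part)."""

    def __init__(self, text_dim: int, vis_dim: int, hidden: int = 1024, n_prompts: int = 4):
        super().__init__()
        self.n_prompts = n_prompts
        self.vis_dim = vis_dim
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, vis_dim * n_prompts),
        )

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # (B, text_dim) -> (B, n_prompts, vis_dim)
        out = self.net(text_feat)
        return out.view(text_feat.size(0), self.n_prompts, self.vis_dim)


@torch.no_grad()
def rerank_top_k(text_feat, candidate_images, image_encoder, mapper, encode_image_with_prompts):
    """Re-score the top-k images for one text query.

    text_feat: (1, text_dim) frozen text-encoder output for the query.
    candidate_images: (k, 3, H, W) images ranked highest in stage one.
    encode_image_with_prompts: hypothetical hook that prepends the prompt
    tokens to the patch tokens of the frozen ViT image encoder.
    """
    prompts = mapper(text_feat)                           # (1, n_prompts, vis_dim)
    prompts = prompts.expand(candidate_images.size(0), -1, -1)
    img_feat = encode_image_with_prompts(image_encoder, candidate_images, prompts)
    img_feat = nn.functional.normalize(img_feat, dim=-1)
    txt = nn.functional.normalize(text_feat, dim=-1)
    scores = img_feat @ txt.t()                           # (k, 1) cosine similarities
    return scores.squeeze(1).argsort(descending=True)     # new ordering of the k candidates
```

Only MappingMLP carries gradients during training; at retrieval time the whole pipeline runs frozen, which is why the re-ranking stage stays cheap enough for the two-GPU setting mentioned above.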
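The two training objectives named in "Innovations in Model Architecture" are standard contrastive losses; the sketch below shows both, assuming a batch of matched image/text features produced by the pipeline above. The temperature and bias handling is simplified for illustration.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(img_feat, txt_feat, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs,
    as used for the CLIP-based variant (ELIP-C)."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def sigmoid_loss(img_feat, txt_feat, temperature: float = 10.0, bias: float = -10.0):
    """SigLIP-style pairwise sigmoid loss, as used for the SigLIP-based
    variant (ELIP-S): every (i, j) pair is a binary classification problem,
    positive on the diagonal, negative elsewhere."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() * temperature + bias           # (B, B)
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()
```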
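Finally, a sketch of the hard-sample batching idea from "Innovations in Training Data": pre-compute CLIP features for the training samples, then build each batch from a seed example plus its nearest neighbours, so that the in-batch negatives are hard ones even with the small batches that two GPUs allow. This is an assumed reconstruction; the authors' exact grouping procedure may differ.

```python
import torch
import torch.nn.functional as F


def build_hard_batches(features: torch.Tensor, batch_size: int):
    """features: (N, D) pre-computed CLIP features for the training set.
    Returns a list of index tensors, each one batch of mutually similar samples."""
    feats = F.normalize(features, dim=-1)
    unused = torch.ones(feats.size(0), dtype=torch.bool)
    batches = []
    while unused.sum() >= batch_size:
        seed = torch.nonzero(unused)[0].item()            # first still-unused sample
        sims = feats @ feats[seed]                        # cosine similarity to the seed
        sims[~unused] = -float("inf")                     # never reuse assigned samples
        batch = sims.topk(batch_size).indices             # seed plus its hardest neighbours
        unused[batch] = False
        batches.append(batch)
    return batches
```

Because each batch is drawn from a neighbourhood in CLIP feature space, the contrastive losses above see negatives that are genuinely confusable, which is the discriminative-ability benefit the article attributes to this batching scheme.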