A New Breakthrough in Multimodal Retrieval: Soft Labels Break the Rigid Mapping Constraint of Traditional Methods, Comprehensively Surpassing CLIP | AAAI 2026 Oral
量子位 (QbitAI) · 2025-11-15 05:00

Core Insights
- The article introduces UniME-V2, a new unified multimodal embedding model that addresses the limitations of existing hard-negative mining methods and deepens semantic understanding through a mechanism called "MLLM-as-a-Judge" [3][9].

Group 1: Model Overview
- UniME-V2 improves training by first constructing a pool of potential hard negatives via global retrieval, then using an MLLM to evaluate each query-candidate pair and produce soft semantic matching scores [3][4][9].
- The model aligns its similarity matrix with these soft matching scores, markedly sharpening its ability to discern fine-grained semantic differences among candidate samples [5][6].

Group 2: Methodology
- The method proceeds in two steps: first, a pool of potential hard negatives is built through global retrieval; second, the MLLM assesses the semantic alignment of each query-candidate pair and generates matching scores [13][14][15].
- A re-ranking model, UniME-V2-Reranker, is then trained on the mined hard negatives with a joint pairwise-and-listwise optimization strategy to further boost performance [6][30].

Group 3: Performance Evaluation
- UniME-V2 delivers clear gains over existing baselines across a range of tasks, including improvements of 3.5% and 2.2% over VLM2Vec with the Qwen2-VL-2B and 7B backbones, respectively [36][37].
- On out-of-distribution datasets the model scores 66.7, indicating strong transferability and robustness [38].

Group 4: Cross-Modal Retrieval
- In zero-shot cross-modal retrieval, UniME-V2 outperforms previous models, with a 2.2%-9.7% improvement in image-to-text retrieval and significant gains on long-description tasks [41][42].
- Its ability to distinguish hard negatives is highlighted by performance improvements of 5.3%, 6.0%, and 4.5% with Qwen2-VL-2B, and 9.0%, 9.2%, and 9.2% when scaled to 7B [47][48].
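The soft-label alignment idea above (matching the embedding model's similarity matrix to the MLLM judge's soft matching scores) can be sketched as a KL-divergence loss. This is a minimal NumPy illustration, not the paper's implementation: the function names, the temperature value, and the use of cosine similarity plus row-wise softmax are all assumptions for the sketch.

```python
import numpy as np

def softmax(x, tau=1.0):
    # temperature-scaled, numerically stable row-wise softmax
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_alignment_loss(query_emb, cand_emb, judge_scores, tau=0.05):
    """KL(judge distribution || model distribution), averaged over queries.

    query_emb:    (Q, D) query embeddings
    cand_emb:     (C, D) candidate embeddings (positives + mined hard negatives)
    judge_scores: (Q, C) soft semantic matching scores from the MLLM judge
    tau:          illustrative temperature for the model's similarity softmax
    """
    # cosine similarity matrix between queries and candidates
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = q @ c.T
    p_model = softmax(sim, tau)          # model's candidate distribution
    p_judge = softmax(judge_scores)      # judge's soft-label distribution
    kl = (p_judge * (np.log(p_judge + 1e-9) - np.log(p_model + 1e-9))).sum(axis=1)
    return kl.mean()
```

When the judge's scores agree with the embedding similarities the loss is near zero; disagreement (e.g., a hard negative the judge rates low but the model ranks high) produces a large gradient signal, which is what lets soft labels replace the rigid one-positive-per-query mapping of standard contrastive training.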
Group 5: Re-ranking Performance
- UniME-V2-Reranker surpasses LamRA across four downstream tasks while using only half the training data [52].
- Its advantage on complex-understanding retrieval tasks is attributed to its effective extraction of diverse, high-quality hard negatives, which strengthens its discriminative capability [53].
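The joint pairwise-and-listwise optimization used to train the reranker can be sketched as follows. This is a generic illustration of such an objective, assuming a hinge-style pairwise term and a softmax cross-entropy listwise term; the function names, the margin, and the mixing weight `alpha` are hypothetical, not taken from the paper.

```python
import numpy as np

def pairwise_loss(pos_score, neg_scores, margin=1.0):
    # hinge loss: the positive should outscore each hard negative by a margin
    return np.maximum(0.0, margin - (pos_score - neg_scores)).mean()

def listwise_loss(scores, pos_index):
    # softmax cross-entropy over the whole candidate list
    z = scores - scores.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[pos_index]

def joint_loss(scores, pos_index, alpha=0.5, margin=1.0):
    """Mix of pairwise and listwise terms over one query's candidate list.

    scores:    (C,) reranker scores for all candidates of one query
    pos_index: index of the positive candidate
    alpha:     illustrative mixing weight between the two terms
    """
    pos = scores[pos_index]
    negs = np.delete(scores, pos_index)
    return alpha * pairwise_loss(pos, negs, margin) + (1 - alpha) * listwise_loss(scores, pos_index)
```

The pairwise term enforces a local margin against each mined hard negative, while the listwise term normalizes over the full candidate list; combining them is a common way to get both fine-grained separation and globally calibrated rankings.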