Multimodal Retrieval
Teaching large models to "spot the differences" in high dimensions: new research from China Unicom tackles the pain points of long-text image retrieval | AAAI 2026 Oral
量子位· 2025-12-01 05:45
Core Insights
- The article discusses a new state-of-the-art (SOTA) model for long-text image retrieval called HiMo-CLIP, developed by the China Unicom Data Science and AI Research Institute, which addresses limitations in existing models like CLIP by effectively capturing semantic differences in context [2][4].

Group 1: Model Limitations
- Existing models, including Long-CLIP, struggle with long text descriptions, often resulting in decreased alignment scores as the text becomes more detailed, indicating a failure to process the hierarchical structure of language [6][9].
- The phenomenon where longer descriptions lead to lower alignment scores highlights the inadequacy of current models in distinguishing core semantics from detailed information [6][9].

Group 2: HiMo-CLIP Framework
- HiMo-CLIP introduces a plug-and-play representation framework that includes two core components: Hierarchical Decomposition (HiDe) and Monotonicity-aware Contrastive Loss (MoLo) [10][12].
- HiDe dynamically extracts semantic components using PCA within batches, while MoLo enforces alignment between the full text and its semantic components, ensuring monotonicity [12][17] (a minimal sketch of these two components follows this summary).

Group 3: Performance and Efficiency
- HiMo-CLIP demonstrates significant advantages in both long- and short-text retrieval tasks, outperforming models trained on much larger datasets and achieving SOTA with only 1 million training samples [17][20].
- The model's ability to extract distinctive features from complex scenes allows it to maintain high performance across various retrieval benchmarks [18][22].

Group 4: Evaluation Metrics
- The research team constructed the HiMo-Docci dataset and introduced the HiMo@K metric to quantify the model's understanding of hierarchical structure, achieving a monotonicity correlation coefficient of 0.88 and surpassing comparative methods [22][25].
- As text descriptions become more complete, HiMo-CLIP's scores show a consistent upward trend, while other models exhibit significant fluctuations [25][26].
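The summary names HiDe (batch-wise PCA over text embeddings) and MoLo (a monotonicity-aware contrastive term) but does not show how they might fit together. Below is a minimal sketch of one way such components could look, assuming L2-normalized CLIP-style embeddings; the function names `hide_decompose` and `molo_loss`, the choice of top-k in-batch principal directions, and the margin-ranking form of the monotonicity constraint are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hide_decompose(text_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Illustrative HiDe-style step: project batch text embeddings onto the
    top-k in-batch principal directions to obtain coarse semantic components.
    text_emb: (B, D) L2-normalized text embeddings, with B >= k.
    Returns: (B, k, D) per-sample component embeddings (rank-1 projections)."""
    centered = text_emb - text_emb.mean(dim=0, keepdim=True)
    # Batch-wise PCA via SVD; rows of vh are principal directions in embedding space.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    dirs = vh[:k]                                      # (k, D)
    coeffs = text_emb @ dirs.T                         # (B, k)
    comps = coeffs.unsqueeze(-1) * dirs.unsqueeze(0)   # (B, k, D)
    return F.normalize(comps, dim=-1)

def molo_loss(img_emb: torch.Tensor,
              full_txt_emb: torch.Tensor,
              comp_emb: torch.Tensor,
              margin: float = 0.05) -> torch.Tensor:
    """Illustrative MoLo-style term: the full text should align with its image
    at least as well as any of its partial semantic components (monotonicity),
    enforced here with a simple margin ranking penalty."""
    full_sim = (img_emb * full_txt_emb).sum(-1, keepdim=True)   # (B, 1)
    comp_sim = torch.einsum('bd,bkd->bk', img_emb, comp_emb)    # (B, k)
    return F.relu(comp_sim - full_sim + margin).mean()
```

In this reading, `molo_loss` would be added on top of the usual image-text contrastive loss, so that a caption's full form never scores below any of its own partial components, which is one way the "longer description, lower score" failure mode could be penalized.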
A new breakthrough in multimodal retrieval: soft labels break the traditional rigid-mapping constraint and comprehensively surpass CLIP | AAAI 2026 Oral
量子位· 2025-11-15 05:00
Core Insights
- The article introduces a new unified multimodal embedding model, UniME-V2, which addresses limitations of existing methods in negative sample mining and enhances semantic understanding through a novel mechanism called "MLLM-as-a-Judge" [3][9].

Group 1: Model Overview
- UniME-V2 improves the training process by constructing a set of potential hard negative samples through global retrieval and by evaluating query-candidate pairs with an MLLM to generate soft semantic matching scores [3][4][9].
- The model aligns its similarity matrix with these soft semantic matching scores, significantly enhancing its ability to discern semantic differences among candidate samples [5][6] (see the sketch after this summary).

Group 2: Methodology
- The methodology involves two steps: first, constructing the set of potential hard negative samples, and second, using the MLLM to assess semantic alignment and generate matching scores [13][14][15].
- A re-ranking model, UniME-V2-Reranker, is trained on the mined hard negatives with a joint pairwise and listwise optimization strategy to further improve performance [6][30].

Group 3: Performance Evaluation
- UniME-V2 demonstrates significant improvements over existing baseline models, including gains of 3.5% and 2.2% over VLM2Vec for the Qwen2-VL-2B and 7B models, respectively [36][37].
- The model remains robust on out-of-distribution datasets, scoring 66.7, indicating strong transferability and robustness [38].

Group 4: Cross-Modal Retrieval
- In zero-shot cross-modal retrieval, UniME-V2 outperforms previous models, showing a 2.2%-9.7% improvement in image-to-text retrieval and significant gains on long-description tasks [41][42].
- Its ability to distinguish hard negative samples is highlighted by improvements of 5.3%, 6.0%, and 4.5% with Qwen2-VL-2B, and 9.0%, 9.2%, and 9.2% when scaled to 7B [47][48].

Group 5: Re-ranking Performance
- UniME-V2-Reranker surpasses LamRA across four downstream tasks while using only half the data [52].
- Its advantage in complex-understanding retrieval tasks is attributed to the extraction of diverse, high-quality hard samples, which strengthens its discriminative capabilities [53].
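The alignment between the model's similarity matrix and the judge's soft semantic matching scores can be read as a soft-label objective rather than a one-hot contrastive target. The sketch below illustrates one plausible form, assuming a KL divergence between the embedding model's softmax over each query's candidate set and a softmax over the MLLM judge's scores; the function name, the temperature, and the use of KL divergence are assumptions, not UniME-V2's exact loss.

```python
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(query_emb: torch.Tensor,
                              cand_emb: torch.Tensor,
                              judge_scores: torch.Tensor,
                              tau: float = 0.05) -> torch.Tensor:
    """Illustrative soft-label alignment: make the embedding model's similarity
    distribution over each query's candidate set match the MLLM judge's soft
    matching scores, instead of forcing a rigid one-hot target.

    query_emb:    (B, D)    L2-normalized query embeddings
    cand_emb:     (B, K, D) L2-normalized candidates per query (positive + hard negatives)
    judge_scores: (B, K)    soft semantic matching scores from the MLLM judge
    """
    sims = torch.einsum('bd,bkd->bk', query_emb, cand_emb) / tau   # (B, K)
    log_p = F.log_softmax(sims, dim=-1)        # model distribution (log-probs)
    q = F.softmax(judge_scores, dim=-1)        # judge distribution (soft target)
    # KL(q || p): penalize the model wherever it disagrees with the judge.
    return F.kl_div(log_p, q, reduction='batchmean')
```

One consequence of a soft target like this is that near-miss candidates (high judge score, not the labeled positive) are no longer pushed away as hard as clearly wrong ones, which matches the summary's claim about discerning fine semantic differences among candidates.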
Breaking cross-modal interference: Kuaishou and Northeastern University jointly propose a unified multimodal framework that sweeps multimodal retrieval benchmarks
量子位· 2025-06-08 03:40
Contributed by the UNITE team
量子位 | 公众号 QbitAI

Multimodal retrieval is a key technology for understanding and acquiring information, but cross-modal interference has long been a major difficulty in it.

A viable remedy is to construct a unified multimodal representation. To this end, researchers from Kuaishou and Northeastern University have introduced UNITE, a unified multimodal embedding framework.

UNITE's core goal is to build a single embedder that can simultaneously handle text, image, video, and fused-modality inputs. Approaching the problem from two key angles, data curation and the training mechanism, it uses contrastive learning to redefine the paradigm of unified multimodal representation learning.

Across multiple evaluations, including fine-grained retrieval and instruction-based retrieval, the UNITE framework achieves the best results.

Modal-aware contrastive learning mitigates cross-modal interference

In multimodal retrieval, different modalities (text, image, video) naturally differ in distribution. Mixing all modalities together for contrastive learning during training distorts the representation space with semantic drift and interference noise, hurting the model's ability to capture each modality's semantics accurately.

To address this challenge, the UNITE team proposed Modal-Aware Masked Contrastive Learning (MAMCL), a contrastive learning mechanism that markedly alleviates "mutual interference" across modalities. Given a batch of $N$ queries, where each query $q_i$ corresponds to one positive sample $c_i^{+}$ and $K$ negative samples, a similarity matrix is constructed over all query-candidate pairs.

…
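The excerpt describes MAMCL only at a conceptual level: contrastive learning over the query-candidate similarity matrix, with a mask that keeps modalities from interfering with one another. The sketch below is a hedged reading of that idea, assuming the mask restricts each query's softmax to candidates whose modality matches the query's positive target; the function signature, the modality-id encoding, the candidate layout, and the temperature are illustrative assumptions, not the UNITE paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mamcl_loss(query_emb: torch.Tensor,
               cand_emb: torch.Tensor,
               query_target_modality: torch.Tensor,
               cand_modality: torch.Tensor,
               tau: float = 0.05) -> torch.Tensor:
    """Illustrative modal-aware masked contrastive loss.

    query_emb:             (N, D) L2-normalized query embeddings
    cand_emb:              (M, D) L2-normalized candidate embeddings; here we
                                  assume candidate i is the positive of query i (i < N)
    query_target_modality: (N,)   modality id of each query's positive target
    cand_modality:         (M,)   modality id of each candidate
    """
    sims = query_emb @ cand_emb.T / tau                            # (N, M)
    # Modal-aware mask: a candidate competes with a query only if it shares the
    # modality of that query's positive target; mismatched-modality candidates
    # are excluded from the softmax to avoid cross-modal interference.
    same_modality = query_target_modality.unsqueeze(1) == cand_modality.unsqueeze(0)
    sims = sims.masked_fill(~same_modality, float('-inf'))         # (N, M)
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(sims, targets)
```

Because mismatched-modality candidates are set to negative infinity before the softmax, they contribute nothing to the denominator, which is one simple way to keep, for example, video candidates from distorting the loss of a text-to-image query.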