UNITE: A Unified Multimodal Embedding Framework
Breaking through cross-modal interference: Kuaishou and Northeastern University jointly propose a unified multimodal framework that sweeps multimodal retrieval benchmarks
量子位 · 2025-06-08 03:40
Core Viewpoint
- The article discusses UNITE, a unified multimodal embedding framework that addresses the challenge of cross-modal interference in multimodal retrieval tasks [1][2][3].

Group 1: UNITE Framework Overview
- UNITE aims to produce a single unified embedding that can represent text, images, videos, and their combinations [3].
- The framework redefines the paradigm of unified multimodal representation learning through contrastive learning mechanisms [4].
- UNITE achieved the best performance in fine-grained retrieval and instruction-based retrieval evaluations [5][18].

Group 2: Modal-Aware Masked Contrastive Learning (MAMCL)
- MAMCL is introduced to alleviate cross-modal interference by ensuring that only samples of consistent modality are contrasted during training [8][11].
- Its core idea is a modal mask constraint: each query is compared only against negative samples that match its target modality (a minimal loss sketch appears after Group 5 below) [11][15].
- This prevents the model from learning spurious cross-modality similarities, thereby reducing semantic distortion [10][12].

Group 3: Training Strategy
- UNITE employs a two-phase training strategy: retrieval adaptation followed by instruction fine-tuning (sketched in code after Group 5 below) [17].
- The retrieval adaptation phase builds the model's basic retrieval capability on broad multimodal data, while the instruction fine-tuning phase targets complex multimodal instruction tasks [17].
- This strategy significantly improves the model's instruction following and cross-task generalization [17].

Group 4: Performance Metrics
- UNITE outperformed other models across benchmarks, including image-text and video-text retrieval tasks [20][21].
- On the MMEB Benchmark, UNITE 7B achieved the best overall score of 70.3, surpassing larger models such as mmE5 11B and IDMR 26B [25].
- The model demonstrated strong generalization across standard cross-modal retrieval tasks [26].

Group 5: Key Findings
- Video-text data exhibits a core capability for unifying modalities, which drives superior performance on video retrieval tasks [29].
- Instruction-based tasks rely more heavily on text-dominant data, underscoring the importance of text-text and text-image data for language understanding and logical reasoning [30].
- Incorporating fine-grained video-text samples during the retrieval adaptation phase significantly improves overall performance [30].
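To make the MAMCL idea in Group 2 concrete, the following is a minimal PyTorch-style sketch of a modality-masked InfoNCE loss. The tensor layout, modality-id encoding, and function names are illustrative assumptions; the article does not publish UNITE's implementation.

```python
# A minimal sketch of modal-aware masked contrastive learning (MAMCL),
# assuming an InfoNCE-style loss; the masking rule follows the article's
# description, but all names and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F

def mamcl_loss(query_emb, cand_emb, cand_modality, target_modality, temperature=0.05):
    """Contrastive loss where each query competes only against
    candidates whose modality matches the query's target modality.

    query_emb:       (B, D) query embeddings
    cand_emb:        (B, D) candidate embeddings; cand_emb[i] is the
                     positive for query_emb[i]
    cand_modality:   (B,)   integer modality id of each candidate
                     (e.g. 0=text, 1=image, 2=video, 3=fused)
    target_modality: (B,)   modality id each query is meant to retrieve
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    logits = q @ c.T / temperature                      # (B, B) similarity matrix

    # Modal mask: keep candidate j for query i only if j's modality matches
    # query i's target modality; mismatched candidates are excluded from the
    # softmax so they cannot act as wrong-modality negatives.
    mask = cand_modality.unsqueeze(0) == target_modality.unsqueeze(1)   # (B, B)
    mask |= torch.eye(len(q), dtype=torch.bool, device=q.device)        # always keep the positive
    logits = logits.masked_fill(~mask, float("-inf"))

    labels = torch.arange(len(q), device=q.device)      # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

For example, in a batch where query 0 targets video and query 1 targets text, the mask removes every text candidate from query 0's softmax, so a semantically similar caption can never be pushed away as a negative for a video query — which is precisely the cross-modal interference the article says MAMCL suppresses.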
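The two-phase strategy in Group 3 can be read as a simple curriculum. The sketch below assumes a standard PyTorch fine-tuning loop with the `mamcl_loss` defined above; the `model.encode()` API, batch keys, optimizer choice, and learning rate are placeholders, not the paper's recipe.

```python
# A minimal sketch of the two-phase schedule described above; everything
# apart from the phase ordering is an illustrative assumption.
import torch

def run_phase(model, optimizer, loader, loss_fn):
    for batch in loader:
        q = model.encode(batch["query"])       # hypothetical encode() API
        c = model.encode(batch["candidate"])
        loss = loss_fn(q, c, batch["cand_modality"], batch["target_modality"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_unite(model, retrieval_loader, instruction_loader, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # Phase 1: retrieval adaptation -- broad text/image/video pairs to build
    # base retrieval capability (the article notes fine-grained video-text
    # samples matter most in this phase).
    run_phase(model, optimizer, retrieval_loader, mamcl_loss)
    # Phase 2: instruction fine-tuning -- text-dominant, instruction-style
    # tasks to improve instruction following and cross-task generalization.
    run_phase(model, optimizer, instruction_loader, mamcl_loss)
```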