Multimodal Retrieval
CPU Series Research - Industry Expert Perspective - CPU Industry Opportunities in the Agent AI Era - Expert from a Major Internet Company
2026-01-26 02:49
Summary of Key Points from the Conference Call

Industry Overview
- The conference call focuses on the **CPU industry** and its evolving dynamics in the **Agent AI era**. Demand for high-performance, multi-core CPUs is rising sharply as AI applications such as PPT generation and AI programming require substantial computational resources [1][4].

Core Insights and Arguments
- **Increased Demand for CPUs**: The rise of Agent applications has driven a notable increase in CPU demand, particularly in tasks like PPT generation, where many web pages must be processed simultaneously, consuming up to 100 physical cores for paid users [3][4].
- **Resource Allocation**: Major companies are not merely adding CPUs; they are optimizing resource allocation by building specialized CPU clusters to handle the growing computational demands efficiently [8].
- **Types of CPU Clusters**: Agent services are evolving toward three types of computational resource pools: GPU clusters, working CPU clusters, and scheduling CPU clusters, each serving a distinct function [10].
- **Investment in Intel**: NVIDIA's investment in Intel aims to enhance server architecture to improve GPU utilization, increasing demand for high-performance CPUs, particularly for scheduling tasks [13].
- **Price Increases**: Current CPU price hikes stem from the limited production capacity of high-performance CPUs at Intel and AMD, coupled with increased demand driven by AI applications [14].

Additional Important Content
- **Low Activity in Domestic Applications**: Domestic assistant applications such as PPT generation currently see low user activity due to insufficient GPU and CPU resources, which forces limits on free-user access [6][7].
- **User Willingness to Pay**: Users pay for high-performance computing primarily for speed and efficiency, which necessitates building large working CPU clusters [11].
- **CPU and GPU Coordination**: In cross-application task scenarios, CPU demand is higher because CPUs manage backend operations while GPUs handle simpler tasks [18].
- **Trends in Task Allocation**: Traditional CPU tasks are shifting to GPUs, especially database queries and multi-modal retrieval, which could increase GPU demand [23][24].
- **Future Demand for CPUs**: Even as more tasks become GPU-optimized, these transitions will create new CPU demands, indicating a continuing need for CPU resources [24][25].

Conclusion
- The CPU industry is undergoing significant change driven by the rise of AI applications, leading to increased demand for high-performance CPUs and a shift in how computational resources are managed and allocated. The interplay between CPU and GPU resources will continue to evolve as new applications and technologies emerge.
Teaching Large Models to Play "Spot the Difference" in High Dimensions: New China Unicom Research Tackles Long-Text Image Retrieval Pain Points | AAAI 2026 Oral
量子位 · 2025-12-01 05:45
Core Insights
- The article presents HiMo-CLIP, a new state-of-the-art (SOTA) model for long-text image retrieval developed by the China Unicom Data Science and AI Research Institute. It addresses limitations of existing models such as CLIP by effectively capturing semantic differences in context [2][4].

Group 1: Model Limitations
- Existing models, including Long-CLIP, struggle with long text descriptions: alignment scores often drop as the text becomes more detailed, indicating a failure to process the hierarchical structure of language [6][9].
- This phenomenon, where longer descriptions yield lower alignment scores, highlights the inability of current models to distinguish core semantics from supporting detail [6][9].

Group 2: HiMo-CLIP Framework
- HiMo-CLIP introduces a plug-and-play representation framework with two core components: Hierarchical Decomposition (HiDe) and a Monotonicity-aware Contrastive Loss (MoLo) [10][12].
- HiDe dynamically extracts semantic components using in-batch PCA, while MoLo enforces alignment between the full text and its semantic components, ensuring monotonicity (see the sketch following this summary) [12][17].

Group 3: Performance and Efficiency
- HiMo-CLIP shows significant advantages in both long- and short-text retrieval, outperforming models trained on far larger datasets and reaching SOTA with only 1 million training samples [17][20].
- Its ability to extract distinctive features from complex scenes lets it sustain high performance across a range of retrieval benchmarks [18][22].

Group 4: Evaluation Metrics
- The research team constructed the HiMo-Docci dataset and introduced the HiMo@K metric to quantify a model's grasp of hierarchical structure; HiMo-CLIP achieves a monotonicity correlation coefficient of 0.88, surpassing comparative methods [22][25].
- As text descriptions become more complete, HiMo-CLIP's scores rise consistently, while other models fluctuate significantly [25][26].
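The HiDe/MoLo pairing described above maps naturally onto a short PyTorch sketch: PCA within a batch yields rank-1 semantic components, and a hinge penalty enforces that image-text similarity does not decrease as components accumulate. This is a minimal illustration under assumptions; the function names (`hide_components`, `molo_loss`), the component count `k`, and the margin are hypothetical, and the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def hide_components(text_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """HiDe-style step (hypothetical): decompose a batch of text embeddings
    into k rank-1 semantic components via in-batch PCA."""
    # text_emb: (B, D) embeddings from the CLIP text tower
    _, _, v = torch.pca_lowrank(text_emb, q=k)   # v: (D, k) principal axes
    coeffs = text_emb @ v                        # (B, k) projection coefficients
    return coeffs.unsqueeze(-1) * v.T            # (B, k, D) per-sample components

def molo_loss(img_emb: torch.Tensor, text_emb: torch.Tensor,
              k: int = 4, margin: float = 0.0) -> torch.Tensor:
    """MoLo-style penalty (hypothetical): image-text similarity should not
    decrease as progressively more semantic components are accumulated."""
    comps = hide_components(text_emb, k)                  # (B, k, D)
    partial = F.normalize(comps.cumsum(dim=1), dim=-1)    # partial reconstructions
    img = F.normalize(img_emb, dim=-1)
    sims = torch.einsum("bd,bkd->bk", img, partial)       # (B, k) similarity ladder
    # Hinge on adjacent levels: similarity with j+1 components
    # should be at least the similarity with j components.
    violations = F.relu(sims[:, :-1] - sims[:, 1:] + margin)
    return violations.mean()
```

In practice such a term would be added to the standard CLIP contrastive loss; the hinge goes to zero exactly when the similarity sequence is monotonically non-decreasing, which is the property HiMo@K is said to measure.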
A New Breakthrough in Multimodal Retrieval: Soft Labels Break Traditional Rigid Mapping Constraints, Comprehensively Surpassing CLIP | AAAI 2026 Oral
量子位 · 2025-11-15 05:00
Core Insights
- The article introduces UniME-V2, a unified multimodal embedding model that addresses limitations of existing negative-sample mining methods and enhances semantic understanding through a mechanism called "MLLM-as-a-Judge" [3][9].

Group 1: Model Overview
- UniME-V2 improves training by constructing a pool of potential hard negatives via global retrieval and then using an MLLM to evaluate query-candidate pairs and generate soft semantic matching scores [3][4][9].
- The model aligns its similarity matrix with these soft semantic matching scores, significantly sharpening its ability to discern semantic differences among candidates [5][6].

Group 2: Methodology
- The methodology has two steps: first, construct a set of potential hard negatives; second, use the MLLM to assess semantic alignment and generate matching scores (see the soft-label loss sketch following this summary) [13][14][15].
- A re-ranking model, UniME-V2-Reranker, is trained on the mined hard negatives using a joint pairwise and listwise optimization strategy to further boost performance [6][30].

Group 3: Performance Evaluation
- UniME-V2 delivers significant gains over baseline models, including 3.5% and 2.2% improvements over VLM2Vec on the Qwen2-VL-2B and 7B backbones, respectively [36][37].
- It remains robust on out-of-distribution datasets, scoring 66.7, indicating strong transferability [38].

Group 4: Cross-Modal Retrieval
- In zero-shot cross-modal retrieval, UniME-V2 outperforms prior models, with a 2.2%-9.7% improvement in image-to-text retrieval and significant gains on long-description tasks [41][42].
- Its ability to separate hard negatives stands out, with improvements of 5.3%, 6.0%, and 4.5% on Qwen2-VL-2B, and of 9.0%, 9.2%, and 9.2% when scaled to 7B [47][48].

Group 5: Re-ranking Performance
- UniME-V2-Reranker surpasses LamRA across four downstream tasks while using only half the data [52].
- Its advantage on complex-understanding retrieval tasks is attributed to the extraction of diverse, high-quality hard samples, which strengthens its discriminative capability [53].
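The training signal described above, replacing a one-hot contrastive target with judge-generated soft scores, can be illustrated in a few lines of PyTorch. This is a minimal sketch under assumptions: the function name, the two temperatures, and the KL-based alignment are illustrative choices, not confirmed details of the paper.

```python
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(q_emb: torch.Tensor,
                              cand_emb: torch.Tensor,
                              judge_scores: torch.Tensor,
                              tau_model: float = 0.05,
                              tau_judge: float = 0.1) -> torch.Tensor:
    """Hypothetical soft-label alignment: instead of a one-hot InfoNCE
    target, each query's similarity row is matched (via KL divergence)
    to the MLLM judge's soft semantic matching scores.

    q_emb:        (B, D)    query embeddings
    cand_emb:     (B, N, D) candidates per query, incl. mined hard negatives
    judge_scores: (B, N)    MLLM-as-a-Judge match scores for each pair
    """
    q = F.normalize(q_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    sims = torch.einsum("bd,bnd->bn", q, c)               # (B, N) cosine similarities
    log_p = F.log_softmax(sims / tau_model, dim=-1)       # model's distribution
    target = F.softmax(judge_scores / tau_judge, dim=-1)  # judge's soft distribution
    return F.kl_div(log_p, target, reduction="batchmean")
```

The soft target is what lets a semantically close "negative" contribute a graded penalty rather than being pushed away as hard as a random mismatch, which is the stated advantage over rigid one-hot mapping.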
Breaking Cross-Modal Interference: Kuaishou and Northeastern University Jointly Propose a Unified Multimodal Framework That Sweeps Multimodal Retrieval Benchmarks
量子位 · 2025-06-08 03:40
Core Viewpoint
- The article covers UNITE, a unified multimodal embedding framework developed jointly by Kuaishou and Northeastern University that tackles cross-modal interference in multimodal retrieval tasks [1][2][3].

Group 1: UNITE Framework Overview
- UNITE aims at a single unified embedding that handles text, images, videos, and their combinations [3].
- The framework rethinks unified multimodal representation learning through contrastive learning mechanisms [4].
- UNITE achieves the best results in fine-grained retrieval and instruction-retrieval evaluations [5][18].

Group 2: Modal-Aware Masked Contrastive Learning (MAMCL)
- MAMCL alleviates cross-modal interference by ensuring that only samples with consistent modalities are contrasted during training [8][11].
- Its core idea is a modal mask constraint: comparisons are allowed only among negative samples whose modality matches the current query's target modality (see the sketch following this summary) [11][15].
- This prevents the model from learning spurious cross-modality similarities and thereby reduces semantic distortion [10][12].

Group 3: Training Strategy
- UNITE uses a two-phase training strategy: retrieval adaptation followed by instruction fine-tuning [17].
- Retrieval adaptation builds the model's basic retrieval capability on multimodal data, while instruction fine-tuning targets complex multimodal instruction tasks [17].
- This strategy significantly improves instruction following and cross-task generalization [17].

Group 4: Performance Metrics
- UNITE outperforms other models across benchmarks, including image-text and video-text retrieval tasks [20][21].
- On the MMEB Benchmark, UNITE 7B reaches a best score of 70.3, surpassing larger models such as mmE5 11B and IDMR 26B [25].
- The model also generalizes strongly on standard cross-modal retrieval tasks [26].

Group 5: Key Findings
- Video-text data exhibits a "unified modality" core capability, yielding superior performance on video retrieval tasks [29].
- Instruction-based tasks lean more on text-dominant data, underscoring the importance of text-text and text-image data for language understanding and logical reasoning [30].
- Adding fine-grained video-text samples during the retrieval adaptation phase significantly improves overall performance [30].
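The modal mask constraint lends itself to a compact PyTorch sketch: a standard in-batch InfoNCE in which candidates whose modality differs from the query's target modality are removed from the softmax denominator. The function name, temperature, and integer modality encoding below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mamcl_loss(q_emb: torch.Tensor,
               t_emb: torch.Tensor,
               target_modality: torch.Tensor,
               tau: float = 0.05) -> torch.Tensor:
    """Hypothetical MAMCL-style loss: in-batch InfoNCE where cross-modal
    negatives are masked so the model only contrasts within one modality type.

    q_emb:           (B, D) query embeddings
    t_emb:           (B, D) target embeddings (positives on the diagonal)
    target_modality: (B,)   modality id of each target (e.g. 0=text, 1=image, 2=video)
    """
    q = F.normalize(q_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    logits = q @ t.T / tau                                    # (B, B) similarity logits
    # Keep candidate j for query i only if j's modality matches the
    # modality of query i's own target (the diagonal is always kept).
    same_modal = target_modality.unsqueeze(0) == target_modality.unsqueeze(1)
    logits = logits.masked_fill(~same_modal, float("-inf"))   # drop cross-modal negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Masking with `-inf` zeroes those candidates out of the softmax, so a text query is never penalized for being "too similar" to a video negative, which is exactly the interference the summary says MAMCL is designed to remove.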