Multimodal Retrieval
CPU Series Research - Industry Expert Perspective - CPU Industry Opportunities in the Agent AI Era - Expert from a Major Internet Company
2026-01-26 02:49
Summary of Key Points from the Conference Call

Industry Overview
- The conference call focuses on the **CPU industry** and its evolving dynamics in the **Agent AI era**. Demand for high-performance, multi-core CPUs is rising sharply as AI applications such as PPT generation and AI programming require substantial computational resources [1][4].

Core Insights and Arguments
- **Increased Demand for CPUs**: The rise of Agent applications has driven a notable increase in CPU demand, particularly in tasks like PPT generation, where many web pages must be processed simultaneously, consuming up to 100 physical cores for paid users [3][4].
- **Resource Allocation**: Major companies are not merely adding CPUs; they are optimizing resource allocation by building specialized CPU clusters to handle the growing computational demands efficiently [8].
- **Types of CPU Clusters**: Agent services are evolving toward three types of computational resource pools: GPU clusters, working CPU clusters, and scheduling CPU clusters, each serving a distinct function [10].
- **Investment in Intel**: NVIDIA's investment in Intel aims to enhance server architecture to improve GPU utilization, increasing demand for high-performance CPUs, particularly for scheduling tasks [13].
- **Price Increases**: Current CPU price hikes stem from the limited production capacity of high-performance CPUs at Intel and AMD, coupled with increased demand driven by AI applications [14].

Additional Important Content
- **Low Activity in Domestic Applications**: Domestic assistant applications such as PPT generation currently see low user activity due to insufficient GPU and CPU resources, which forces limits on free-user access [6][7].
- **User Willingness to Pay**: Users pay for high-performance computing primarily for speed and efficiency, which necessitates building large working CPU clusters [11].
- **CPU and GPU Coordination**: In cross-application task scenarios, CPU demand is higher because CPUs manage backend operations while GPUs handle simpler tasks [18].
- **Trends in Task Allocation**: Traditional CPU tasks are shifting to GPUs, especially database queries and multi-modal retrieval, which could increase GPU demand [23][24].
- **Future Demand for CPUs**: Even as more tasks become GPU-optimized, these transitions will create new CPU demands, indicating a continuing need for CPU resources [24][25].

Conclusion
- The CPU industry is undergoing significant change driven by the rise of AI applications, leading to increased demand for high-performance CPUs and a shift in how computational resources are managed and allocated. The interplay between CPU and GPU resources will continue to evolve as new applications and technologies emerge.
Teaching Large Models to Play "Spot the Difference" in High Dimensions: New China Unicom Research Tackles Long-Text Image Retrieval Pain Points | AAAI 2026 Oral
量子位 · 2025-12-01 05:45
Core Insights
- The article presents HiMo-CLIP, a new state-of-the-art (SOTA) model for long-text image retrieval developed by the China Unicom Data Science and AI Research Institute. It addresses limitations of existing models such as CLIP by effectively capturing semantic differences in context [2][4].

Group 1: Model Limitations
- Existing models, including Long-CLIP, struggle with long text descriptions: alignment scores often drop as the text becomes more detailed, indicating a failure to process the hierarchical structure of language [6][9].
- This phenomenon, where longer descriptions yield lower alignment scores, highlights the inability of current models to distinguish core semantics from supporting detail [6][9].

Group 2: HiMo-CLIP Framework
- HiMo-CLIP introduces a plug-and-play representation framework with two core components: Hierarchical Decomposition (HiDe) and a Monotonicity-aware Contrastive Loss (MoLo) [10][12].
- HiDe dynamically extracts semantic components using in-batch PCA, while MoLo enforces alignment between the full text and its semantic components, ensuring monotonicity (see the sketch following this summary) [12][17].

Group 3: Performance and Efficiency
- HiMo-CLIP shows significant advantages in both long- and short-text retrieval, outperforming models trained on far larger datasets and reaching SOTA with only 1 million training samples [17][20].
- Its ability to extract distinctive features from complex scenes lets it sustain high performance across a range of retrieval benchmarks [18][22].

Group 4: Evaluation Metrics
- The research team constructed the HiMo-Docci dataset and introduced the HiMo@K metric to quantify a model's grasp of hierarchical structure; HiMo-CLIP achieves a monotonicity correlation coefficient of 0.88, surpassing comparative methods [22][25].
- As text descriptions become more complete, HiMo-CLIP's scores rise consistently, while other models fluctuate significantly [25][26].
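The HiDe/MoLo pairing described above maps naturally onto a short PyTorch sketch: PCA within a batch yields rank-1 semantic components, and a hinge penalty enforces that image-text similarity does not decrease as components accumulate. This is a minimal illustration under assumptions; the function names (`hide_components`, `molo_loss`), the component count `k`, and the margin are hypothetical, and the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def hide_components(text_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """HiDe-style step (hypothetical): decompose a batch of text embeddings
    into k rank-1 semantic components via in-batch PCA."""
    # text_emb: (B, D) embeddings from the CLIP text tower
    _, _, v = torch.pca_lowrank(text_emb, q=k)   # v: (D, k) principal axes
    coeffs = text_emb @ v                        # (B, k) projection coefficients
    return coeffs.unsqueeze(-1) * v.T            # (B, k, D) per-sample components

def molo_loss(img_emb: torch.Tensor, text_emb: torch.Tensor,
              k: int = 4, margin: float = 0.0) -> torch.Tensor:
    """MoLo-style penalty (hypothetical): image-text similarity should not
    decrease as progressively more semantic components are accumulated."""
    comps = hide_components(text_emb, k)                  # (B, k, D)
    partial = F.normalize(comps.cumsum(dim=1), dim=-1)    # partial reconstructions
    img = F.normalize(img_emb, dim=-1)
    sims = torch.einsum("bd,bkd->bk", img, partial)       # (B, k) similarity ladder
    # Hinge on adjacent levels: similarity with j+1 components
    # should be at least the similarity with j components.
    violations = F.relu(sims[:, :-1] - sims[:, 1:] + margin)
    return violations.mean()
```

In practice such a term would be added to the standard CLIP contrastive loss; the hinge goes to zero exactly when the similarity sequence is monotonically non-decreasing, which is the property HiMo@K is said to measure.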
A New Breakthrough in Multimodal Retrieval: Soft Labels Break Traditional Rigid Mapping Constraints, Comprehensively Surpassing CLIP | AAAI 2026 Oral
量子位 · 2025-11-15 05:00
Core Insights
- The article introduces UniME-V2, a unified multimodal embedding model that addresses limitations of existing negative-sample mining methods and enhances semantic understanding through a mechanism called "MLLM-as-a-Judge" [3][9].

Group 1: Model Overview
- UniME-V2 improves training by constructing a pool of potential hard negatives via global retrieval and then using an MLLM to evaluate query-candidate pairs and generate soft semantic matching scores [3][4][9].
- The model aligns its similarity matrix with these soft semantic matching scores, significantly sharpening its ability to discern semantic differences among candidates [5][6].

Group 2: Methodology
- The methodology has two steps: first, construct a set of potential hard negatives; second, use the MLLM to assess semantic alignment and generate matching scores (see the soft-label loss sketch following this summary) [13][14][15].
- A re-ranking model, UniME-V2-Reranker, is trained on the mined hard negatives using a joint pairwise and listwise optimization strategy to further boost performance [6][30].

Group 3: Performance Evaluation
- UniME-V2 delivers significant gains over baseline models, including 3.5% and 2.2% improvements over VLM2Vec on the Qwen2-VL-2B and 7B backbones, respectively [36][37].
- It remains robust on out-of-distribution datasets, scoring 66.7, indicating strong transferability [38].

Group 4: Cross-Modal Retrieval
- In zero-shot cross-modal retrieval, UniME-V2 outperforms prior models, with a 2.2%-9.7% improvement in image-to-text retrieval and significant gains on long-description tasks [41][42].
- Its ability to separate hard negatives stands out, with improvements of 5.3%, 6.0%, and 4.5% on Qwen2-VL-2B, and of 9.0%, 9.2%, and 9.2% when scaled to 7B [47][48].

Group 5: Re-ranking Performance
- UniME-V2-Reranker surpasses LamRA across four downstream tasks while using only half the data [52].
- Its advantage on complex-understanding retrieval tasks is attributed to the extraction of diverse, high-quality hard samples, which strengthens its discriminative capability [53].
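The training signal described above, replacing a one-hot contrastive target with judge-generated soft scores, can be illustrated in a few lines of PyTorch. This is a minimal sketch under assumptions: the function name, the two temperatures, and the KL-based alignment are illustrative choices, not confirmed details of the paper.

```python
import torch
import torch.nn.functional as F

def soft_label_alignment_loss(q_emb: torch.Tensor,
                              cand_emb: torch.Tensor,
                              judge_scores: torch.Tensor,
                              tau_model: float = 0.05,
                              tau_judge: float = 0.1) -> torch.Tensor:
    """Hypothetical soft-label alignment: instead of a one-hot InfoNCE
    target, each query's similarity row is matched (via KL divergence)
    to the MLLM judge's soft semantic matching scores.

    q_emb:        (B, D)    query embeddings
    cand_emb:     (B, N, D) candidates per query, incl. mined hard negatives
    judge_scores: (B, N)    MLLM-as-a-Judge match scores for each pair
    """
    q = F.normalize(q_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    sims = torch.einsum("bd,bnd->bn", q, c)               # (B, N) cosine similarities
    log_p = F.log_softmax(sims / tau_model, dim=-1)       # model's distribution
    target = F.softmax(judge_scores / tau_judge, dim=-1)  # judge's soft distribution
    return F.kl_div(log_p, target, reduction="batchmean")
```

The soft target is what lets a semantically close "negative" contribute a graded penalty rather than being pushed away as hard as a random mismatch, which is the stated advantage over rigid one-hot mapping.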
Breaking Cross-Modal Interference: Kuaishou and Northeastern University Jointly Propose a Unified Multimodal Framework That Sweeps Multimodal Retrieval Benchmarks
量子位 · 2025-06-08 03:40
Core Viewpoint
- The article covers UNITE, a unified multimodal embedding framework developed jointly by Kuaishou and Northeastern University that tackles cross-modal interference in multimodal retrieval tasks [1][2][3].

Group 1: UNITE Framework Overview
- UNITE aims at a single unified embedding that handles text, images, videos, and their combinations [3].
- The framework rethinks unified multimodal representation learning through contrastive learning mechanisms [4].
- UNITE achieves the best results in fine-grained retrieval and instruction-retrieval evaluations [5][18].

Group 2: Modal-Aware Masked Contrastive Learning (MAMCL)
- MAMCL alleviates cross-modal interference by ensuring that only samples with consistent modalities are contrasted during training [8][11].
- Its core idea is a modal mask constraint: comparisons are allowed only among negative samples whose modality matches the current query's target modality (see the sketch following this summary) [11][15].
- This prevents the model from learning spurious cross-modality similarities and thereby reduces semantic distortion [10][12].

Group 3: Training Strategy
- UNITE uses a two-phase training strategy: retrieval adaptation followed by instruction fine-tuning [17].
- Retrieval adaptation builds the model's basic retrieval capability on multimodal data, while instruction fine-tuning targets complex multimodal instruction tasks [17].
- This strategy significantly improves instruction following and cross-task generalization [17].

Group 4: Performance Metrics
- UNITE outperforms other models across benchmarks, including image-text and video-text retrieval tasks [20][21].
- On the MMEB Benchmark, UNITE 7B reaches a best score of 70.3, surpassing larger models such as mmE5 11B and IDMR 26B [25].
- The model also generalizes strongly on standard cross-modal retrieval tasks [26].

Group 5: Key Findings
- Video-text data exhibits a "unified modality" core capability, yielding superior performance on video retrieval tasks [29].
- Instruction-based tasks lean more on text-dominant data, underscoring the importance of text-text and text-image data for language understanding and logical reasoning [30].
- Adding fine-grained video-text samples during the retrieval adaptation phase significantly improves overall performance [30].
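The modal mask constraint lends itself to a compact PyTorch sketch: a standard in-batch InfoNCE in which candidates whose modality differs from the query's target modality are removed from the softmax denominator. The function name, temperature, and integer modality encoding below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mamcl_loss(q_emb: torch.Tensor,
               t_emb: torch.Tensor,
               target_modality: torch.Tensor,
               tau: float = 0.05) -> torch.Tensor:
    """Hypothetical MAMCL-style loss: in-batch InfoNCE where cross-modal
    negatives are masked so the model only contrasts within one modality type.

    q_emb:           (B, D) query embeddings
    t_emb:           (B, D) target embeddings (positives on the diagonal)
    target_modality: (B,)   modality id of each target (e.g. 0=text, 1=image, 2=video)
    """
    q = F.normalize(q_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    logits = q @ t.T / tau                                    # (B, B) similarity logits
    # Keep candidate j for query i only if j's modality matches the
    # modality of query i's own target (the diagonal is always kept).
    same_modal = target_modality.unsqueeze(0) == target_modality.unsqueeze(1)
    logits = logits.masked_fill(~same_modal, float("-inf"))   # drop cross-modal negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Masking with `-inf` zeroes those candidates out of the softmax, so a text query is never penalized for being "too similar" to a video negative, which is exactly the interference the summary says MAMCL is designed to remove.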