AAAI 2026 Oral | Kuaishou Proposes CroPS, a New "Retrieval Data Engine" That Breaks the Search Information Cocoon

Core Insights
The article introduces CroPS (Cross-Perspective Positive Samples), a new retrieval data engine from Kuaishou's search team. It aims to improve short-video search by addressing the limitations of traditional self-reinforcing training paradigms that rely heavily on historical click data [2][10].

Group 1: Problem Identification
- Vector retrieval models in industry often depend on historical user interaction data, which creates a self-reinforcing cycle that narrows search results and limits exposure to diverse content [6].
- This mechanism produces significant sample bias: high-quality long-tail content is systematically excluded from the positive samples, so the model's retrieval scope becomes conservative and repetitive [6][7].
- Users see little novelty in search results, which makes exploratory needs hard to satisfy [7].

Group 2: CroPS Framework
- CroPS introduces a multi-dimensional positive sample enhancement engine that draws on user query behavior, recommendation system feedback, and knowledge from large language models (LLMs) to enrich the semantic space [11].
- The framework captures the continuity of user intent by analyzing query rewrites, which lets the system correct semantic bias by incorporating successful interactions from related queries [12].
- It breaks down the barrier between the search and recommendation systems, so the retrieval model can leverage diverse content that users did not actively search for [15].
- When existing content does not cover a query, CroPS uses LLMs to generate high-quality synthetic samples, effectively expanding the model's knowledge base [16][17].

Group 3: Hierarchical Labeling and Loss Function
- The Hierarchical Label Assignment (HLA) strategy addresses the differences in reliability among positive samples from different sources, allowing the model to prioritize the more relevant samples during training [19].
- The H-InfoNCE loss function improves the model's ability to distinguish high-priority from low-priority samples, aligning the learning objective with the hierarchical logic of HLA [23][28].

Group 4: Experimental Results
- In offline experiments, CroPS improved recall by 9.5% on the user click dataset and by 7.1% on the user query change dataset compared with the strongest baseline [30].
- In large-scale A/B testing, CroPS led to significant business growth, with a 40.9% increase in ratio rank and a 44.3% increase in ratio show for dense models [31].
- The click-through rate (CTR) increased by 0.869% and the long playback rate (LPR) rose by 0.483%, indicating improved content relevance and quality [36].

Group 5: Future Directions
- The Kuaishou search team plans to explore integrating CroPS with generative retrieval methods to further leverage the potential of large language models in the search process [34].
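
The article does not give the HLA label values or the H-InfoNCE formula, so the following is only a minimal sketch of the idea described in Group 3, under stated assumptions. The source names, the priority ordering in `SOURCE_LEVEL`, the helper `assign_level`, and the loss `hierarchical_info_nce` are all hypothetical and are not taken from the paper. The assumed reading is: each positive sample gets a priority level by source, and each positive is contrasted only against candidates with a strictly lower level, so that higher-priority positives are pushed above lower-priority ones and above negatives.

```python
import math

# Hypothetical priority levels per sample source (assumption: the article
# only says that sources differ in reliability, not how they are ranked).
SOURCE_LEVEL = {
    "click": 3,           # clicked under the original query
    "query_rewrite": 2,   # clicked under a rewritten query
    "recommendation": 2,  # positive feedback from the recommendation feed
    "llm_synthetic": 1,   # generated by an LLM
    "negative": 0,        # sampled negative
}


def assign_level(source):
    """Map a sample source to its (assumed) hierarchical label."""
    return SOURCE_LEVEL[source]


def hierarchical_info_nce(scores, levels, temperature=0.1):
    """InfoNCE-style loss with level-aware negatives (illustrative only).

    scores: query-item similarity for each candidate.
    levels: hierarchical label for each candidate (0 means negative).
    Each candidate with level > 0 is contrasted against every candidate
    whose level is strictly lower; the per-positive losses are averaged.
    """
    losses = []
    for s_i, l_i in zip(scores, levels):
        if l_i == 0:
            continue  # negatives are never used as the positive term
        lower = [s for s, l in zip(scores, levels) if l < l_i]
        if not lower:
            continue  # nothing to contrast against
        logits = [s_i / temperature] + [s / temperature for s in lower]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    sources = ["click", "query_rewrite", "llm_synthetic", "negative"]
    levels = [assign_level(s) for s in sources]
    aligned = hierarchical_info_nce([0.9, 0.6, 0.3, 0.0], levels)
    reversed_order = hierarchical_info_nce([0.0, 0.3, 0.6, 0.9], levels)
    print(f"scores follow the hierarchy:  {aligned:.4f}")
    print(f"scores invert the hierarchy: {reversed_order:.4f}")
```

In this sketch the loss is smaller when the similarity scores follow the assumed hierarchy than when they invert it, which is the behavior the article attributes to aligning the loss with HLA. A production version would operate on batched embeddings in a tensor library rather than on Python lists.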