Vector Retrieval

What Is an Inverted Index?
Sou Hu Cai Jing· 2025-09-04 04:14
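As a new-generation real-time analytical database, StarRocks shows clear strengths in inverted index technology. The system natively supports full-text search and uses an optimized inverted index structure to query text data efficiently. In vector retrieval scenarios, StarRocks can seamlessly combine traditional inverted indexes with vector similarity search, providing a unified data foundation for RAG applications.
An inverted index is an index structure that maps each term to the list of documents containing that term, the exact opposite of a traditional forward index. A forward index looks up a document's content by its document ID, while an inverted index uses a keyword to quickly locate every document that contains it. The design stems from the practical need to find records by attribute value, and it is especially well suited to full-text search, search engines, and large-scale data analysis.
Building an inverted index involves three core steps: text preprocessing, dictionary generation, and creation of the postings lists. Take three documents as an example: Doc1 contains "quick brown fox", Doc2 contains "lazy dog", and Doc3 contains "quick brown dog". After tokenization, the system builds a document list for each term, such as "quick" mapping to [Doc1, Doc3] and "dog" mapping to [Doc2, Doc3], which enables fast retrieval.
Inverted index technology is widely used across many data processing fields and has proven highly practical. In full-text ...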
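The three-document example above can be sketched in a few lines of Python. This is a minimal illustration, not StarRocks code: the function names `build_inverted_index` and `search_all` are hypothetical, and tokenization is assumed to be simple lowercasing plus whitespace splitting, which stands in for the preprocessing step.

```python
from collections import defaultdict

# The three example documents from the summary.
docs = {
    "Doc1": "quick brown fox",
    "Doc2": "lazy dog",
    "Doc3": "quick brown dog",
}

def build_inverted_index(documents):
    """Map each term to the list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Preprocessing (assumed): lowercase and split on whitespace.
        # set() ensures a document is listed once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

def search_all(index, *terms):
    """Return the documents that contain every given term (AND query)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_inverted_index(docs)
print(index["quick"])                     # ['Doc1', 'Doc3']
print(index["dog"])                       # ['Doc2', 'Doc3']
print(search_all(index, "quick", "dog"))  # ['Doc3']
```

A lookup touches only the postings lists of the query terms, so the cost depends on how many documents contain those terms rather than on the size of the whole collection.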
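Change Just 2 Lines of Code and RAG Efficiency Jumps 30%! Works Across Many Tasks and Scales to Applications with Tens of Billions of Data Points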
量子位· 2025-06-20 10:31
Core Viewpoint
- The article discusses a new open-source method called PSP (Proximity graph with Spherical Pathway) developed by a team from Zhejiang University, which significantly improves the efficiency of RAG vector retrieval by 30% with just two lines of code. This method is applicable to various tasks such as text-to-text, image-to-image, text-to-image, and recommendation system recall, and is scalable for large-scale applications involving billions of data points [1].

Group 1: Vector Retrieval Methodology
- Traditional vector retrieval methods are primarily based on Euclidean distance, focusing on "who is closest," while AI often requires comparison of "semantic relevance," which is better represented by maximum inner product [2].
- Previous inner product retrieval methods failed to satisfy the mathematical triangle inequality, leading to inefficiencies [3].
- The PSP method allows for minor modifications to existing graph structures to find optimal solutions for maximum inner product retrieval [4].

Group 2: Technical Innovations
- PSP incorporates an early stopping strategy to determine when to end the search, thus conserving computational resources and speeding up the search process [5].
- The combination of vector models and vector databases is crucial for maximizing the potential of this technology, with the choice of "metric space" being a key factor [6].
- Many existing graph-based vector retrieval algorithms, such as HNSW and NSG, are designed for Euclidean space, which can lead to "metric mismatch" issues in scenarios better suited for maximum inner product retrieval [7].

Group 3: Algorithmic Insights
- The research identifies two paradigms in maximum inner product retrieval: converting maximum inner product to minimum Euclidean distance, which often results in information loss, and directly searching in inner product space, which lacks effective pruning methods [8].
- The challenge in direct inner product space retrieval lies in its failure to meet the criteria of a strict "metric space," particularly the absence of the triangle inequality [9].
- The PSP team demonstrated that a greedy algorithm can find the global optimal maximum inner product solution on a graph index designed for Euclidean distance [10].

Group 4: Practical Applications and Performance
- The PSP method modifies the candidate point queue settings and distance metrics to optimize search behavior and avoid redundant calculations [13].
- The search behavior for maximum inner product differs significantly from that in Euclidean space, often requiring a search pattern that expands from the inside out [16].
- The team conducted extensive tests on eight large-scale, high-dimensional datasets, demonstrating that PSP outperforms existing state-of-the-art methods in terms of stability and efficiency [21][23].

Group 5: Scalability and Generalization
- The datasets used for testing included various modalities such as text-to-text, image-to-image, and recommendation system recall, showcasing the strong generalization capabilities of PSP [25].
- PSP exhibits excellent scalability, with time complexity showing logarithmic growth rates, making it suitable for efficient retrieval in datasets containing billions to hundreds of billions of points [26].
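The "metric mismatch" between Euclidean distance and maximum inner product, and the first paradigm of converting one into the other, can be illustrated with a small sketch. This is not the PSP method. The data points are hypothetical, and the conversion shown is one well-known reduction (appending an extra coordinate so that all points share the same norm), assumed here to be representative of the conversion paradigm the summary mentions.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: "b" has a large norm, "a" sits right next to the query.
points = {"a": (1.0, 0.0), "b": (3.0, 0.5), "c": (0.0, 1.0)}
query = (1.0, 0.1)

# Metric mismatch: the two criteria pick different winners.
nearest = min(points, key=lambda k: euclidean(points[k], query))
best_ip = max(points, key=lambda k: inner_product(points[k], query))

# Reduction (assumed variant): append sqrt(M^2 - ||x||^2) to each point and 0
# to the query, where M is the largest norm in the data. Then
# ||q' - x'||^2 = ||q||^2 + M^2 - 2 * <q, x>, so the Euclidean nearest
# neighbor in the augmented space is the maximum inner product point.
max_norm_sq = max(inner_product(p, p) for p in points.values())

def augment_point(p):
    slack = max(0.0, max_norm_sq - inner_product(p, p))  # guard against rounding
    return p + (math.sqrt(slack),)

augmented = {k: augment_point(p) for k, p in points.items()}
augmented_query = query + (0.0,)
nearest_aug = min(augmented, key=lambda k: euclidean(augmented[k], augmented_query))

print(nearest, best_ip, nearest_aug)  # a b b
```

Here the Euclidean nearest neighbor is "a" while the maximum inner product point is "b", so an index tuned purely for Euclidean distance would rank the wrong item first. The augmented search recovers "b", but it reshapes the data geometry to do so.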