Vector Retrieval
Baotong Lu of Microsoft Research: Reshaping Model Attention with Vector Retrieval——Attention
36Ke· 2025-11-17 08:02
Core Insights
- The article discusses the limitations of long-context reasoning in large language models (LLMs), namely the quadratic complexity of self-attention and the significant memory required for key-value (KV) caching [1][5]
- It introduces a new mechanism called RetrievalAttention, which accelerates long-context LLM inference through a dynamic sparse attention approach that requires no retraining [1][8]

Group 1: RetrievalAttention Mechanism
- RetrievalAttention posits that each query only needs to interact with a small subset of keys, making most of the attention computation redundant [3][7]
- The approach offloads most KV vectors from the GPU to the CPU and uses approximate nearest neighbor (ANN) search to identify the most relevant keys for each query [3][7]
- This yields significant memory savings: an 8B model needs only about 1/10 of the original KV-cache memory while maintaining accuracy [22]

Group 2: Performance Metrics
- Empirical tests on an RTX 4090 (24GB) show that the 8B model can stably generate with a 128K context at approximately 0.188 seconds per token, achieving nearly the same precision as full attention [5][6]
- The follow-up work, RetroInfer, demonstrated 4.5x higher decoding throughput on A100 GPUs compared to full attention, and 10.5x higher throughput at 1M-token contexts compared to other sparse-attention systems [5][22]

Group 3: System Architecture
- RetrievalAttention uses a dual-path attention design: the GPU retains a small amount of "predictable" local KV cache, while the CPU side dynamically retrieves from a large-scale KV store [7][8] (a minimal sketch of this dual-path idea follows this summary)
- This design reduces both memory usage and inference latency, enabling efficient long-context reasoning without retraining the model [8][22]

Group 4: Theoretical and Practical Contributions
- The work offers a new theoretical perspective by framing the attention mechanism as a retrieval system, allowing more precise identification of important contextual information [23][25]
- It also emphasizes system-level optimizations, turning the traditional linear KV cache into a dynamically allocated structure that improves efficiency in large-scale inference scenarios [23][25]

Group 5: Future Directions
- Future research may establish a more rigorous theoretical framework for the error bounds of RetrievalAttention and explore integrating dynamic learning mechanisms with system-level optimizations [26][30]
- In the long term, this line of work could lead to models with true long-term memory, able to maintain semantic consistency over very long contexts [30][31]
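To make the dual-path design in Group 3 concrete, here is a minimal NumPy sketch of the idea: a small local KV cache is attended densely, the large offloaded KV store is queried for only the top-k most relevant keys, and the two paths are merged under one softmax. The function name, the exact top-k scan standing in for a real ANN index, and all shapes are illustrative assumptions of mine, not the paper's implementation.

```python
# A minimal sketch of the dual-path idea behind RetrievalAttention, under
# simplifying assumptions: the "CPU store" is a plain array and the ANN
# search is replaced by an exact top-k scan (a real system would use an
# approximate nearest-neighbor index instead).
import numpy as np

def dual_path_attention(q, k_local, v_local, k_store, v_store, top_k=32):
    """Combine a small 'local' KV cache with the top-k keys retrieved
    from a large offloaded KV store, for a single query vector q."""
    d = q.shape[-1]

    # Path 1: dense attention over the small local cache (kept "on GPU").
    local_scores = k_local @ q / np.sqrt(d)              # (n_local,)

    # Path 2: retrieve only the most relevant keys from the big store.
    # Exact top-k here; RetrievalAttention would use ANN search instead.
    store_scores_all = k_store @ q / np.sqrt(d)          # (n_store,)
    idx = np.argpartition(-store_scores_all, top_k)[:top_k]
    store_scores = store_scores_all[idx]
    v_selected = v_store[idx]

    # Merge both paths under a single softmax so the result approximates
    # full attention restricted to the selected keys.
    scores = np.concatenate([local_scores, store_scores])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    values = np.concatenate([v_local, v_selected], axis=0)
    return weights @ values                              # (d_v,)

# Toy usage: 128 local tokens, 100k offloaded tokens, 64-dim heads.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
out = dual_path_attention(
    q,
    rng.standard_normal((128, 64)), rng.standard_normal((128, 64)),
    rng.standard_normal((100_000, 64)), rng.standard_normal((100_000, 64)),
)
print(out.shape)  # (64,)
```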
What Is an Inverted Index?
Sou Hu Cai Jing· 2025-09-04 04:14
Core Insights
- An inverted index is a data structure that maps each term to the list of documents containing that term, enabling quick document retrieval by keyword [1][3]
- Constructing an inverted index involves three main steps: text preprocessing, dictionary generation, and creation of the inverted record (posting) tables [1] (a minimal sketch of these steps follows this summary)
- Inverted index technology is widely used across data-processing fields and shows significant practical value, especially in search engines, log analysis systems, and recommendation systems [3]

Industry Applications
- Elasticsearch and similar full-text search engines rely on inverted indexes to deliver millisecond-level text retrieval responses [3]
- Log analysis systems leverage inverted indexes to quickly locate specific error messages or user-behavior patterns [3]
- Combining inverted indexes with vector retrieval is advancing Retrieval-Augmented Generation (RAG), supporting both exact matching and semantic similarity search [3]

Company Developments
- StarRocks, a next-generation real-time analytical database, shows significant advantages in inverted index technology, supporting full-text search and efficient queries over text data [5]
- The enterprise version of StarRocks, known as Jingzhou Database, enhances inverted-index performance with distributed index construction, handling petabyte-scale indexing tasks [8]
- Tencent has adopted StarRocks as the core technology platform for a large-scale vector retrieval system, overcoming the performance and scalability limits of traditional retrieval solutions [8]

Performance Improvements
- The StarRocks-based solution cut query response time by more than 80% compared with traditional methods while supporting larger data-processing needs [8]
- The optimized inverted-index structure and query algorithms in Tencent's system handle complex multidimensional query conditions while maintaining millisecond-level response times [8]
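The three construction steps above (preprocessing, dictionary generation, posting lists) map directly onto a few lines of Python. The tokenizer, the toy corpus, and the AND-style query helper below are placeholders of mine; production systems such as Elasticsearch or StarRocks implement the same idea with far more sophisticated analyzers and storage.

```python
# A minimal sketch of inverted-index construction and lookup.
from collections import defaultdict

def tokenize(text):
    """Step 1: text preprocessing (here just lowercase + whitespace split)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Steps 2-3: build the term dictionary and its posting lists,
    i.e. term -> sorted list of document ids containing the term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """AND-query: return ids of documents containing every query term."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "inverted index maps terms to documents",
    2: "vector retrieval complements the inverted index",
    3: "log analysis locates error messages quickly",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))   # [1, 2]
```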
Change Just 2 Lines of Code and RAG Efficiency Surges 30%! Works Across Multiple Tasks and Scales to Tens-of-Billions-Scale Data Applications
量子位 (QbitAI)· 2025-06-20 10:31
Core Viewpoint
- The article presents PSP (Proximity graph with Spherical Pathway), a new open-source method from a Zhejiang University team that improves the efficiency of RAG vector retrieval by 30% with just two lines of code changed. The method applies to tasks such as text-to-text, image-to-image, text-to-image, and recommendation-system recall, and scales to applications with billions of data points [1]

Group 1: Vector Retrieval Methodology
- Traditional vector retrieval is mostly based on Euclidean distance, answering "who is closest," whereas AI applications often need "semantic relevance," which is better captured by maximum inner product [2]
- Previous inner-product retrieval methods failed to satisfy the triangle inequality, leading to inefficiencies [3]
- PSP shows that minor modifications to existing graph structures suffice to find optimal solutions for maximum inner product retrieval [4]

Group 2: Technical Innovations
- PSP incorporates an early-stopping strategy to decide when to end the search, conserving computational resources and speeding up retrieval [5]
- Pairing vector models with vector databases is crucial for realizing the technology's potential, and the choice of "metric space" is a key factor [6]
- Many existing graph-based vector retrieval algorithms, such as HNSW and NSG, are designed for Euclidean space, causing "metric mismatch" in scenarios better served by maximum inner product retrieval [7]

Group 3: Algorithmic Insights
- The research identifies two paradigms for maximum inner product retrieval: converting it to minimum Euclidean distance, which often loses information, and searching directly in inner-product space, which lacks effective pruning methods [8]
- The difficulty of direct inner-product search is that inner-product space is not a strict metric space, in particular lacking the triangle inequality [9]
- The PSP team demonstrated that a greedy algorithm can find the globally optimal maximum-inner-product answer on a graph index built for Euclidean distance [10]

Group 4: Practical Applications and Performance
- PSP modifies the candidate-queue settings and the distance metric to optimize search behavior and avoid redundant calculations [13] (an illustrative greedy-search sketch follows this summary)
- Search behavior for maximum inner product differs markedly from Euclidean search, often requiring a pattern that expands from the inside out [16]
- Extensive tests on eight large-scale, high-dimensional datasets show that PSP outperforms existing state-of-the-art methods in stability and efficiency [21][23]

Group 5: Scalability and Generalization
- The test datasets spanned modalities including text-to-text, image-to-image, and recommendation-system recall, demonstrating PSP's strong generalization capabilities [25]
- PSP scales well, with query time growing roughly logarithmically, making it suitable for efficient retrieval over datasets with billions to hundreds of billions of points [26]
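As a rough illustration of the retrieval loop PSP modifies, the sketch below runs a standard greedy best-first search over a proximity graph, but scores candidates by inner product instead of Euclidean distance and stops once the top-k result set has not improved for a while. The graph construction, the patience-based stopping rule, and all names here are my own simplifications, not the authors' released code.

```python
# Greedy maximum-inner-product search over a proximity graph,
# with a crude early-stopping heuristic.
import heapq
import numpy as np

def greedy_mips_search(query, vectors, neighbors, entry, top_k=10, patience=64):
    """vectors:   (n, d) array of database vectors
       neighbors: dict node_id -> list of adjacent node ids (the graph index)
       entry:     id of the entry node"""
    visited = {entry}
    # Max-heap on inner product (negated, since heapq is a min-heap).
    candidates = [(-float(vectors[entry] @ query), entry)]
    results = []            # min-heap of (score, node) keeping the best top_k
    steps_without_gain = 0

    while candidates and steps_without_gain < patience:
        neg_score, node = heapq.heappop(candidates)
        score = -neg_score

        if len(results) < top_k:
            heapq.heappush(results, (score, node))
            steps_without_gain = 0
        elif score > results[0][0]:
            heapq.heapreplace(results, (score, node))
            steps_without_gain = 0
        else:
            steps_without_gain += 1   # no improvement: count toward early stop

        # Expand the current node's graph neighborhood.
        for nb in neighbors.get(node, []):
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (-float(vectors[nb] @ query), nb))

    return sorted(results, reverse=True)

# Toy usage on a random 16-neighbor graph over 1,000 vectors.
rng = np.random.default_rng(1)
X = rng.standard_normal((1_000, 32))
nbrs = {i: list(rng.choice(1_000, size=16, replace=False)) for i in range(1_000)}
print(greedy_mips_search(rng.standard_normal(32), X, nbrs, entry=0)[:3])
```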