Vector Retrieval
Seven Years in Autonomous Driving: Most "Data Closed Loops" Are Pseudo Closed Loops
自动驾驶之心· 2026-01-08 05:58
Core Viewpoint
- The concept of a "data closed loop" in the autonomous driving industry is still largely limited to small internal loops within algorithm teams, rather than the grand vision of a comprehensive system that directly solves problems through data [1]

Group 1: Definition of a "True Data Closed Loop"
- A "true closed loop" must satisfy three levels: automated problem discovery, quantifiable and reviewable solution effects, and a comprehensive trigger system that integrates real-time and historical data [4][5]
- The ideal state is a system that automatically classifies issues, routes them to the appropriate teams, and assists in developing trigger rules, reducing reliance on manual processes [5]

Group 2: Current Industry Practices
- Many companies' so-called "data closed loops" are more accurately described as "data-driven development processes with some automation tools," limited primarily to the perspective of individual algorithm teams [8]
- Typical workflows are module-level and algorithm-focused, lacking a system-wide perspective [9]

Group 3: Reasons for the Lack of True Closed Loops
- The starting point for many companies is a "passive closed loop," where problems are identified reactively rather than through automated data analysis [10]
- Attributing issues is often difficult, as multiple interrelated factors contribute to the same phenomenon [12]
- The data-to-solution chain often stops at data-to-model, failing to address real-world problems effectively [16]

Group 4: Data Closed Loop Practices
- The company has developed a more aggressive approach to data closed loops, treating data as a product and metrics as first-class citizens [24]
- The overall strategy is to quantify real-world pain points and use triggers to convert them into actionable data [25]
Group 5: Trigger Mechanism
- The trigger mechanism is designed to be lightweight and high-recall, ensuring that significant events are captured without overwhelming the system [32]
- Once a trigger fires, it generates a micro log that is uploaded for further analysis, leading to more detailed data collection if necessary [35]

Group 6: Unified Trigger Framework
- A unified trigger framework written in Python allows consistent implementation across vehicle-side data mining, cloud-side data analysis, and simulation validation [50]
- The framework enables non-technical team members to write rules, democratizing the process of data analysis [54]

Group 7: Distinction Between World Labels and Algorithm Labels
- The company maintains two types of labels: world-level labels that describe objective physical conditions, and model-level labels that depend on algorithm performance [61]
- This distinction is crucial for effective data analysis and problem-solving in the autonomous driving context [61]

Group 8: Use of Generative and Simulation Data
- Generative data is primarily used to cover long-tail scenarios that are hard to encounter in the real world, but real data remains essential for evaluation and validation [67]
- The company emphasizes filtering data through structured labels before applying vector retrieval, to ensure efficiency and accuracy [64]
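The summary describes a unified Python framework in which lightweight, high-recall trigger rules fire on vehicle signals and emit micro logs for later analysis. A minimal sketch of that idea follows; the `Trigger`/`TriggerEngine` names, the signal fields, and the thresholds are all illustrative assumptions, not details from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical frame of vehicle signals a rule inspects; field names
# ("decel_mps2", "driver_takeover") are invented for illustration.
Frame = Dict[str, float]

@dataclass
class Trigger:
    name: str
    condition: Callable[[Frame], bool]  # lightweight, high-recall check

@dataclass
class TriggerEngine:
    triggers: List[Trigger] = field(default_factory=list)

    def register(self, trigger: Trigger) -> None:
        self.triggers.append(trigger)

    def evaluate(self, frame: Frame) -> List[dict]:
        # Each firing produces a "micro log" that would be uploaded
        # for further analysis and, if needed, richer data collection.
        return [
            {"trigger": t.name, "frame": frame}
            for t in self.triggers
            if t.condition(frame)
        ]

engine = TriggerEngine()
engine.register(Trigger("hard_brake", lambda f: f.get("decel_mps2", 0.0) > 4.0))
engine.register(Trigger("takeover", lambda f: f.get("driver_takeover", 0.0) == 1.0))

logs = engine.evaluate({"decel_mps2": 5.2, "driver_takeover": 0.0})
```

Because rules are plain predicates over a dictionary, non-engineers could plausibly contribute them, which matches the "democratized rule writing" point in Group 6.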
Vector Retrieval Under Fire! Fu Cong and Zhejiang University Release the IceBerg Benchmark: HNSW Is Not Optimal, and Evaluation Systems Are Seriously Biased
量子位· 2025-12-25 11:51
Core Insights
- The integration of multimodal data into RAG and agent frameworks is a hot topic in LLM applications, with vector retrieval being the most natural recall method for multimodal data [1]
- There is a misconception that vector retrieval methods have been standardized, particularly around HNSW, which does not perform well in many downstream tasks [1]
- A new benchmark called IceBerg evaluates vector retrieval algorithms on downstream semantic tasks rather than traditional metrics like Recall-QPS, challenging past industry perceptions [1]

Group 1: Misconceptions in Vector Retrieval
- Many believe that vector retrieval methods are standardized, leading to a reliance on HNSW without considering its performance in real-world tasks [1]
- The evaluation systems used in the past only scratch the surface of the complexities involved in vector retrieval [1]
- A significant disparity exists between the perceived effectiveness of vector retrieval methods and their actual performance on downstream tasks [7]

Group 2: Case Studies and Findings
- On a large-scale face verification dataset (Glink360K), face recognition accuracy saturated before Recall reached 99%, indicating a disconnect between distance-metric recall and actual task performance [5]
- NSG, a state-of-the-art vector retrieval algorithm, shows a clear advantage in distance-metric recall but underperforms RaBitQ on downstream semantic tasks [5]
- Different metric spaces can lead to vastly different downstream outcomes, highlighting the importance of metric selection in vector retrieval [6]

Group 3: Information Loss and Model Limitations
- An information loss funnel model illustrates how information is lost at each stage of the embedding process, leading to discrepancies between expected and actual outcomes [7]
- The capacity of the representation model directly affects embedding quality, with generalization error and learning objectives both impacting performance [10][11]
- Many models do not prioritize learning a good metric space, which can cause significant information loss during embedding [13]

Group 4: Metric and Algorithm Selection
- The choice of metric (Euclidean vs. inner product) can substantially affect results, especially with generative representation models [15]
- Vector retrieval methods, categorized into space-partitioning and graph-based indexing, perform differently depending on data distribution [17]
- The IceBerg benchmark reveals a reshuffling of algorithm rankings, demonstrating that HNSW is not always the top performer on downstream tasks [18]

Group 5: Automation and Future Directions
- IceBerg provides an automated algorithm selection tool that helps users choose a suitable method without extensive background knowledge [21]
- Statistical indicators can reveal the affinity of embeddings to particular metrics and algorithms, enabling automated decision-making [23]
- The research team calls for future vector retrieval studies to focus on task-metric compatibility and the development of unified vector retrieval algorithms [25]
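The recall/downstream disconnect described above can be demonstrated with a toy experiment: a retriever with only 50% recall@10 against the exact top-10 can still solve a downstream classification task perfectly, because the "wrong" neighbours it returns carry the same label. This sketch is a constructed illustration of that phenomenon, not the IceBerg benchmark itself; all sizes and the imperfect-retriever construction are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: two well-separated classes of embeddings.
n, d = 200, 16
labels = np.array([0] * 100 + [1] * 100)
base = rng.normal(size=(n, d)) + labels[:, None] * 6.0

query = base[0] + rng.normal(scale=0.1, size=d)  # probe near a class-0 point

dists = np.linalg.norm(base - query, axis=1)
exact_top10 = set(np.argsort(dists)[:10].tolist())

# A deliberately imperfect retriever: it keeps only half of the true
# top-10 and pads with other class-0 points, mimicking low recall.
kept = sorted(exact_top10)[:5]
pool = [i for i in np.argsort(dists).tolist()
        if labels[i] == 0 and i not in exact_top10]
approx = kept + pool[:5]

recall_at_10 = len(set(approx) & exact_top10) / 10   # distance-metric recall
pred = np.bincount(labels[approx]).argmax()          # downstream kNN vote
```

Here `recall_at_10` is 0.5, yet the majority-vote prediction still recovers the correct class, which is exactly the gap between Recall-QPS curves and downstream task accuracy that the benchmark targets.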
Microsoft Research's Lu Baotong: Reshaping Model Attention with Vector Retrieval — Attention
36Ke· 2025-11-17 08:02
Core Insights
- The article discusses the limitations of long-context reasoning in large language models (LLMs), caused by the quadratic complexity of self-attention and the large memory footprint of key-value (KV) caching [1][5]
- It introduces Retrieval Attention, a mechanism that accelerates long-context LLM inference through dynamic sparse attention and requires no retraining [1][8]

Group 1: Retrieval Attention Mechanism
- Retrieval Attention posits that each query needs to interact with only a small subset of keys, making most attention computation redundant [3][7]
- Most KV vectors are offloaded from the GPU to the CPU, and approximate nearest neighbor (ANN) search identifies the most relevant keys for each query [3][7]
- This allows significant memory savings: an 8B model needs only about 1/10 of the original KV-cache memory while maintaining accuracy [22]

Group 2: Performance Metrics
- Empirical tests on an RTX 4090 (24 GB) show the 8B model generating stably with a 128K context at roughly 0.188 seconds per token, at nearly the same precision as full attention [5][6]
- The follow-up work, RetroInfer, achieved 4.5x higher decoding throughput on A100 GPUs than full attention, and 10.5x higher throughput at 1M-token contexts than other sparse attention systems [5][22]

Group 3: System Architecture
- The architecture features a dual-path attention mechanism: the GPU retains a small amount of "predictable" local KV cache, while the CPU serves dynamic retrieval from a large-scale KV store [7][8]
- This design reduces both memory usage and inference latency, enabling efficient long-context reasoning without retraining the model [8][22]

Group 4: Theoretical and Practical Contributions
- The work offers a new theoretical perspective by framing the attention mechanism as a retrieval system, allowing more precise identification of important contextual information [23][25]
- It also emphasizes system-level optimizations, transforming the traditional linear cache into a dynamically allocated structure that improves efficiency in large-scale inference scenarios [23][25]

Group 5: Future Directions
- Future research may establish a more rigorous theoretical framework for the error bounds of Retrieval Attention and explore integrating dynamic learning mechanisms with system-level optimizations [26][30]
- In the long term, this line of research could lead to models with true long-term memory, maintaining semantic consistency over very long contexts [30][31]
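The core claim, that each query only needs a small subset of keys, can be illustrated numerically: when the attention distribution is peaked, attending over only the top-k keys by inner product reproduces full attention almost exactly. The sketch below uses an exact top-k selection as a stand-in for the CPU-side ANN lookup; it is a simplified illustration of the sparsity argument, not the Retrieval Attention system itself.

```python
import numpy as np

def full_attention(q, K, V):
    """Standard single-query softmax attention."""
    s = q @ K.T
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def topk_attention(q, K, V, k):
    # Stand-in for the ANN lookup: keep only the k keys with the largest
    # inner product with the query, then attend over that subset.
    idx = np.argsort(q @ K.T)[-k:]
    return full_attention(q, K[idx], V[idx])

rng = np.random.default_rng(1)
n, d = 256, 32
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = K[7] * 10.0  # query strongly aligned with one key -> peaked softmax

dense = full_attention(q, K, V)
sparse = topk_attention(q, K, V, k=32)
err = float(np.max(np.abs(dense - sparse)))  # tiny when attention is peaked
```

With 32 of 256 keys retained, the approximation error is negligible here, which is the intuition behind offloading the rest of the KV cache off the GPU.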
What Is an Inverted Index?
Sohu Caijing· 2025-09-04 04:14
Core Insights
- An inverted index is a data structure that maps each term to the list of documents containing it, enabling fast keyword-based document retrieval [1][3]
- Building an inverted index involves three main steps: text preprocessing, dictionary generation, and construction of the inverted record tables (posting lists) [1]
- Inverted index technology is widely used across data processing fields, with particular practical value in search engines, log analysis systems, and recommendation systems [3]

Industry Applications
- Elasticsearch and similar full-text search engines use inverted indexes to deliver millisecond-level text retrieval responses [3]
- Log analysis systems use inverted indexes to quickly locate specific error messages or user behavior patterns [3]
- Combining inverted indexes with vector retrieval is advancing Retrieval-Augmented Generation (RAG), supporting both exact matching and semantic similarity search [3]

Company Developments
- StarRocks, a next-generation real-time analytical database, shows significant advantages in inverted index technology, supporting full-text search and efficient text-data queries [5]
- The enterprise version of StarRocks, known as Jingzhou Database, adds distributed index construction, handling petabyte-scale indexing tasks [8]
- Tencent has adopted StarRocks as the core platform for a large-scale vector retrieval system, overcoming the performance and scalability limits of traditional retrieval solutions [8]

Performance Improvements
- The StarRocks-based solution cut query response time by over 80% compared with traditional methods while supporting larger data volumes [8]
- The optimized inverted index structure and query algorithms in Tencent's system support complex multidimensional query conditions while maintaining millisecond-level response times [8]
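The three construction steps above (preprocess text, build the term dictionary, accumulate posting lists) fit in a few lines. This is a generic textbook sketch, not StarRocks' or Elasticsearch's implementation; the tokenizer here is a deliberately naive `\w+` split.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Preprocessing: lowercase + naive word tokenization.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    # Posting lists are kept sorted to allow efficient merging.
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """AND query: intersect the posting lists of all terms."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = [
    "error: disk full on node 3",
    "user login succeeded",
    "disk replaced on node 3",
]
index = build_inverted_index(docs)
hits = search(index, "disk", "node")  # -> [0, 2]
```

The log-analysis use case mentioned above is exactly this pattern: intersecting small posting lists locates matching log lines without scanning the corpus.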
Change Just 2 Lines of Code and RAG Efficiency Jumps 30%! Works Across Many Tasks, Scales to Tens-of-Billions-Scale Data
量子位· 2025-06-20 10:31
Core Viewpoint
- The article discusses PSP (Proximity graph with Spherical Pathway), a new open-source method from a Zhejiang University team that improves the efficiency of RAG vector retrieval by 30% with just two lines of code changed. It applies to text-to-text, image-to-image, text-to-image, and recommendation-recall tasks, and scales to applications with billions of data points [1]

Group 1: Vector Retrieval Methodology
- Traditional vector retrieval is largely based on Euclidean distance, answering "who is closest," whereas AI often needs "semantic relevance," which is better captured by maximum inner product [2]
- Previous inner product retrieval methods fail to satisfy the triangle inequality, leading to inefficiencies [3]
- PSP shows that minor modifications to existing graph structures suffice to find optimal solutions for maximum inner product retrieval [4]

Group 2: Technical Innovations
- PSP adds an early-stopping strategy that decides when to end the search, saving computation and speeding up retrieval [5]
- Pairing vector models with vector databases is crucial for realizing the technology's potential, and the choice of "metric space" is a key factor [6]
- Many graph-based retrieval algorithms, such as HNSW and NSG, are designed for Euclidean space, causing "metric mismatch" in scenarios better served by maximum inner product retrieval [7]

Group 3: Algorithmic Insights
- The research identifies two paradigms for maximum inner product retrieval: converting it to minimum Euclidean distance, which often loses information, and searching directly in inner product space, which lacks effective pruning methods [8]
- Direct inner-product-space retrieval is hard because inner product space is not a strict metric space; in particular, it lacks the triangle inequality [9]
- The PSP team demonstrated that a greedy algorithm can find the globally optimal maximum inner product solution on a graph index built for Euclidean distance [10]

Group 4: Practical Applications and Performance
- PSP modifies the candidate-queue settings and the distance metric to optimize search behavior and avoid redundant computation [13]
- Maximum inner product search behaves very differently from Euclidean search, often requiring a pattern that expands from the inside outward [16]
- Extensive tests on eight large-scale, high-dimensional datasets show that PSP outperforms existing state-of-the-art methods in stability and efficiency [21][23]

Group 5: Scalability and Generalization
- The test datasets spanned modalities including text-to-text, image-to-image, and recommendation recall, demonstrating PSP's strong generalization [25]
- PSP scales well, with search time growing logarithmically, making it suitable for efficient retrieval over datasets of billions to hundreds of billions of points [26]
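The "two lines of code" framing above amounts to: keep the Euclidean-built proximity graph, but change the scoring function (and hence the candidate-queue ordering) from distance to inner product. The sketch below shows a best-first graph search with a pluggable score to make that swap concrete; it is an assumption-laden illustration, not PSP's actual algorithm, and it searches exhaustively rather than using PSP's early-stopping strategy.

```python
import heapq
import numpy as np

def greedy_search(graph, vectors, query, entry, score, k=5):
    """Best-first search over a proximity graph with a pluggable score.

    Swapping `score` from negative Euclidean distance to inner product is
    the spirit of the 'two-line change'; PSP's graph construction and
    early stopping are not reproduced here.
    """
    visited = {entry}
    frontier = [(-score(vectors[entry], query), entry)]  # max-heap via negation
    popped = []
    while frontier:
        neg_s, node = heapq.heappop(frontier)
        popped.append((neg_s, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (-score(vectors[nb], query), nb))
    return [n for _, n in heapq.nsmallest(k, popped)]

euclidean = lambda v, q: -float(np.linalg.norm(v - q))  # larger is better
inner_product = lambda v, q: float(v @ q)

rng = np.random.default_rng(2)
vectors = rng.normal(size=(50, 8))
# Tiny Euclidean proximity graph: each node links to its 4 nearest
# neighbours, plus a ring edge to keep the graph connected.
d2 = ((vectors[:, None, :] - vectors[None, :, :]) ** 2).sum(-1)
graph = {i: [(i + 1) % 50] + [int(j) for j in np.argsort(d2[i])[1:5]]
         for i in range(50)}

query = rng.normal(size=8)
ids = greedy_search(graph, vectors, query, entry=0, score=inner_product)
exact = int(np.argmax(vectors @ query))  # brute-force MIPS answer
```

On this toy Euclidean-built graph, the inner-product-ordered search still recovers the true maximum-inner-product point, echoing the paper's claim that a greedy walk on a Euclidean graph index can reach the global MIPS optimum.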