A New Efficient Inference Framework for Search Agents: 3x Throughput, Latency Cut to 1/5, Without Sacrificing Answer Quality | Nankai & UIUC Research
量子位 (QbitAI) · 2025-05-29 01:08
Core Insights
- The article examines the efficiency challenges facing search agents powered by large language models (LLMs) and introduces SearchAgent-X, a new framework that substantially improves their performance [1][3][32].

Efficiency Bottlenecks
- The research identifies two main efficiency bottlenecks in search agents: retrieval accuracy and retrieval latency [4][8].
- The relationship between retrieval accuracy and efficiency is not monotonic: both too little and too much precision hurt efficiency. Low precision forces the agent into additional retrieval rounds, while high precision consumes excessive computational resources [5][6][7].
- Search agents benefit most from high-recall approximate search, which supports reasoning without incurring unnecessary cost [7].

Latency Issues
- Search agents are highly sensitive to retrieval latency: even a minor increase can be amplified into a much larger end-to-end delay, by as much as 83 times [11].
- Improper scheduling and retrieval stalls are identified as the primary causes of latency, with data showing that up to 55.9% of tokens may be needlessly recomputed because of scheduling issues [13].

SearchAgent-X Framework
- SearchAgent-X employs two main acceleration mechanisms: priority-aware scheduling and non-stall retrieval [14][16].
- Priority-aware scheduling dynamically prioritizes concurrent requests to minimize unnecessary waiting and redundant computation [17][18].
- Non-stall retrieval makes search flexible and non-blocking, terminating retrieval early once the results are deemed sufficient [19][20][22].

Performance Improvements
- In practical tests, SearchAgent-X increased throughput by 1.3 to 3.4 times and reduced average latency to 20% to 60% of that of baseline systems [27].
- The framework maintained generation quality comparable to the baselines, with slight accuracy improvements on some datasets attributed to the nature of approximate retrieval [28][29].
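To make the priority-aware scheduling idea from the SearchAgent-X Framework section more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Request` fields, the three priority signals (retrieval rounds completed, reusable cached tokens, waiting time), and the weights are all illustrative assumptions. The summary only states that concurrent requests are prioritized dynamically to reduce waiting and recomputation.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One in-flight search-agent request (hypothetical fields)."""
    request_id: str
    retrieval_rounds: int   # retrieval rounds already completed
    cached_tokens: int      # tokens whose cached computation could be reused
    waiting_time: float     # seconds spent waiting in the queue


def priority_score(req: Request) -> float:
    """Higher score means 'schedule sooner'. Weights are arbitrary placeholders."""
    return (
        1.0 * req.retrieval_rounds     # favor requests that are further along
        + 0.001 * req.cached_tokens    # favor requests with more reusable cache
        + 0.5 * req.waiting_time       # avoid starving long-waiting requests
    )


def schedule(requests: List[Request], batch_size: int) -> List[Request]:
    """Return the batch_size highest-priority requests, best first."""
    return heapq.nlargest(batch_size, requests, key=priority_score)


if __name__ == "__main__":
    queue = [
        Request("a", retrieval_rounds=0, cached_tokens=0, waiting_time=0.1),
        Request("b", retrieval_rounds=3, cached_tokens=2000, waiting_time=0.2),
        Request("c", retrieval_rounds=1, cached_tokens=500, waiting_time=4.0),
    ]
    print([r.request_id for r in schedule(queue, batch_size=2)])
```

The intuition behind this design is that a request returning from retrieval with a large reusable context is expensive to push back in the queue, because its cached work may be evicted and recomputed. A real scheduler would tune or replace these signals.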
Technical Contributions
- Each optimization component contributes significantly to overall performance, with priority scheduling reducing end-to-end latency by 35.55% and improving the cache hit rate [30].
- Non-stall retrieval further raises the cache hit rate and reduces latency, underscoring the importance of minimizing waiting time in complex AI systems [31].

Future Outlook
- The article concludes that future AI systems will need to interact more frequently with external tools and knowledge bases, which makes addressing the existing efficiency bottlenecks more pressing [32][33].
- It emphasizes balancing the performance of each individual tool against the overall workflow of the AI agent, so that delays and inefficiencies do not compound [34].
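The non-stall retrieval mechanism summarized above (non-blocking search with early termination once results are deemed sufficient) can be illustrated with a small Python sketch. This is a simplified stand-in, not the method from the research: the stopping rule used here (the top-k set staying unchanged for `patience` consecutive candidate batches) and all function and parameter names are assumptions made for illustration.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[float]]  # (doc_id, embedding), hypothetical layout


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def non_stall_search(
    query: Sequence[float],
    batches: Iterable[List[Doc]],
    k: int = 2,
    patience: int = 2,
) -> Tuple[List[str], int]:
    """Scan candidate batches incrementally and stop early.

    Stops once the top-k set has not changed for `patience` consecutive
    batches (an assumed 'results are sufficient' criterion).
    Returns (doc ids ranked best first, number of batches scanned).
    """
    top: List[Tuple[float, str]] = []  # min-heap of (score, doc_id)
    stable = 0
    scanned = 0
    for batch in batches:
        scanned += 1
        before = sorted(top)
        for doc_id, vec in batch:
            score = dot(query, vec)
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))
        # Count consecutive batches that left the top-k untouched.
        stable = stable + 1 if sorted(top) == before else 0
        if stable >= patience:
            break  # results look settled; do not make the LLM wait longer
    ranked = [doc_id for _, doc_id in sorted(top, reverse=True)]
    return ranked, scanned


if __name__ == "__main__":
    batches = [
        [("d1", [0.9, 0.1]), ("d2", [0.8, 0.0])],
        [("d3", [0.1, 0.9]), ("d4", [0.2, 0.5])],
        [("d5", [0.3, 0.3])],
        [("d6", [0.95, 0.0])],  # never scanned: search stops after batch 3
    ]
    print(non_stall_search([1.0, 0.0], batches))
```

In this toy run the search stops before the last batch, so it misses a slightly better document. That is the trade-off of approximate retrieval: a small loss in exactness in exchange for not stalling generation.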