Scale Up Retrieval, Keep Generation Light: CMU Team Systematically Evaluates the Corpus-Model Trade-off in RAG
机器之心· 2026-01-06 00:31
Core Insights
- The core argument of the research is that expanding the retrieval corpus can significantly improve Retrieval-Augmented Generation (RAG) performance. The gain can partially substitute for increasing model parameters, although returns diminish at larger corpus sizes [4][22].

Group 1: Research Findings
- RAG performance is determined by two components: the retrieval module, which supplies evidence, and the generation model, which interprets the question and integrates that evidence into an answer [7].
- Smaller models can reach performance comparable to larger models when the retrieval corpus is enlarged, and the same pattern holds across multiple datasets [11][12].
- The largest performance gain comes from introducing retrieval at all (going from no retrieval to some retrieval); further increases in corpus size yield diminishing returns [13].

Group 2: Experimental Design
- The research used a full factorial design that varied only corpus size and model size while holding other variables constant, drawing on a corpus of approximately 264 million real web documents [9].
- The evaluation covered three open-domain question-answering benchmarks (Natural Questions, TriviaQA, and WebQuestions) and used common metrics such as F1 and Exact Match [9].

Group 3: Mechanisms of Improvement
- A larger corpus raises the probability of retrieving segments that contain the answer, which gives the generation model more reliable evidence [16].
- The study defines the Gold Answer Coverage Rate as the probability that at least one of the top chunks provided to the generation model contains the correct answer string. This rate increases monotonically with corpus size [16].
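
The Gold Answer Coverage Rate described above can be approximated with a simple string-match check over the retrieved chunks. The following is a minimal sketch, not the study's implementation: the function names, the normalization rules (lowercasing, stripping punctuation and English articles), and the `top_k` parameter are assumptions made for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this normalization is a common convention for open-domain QA
    string matching; the study may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def chunk_has_answer(chunk: str, gold_answers: list[str]) -> bool:
    """True if any normalized gold answer string occurs in the normalized chunk."""
    norm_chunk = normalize(chunk)
    norm_golds = [normalize(answer) for answer in gold_answers]
    return any(gold and gold in norm_chunk for gold in norm_golds)


def gold_answer_coverage(examples, top_k: int) -> float:
    """Fraction of questions whose top_k chunks contain at least one gold answer.

    `examples` is a list of (retrieved_chunks, gold_answers) pairs, with
    retrieved_chunks ordered by retrieval rank (hypothetical data layout).
    """
    if not examples:
        return 0.0
    hits = sum(
        any(chunk_has_answer(chunk, golds) for chunk in chunks[:top_k])
        for chunks, golds in examples
    )
    return hits / len(examples)


# Toy data for illustration only; not taken from the study.
examples = [
    (["Paris is the capital of France.", "Berlin is in Germany."], ["Paris"]),
    (["The Nile is a river in Africa."], ["Amazon"]),
    (["Unrelated text.", "Mount Everest is the tallest mountain."], ["Mount Everest"]),
]
coverage_at_1 = gold_answer_coverage(examples, top_k=1)
coverage_at_2 = gold_answer_coverage(examples, top_k=2)
```

Computing this value at several corpus sizes with the retriever and `top_k` held fixed is one way to check for the monotonic trend the study reports.
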

Group 4: Practical Implications
- The research suggests that when resources are constrained, prioritizing the expansion of the retrieval corpus and improving coverage can allow medium-sized generation models to perform close to larger models [20].
- The study emphasizes the importance of tracking answer coverage and utilization rates as diagnostic metrics to identify whether bottlenecks are in the retrieval or generation components [20].
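
To make the diagnostic idea concrete, the sketch below tracks both metrics per question and flags the weaker component. It rests on assumptions: the summary does not define the utilization rate, so here it is taken to be the share of covered questions that the generator answers correctly. The `QARecord` type, the function names, and the `threshold` value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    covered: bool  # a gold answer string appears in the chunks given to the generator
    correct: bool  # the generated answer matched a gold answer (e.g. Exact Match)


def coverage_and_utilization(records: list[QARecord]) -> tuple[float, float]:
    """Return (coverage, utilization) for a set of evaluated questions.

    coverage    = covered questions / all questions
    utilization = correct answers among covered questions / covered questions
                  (assumed definition, not confirmed by the source)
    """
    total = len(records)
    covered = [r for r in records if r.covered]
    coverage = len(covered) / total if total else 0.0
    utilization = sum(r.correct for r in covered) / len(covered) if covered else 0.0
    return coverage, utilization


def likely_bottleneck(coverage: float, utilization: float, threshold: float = 0.5) -> str:
    """Rough heuristic; the threshold is arbitrary and should be tuned per task."""
    if coverage < threshold:
        return "retrieval"   # evidence rarely reaches the generator
    if utilization < threshold:
        return "generation"  # evidence is present but often not used correctly
    return "neither"


# Toy records for illustration only; not results from the study.
records = [
    QARecord(covered=True, correct=True),
    QARecord(covered=True, correct=False),
    QARecord(covered=False, correct=False),
    QARecord(covered=True, correct=True),
]
coverage, utilization = coverage_and_utilization(records)
verdict = likely_bottleneck(coverage, utilization)
```

Low coverage points toward enlarging or improving the retrieval corpus, while high coverage with low utilization points toward the generation model, which matches the split between retrieval and generation bottlenecks described above.
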