X @Avi Chawla

Core Innovation
- Meta's REFRAG fundamentally rethinks retrieval in RAG setups by compressing and filtering context at the vector level [1]
- REFRAG compresses each retrieved chunk into a single embedding and uses a relevance policy trained via RL to select the most relevant chunks [1][2]
- Only the selected chunks are expanded back into full token-level embeddings and passed to the LLM, so the model processes only what matters [2]

Technical Details
- REFRAG encodes documents and stores them in a vector database [2]
- At query time, it encodes the full user query, finds the relevant chunks, and computes token-level embeddings for both [3]
- A relevance policy, trained via RL, selects which chunks to keep expanded [3][5]
- The token-level representation of the query is concatenated with the token-level embeddings of the selected chunks and the single compressed vectors of the rejected chunks before being sent to the LLM [3] (a minimal sketch of this pipeline follows below)

Performance Metrics
- REFRAG outperforms LLaMA on 16 RAG benchmarks [4][6]
- It achieves 30.85x faster time-to-first-token, 3.75x better than the previous state of the art [4][6]
- It handles 16x larger context windows [4][6]
- It uses 2-4x fewer tokens [4][6]
- It shows no accuracy loss across RAG, summarization, and multi-turn conversation tasks [6]
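
To make the compress-then-selectively-expand flow concrete, here is a minimal, self-contained Python sketch. Everything in it is an assumption for illustration: mean pooling stands in for REFRAG's trained chunk encoder, cosine similarity stands in for the RL-trained relevance policy, and random vectors stand in for real token embeddings.

```python
import numpy as np

# Hypothetical dimensions; REFRAG's actual encoder and sizes differ.
EMBED_DIM = 64       # embedding size fed to the LLM
CHUNK_TOKENS = 16    # max tokens per retrieved chunk in this toy example


def encode_tokens(text: str, rng: np.random.Generator) -> np.ndarray:
    """Stand-in token encoder: returns (num_tokens, EMBED_DIM) embeddings."""
    num_tokens = min(CHUNK_TOKENS, max(1, len(text.split())))
    return rng.standard_normal((num_tokens, EMBED_DIM))


def compress_chunk(token_embeddings: np.ndarray) -> np.ndarray:
    """Compress a chunk's token embeddings into a single vector
    (mean pooling here; REFRAG trains a dedicated encoder for this step)."""
    return token_embeddings.mean(axis=0)


def relevance_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Score chunks against the query. REFRAG trains this policy with RL;
    cosine similarity is used here purely as a placeholder."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q


def build_llm_input(query: str, chunks: list[str], top_k: int = 2,
                    seed: int = 0) -> np.ndarray:
    """Assemble the sequence passed to the LLM: query tokens, expanded
    token embeddings for selected chunks, one compressed vector per rejected chunk."""
    rng = np.random.default_rng(seed)
    query_tokens = encode_tokens(query, rng)                 # token-level query
    chunk_tokens = [encode_tokens(c, rng) for c in chunks]   # token-level chunks
    compressed = np.stack([compress_chunk(t) for t in chunk_tokens])

    scores = relevance_scores(compress_chunk(query_tokens), compressed)
    selected = set(np.argsort(scores)[::-1][:top_k])

    pieces = [query_tokens]
    for i, toks in enumerate(chunk_tokens):
        if i in selected:
            pieces.append(toks)                    # expand: full token embeddings
        else:
            pieces.append(compressed[i][None, :])  # keep: single compressed vector
    return np.concatenate(pieces, axis=0)


if __name__ == "__main__":
    seq = build_llm_input(
        "How does REFRAG reduce time-to-first-token?",
        ["chunk about RAG latency", "chunk about an unrelated topic",
         "chunk about context compression", "chunk about sports"],
    )
    print("LLM input length (vectors):", seq.shape[0])
```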
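
The latency and token-count gains come from the shorter sequence the LLM actually prefills. The arithmetic below is purely illustrative (the chunk counts and sizes are assumed, not taken from the paper): replacing most chunks with one vector each shrinks the input, and since prefill cost grows superlinearly with length, time-to-first-token drops even faster than the raw token count.

```python
# Illustrative arithmetic only; these counts are assumptions, not REFRAG's reported setup.
num_chunks = 16
tokens_per_chunk = 128
query_tokens = 32
expanded_chunks = 4   # chunks the policy expands back to token level

baseline_len = query_tokens + num_chunks * tokens_per_chunk
refrag_len = (query_tokens
              + expanded_chunks * tokens_per_chunk
              + (num_chunks - expanded_chunks) * 1)  # 1 compressed vector each

print(f"baseline input length:   {baseline_len}")            # 2080
print(f"compressed input length: {refrag_len}")               # 556
print(f"reduction: {baseline_len / refrag_len:.1f}x")          # ~3.7x
```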