KV caching

X @Avi Chawla
Avi Chawla· 2025-08-06 06:31
1️⃣2️⃣ KV caching. KV caching is a technique used to speed up LLM inference. I have linked my detailed thread below 👇 https://t.co/Dt1uH4iniq
Quoted tweet from Avi Chawla (@_avichawla): "KV caching in LLMs, clearly explained (with visuals): ..."
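The core idea behind the thread is that, during autoregressive decoding, the keys and values of already-processed tokens never change, so they can be stored and reused instead of being recomputed at every step. The sketch below is not from the thread; it is a minimal single-head, NumPy-only illustration with made-up dimensions, showing how a growing K/V cache lets each decode step project only the newly generated token.

```python
# Minimal sketch of KV caching (illustrative only, hypothetical dimensions):
# each decode step computes Q/K/V for the new token alone and appends the new
# K/V rows to a cache, instead of re-projecting the entire prefix.
import numpy as np

d_model = 64  # hypothetical model width for illustration

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_new, kv_cache):
    """Attend from the newest token to all cached positions.

    x_new:    (d_model,) hidden state of the token being generated
    kv_cache: dict with growing 'k' and 'v' arrays of shape (t, d_model)
    """
    q = x_new @ W_q
    k_new = x_new @ W_k
    v_new = x_new @ W_v

    # Append only the new K/V row; earlier rows are reused, never recomputed.
    kv_cache["k"] = np.vstack([kv_cache["k"], k_new[None, :]])
    kv_cache["v"] = np.vstack([kv_cache["v"], v_new[None, :]])

    scores = (kv_cache["k"] @ q) / np.sqrt(d_model)  # (t,)
    attn = softmax(scores)
    return attn @ kv_cache["v"]                      # (d_model,)

# Usage: start with an empty cache and feed tokens one at a time.
cache = {"k": np.empty((0, d_model)), "v": np.empty((0, d_model))}
for _ in range(5):
    out = decode_step(rng.standard_normal(d_model), cache)
print(cache["k"].shape)  # (5, 64): K/V grow by one row per generated token
```

In this sketch the cache grows by one row per generated token, while the prompt's rows would be computed once during prefill and then reused by every subsequent step.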
X @Avi Chawla
Avi Chawla· 2025-07-27 19:23
LLM Technology Explained - KV caching in LLMs: the KV caching mechanism in LLMs is clearly explained, with accompanying visuals [1]
X @Avi Chawla
Avi Chawla· 2025-07-27 06:31
Key Takeaways
- The author encourages readers to reshare the content if they found it insightful [1]
- The author shares tutorials and insights on DS (Data Science), ML (Machine Learning), LLMs (Large Language Models), and RAGs (Retrieval-Augmented Generation) daily [1]
Focus Area
- The content clearly explains KV caching in LLMs with visuals [1]
Author Information
- Avi Chawla's Twitter handle is @_avichawla [1]
X @Avi Chawla
Avi Chawla· 2025-07-27 06:30
Technology Overview
- KV caching is used in Large Language Models (LLMs) to enhance inference performance [1]
- The document provides a clear explanation of KV caching in LLMs with visuals [1]
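The performance gain comes at a memory cost: the cache must hold one key and one value vector per layer, per KV head, per token. A back-of-envelope sketch, using a hypothetical Llama-style 8B configuration that is not taken from the post:

```python
# Back-of-envelope KV cache size (hypothetical configuration, not from the post):
# the cache grows linearly with context length and batch size.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # Factor of 2 covers keys AND values; bytes_per_elem=2 assumes FP16/BF16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: a Llama-style 8B layout (32 layers, 8 KV heads, head_dim 128)
# at a 32k-token context needs roughly 4 GiB of cache for a single sequence.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")
```

This linear growth with context length and batch size is why KV-cache memory often becomes the limiting resource when serving long contexts.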
How fast are LLM inference engines anyway? — Charles Frye, Modal
AI Engineer· 2025-06-27 10:01
Open Model Landscape & Benchmarking
- Open-weight models are catching up to the frontier labs in capability, making many AI engineering applications possible that weren't before [1]
- Open-source inference engines such as vLLM, SGLang, and TensorRT-LLM are readily available, reducing the need for custom model implementations [1]
- Modal has created a public benchmark (modal.com/llmalmanac) for comparing the performance of different models and engines across various context lengths [2][3]
Performance Analysis
- Throughput is significantly higher when processing longer input contexts (prefill) than when generating longer output sequences (decode), with up to a 4x difference observed [15][16]
- Time to first token (latency) remains nearly constant even with a 10x increase in input tokens, suggesting a "free lunch" from prioritizing added context over added reasoning tokens [19]
- Gemma 7B models show roughly the same throughput as Qwen 3 models despite having roughly 10x fewer weights, pointing to differences in optimization [12]
Optimization & Infrastructure
- Scaling out (adding more GPUs) is the primary way to increase total throughput, rather than scaling up (squeezing more out of a single GPU) [23]
- The benchmarking methodology sends a thousand requests to determine maximum throughput and single requests to determine the fastest possible server response time; a minimal sketch follows this list [24][25]
- BF16 has slower tensor-core support than FP8 or FP4, suggesting even greater performance gains from lower-precision formats on newer hardware like Blackwell [16][17]
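The two probes in the methodology bullet can be approximated against any OpenAI-compatible server. The sketch below is an assumption-laden illustration, not Modal's benchmark code: the endpoint URL, model name, and request counts are placeholders, and it uses the standard completions API with streaming to time the first token.

```python
# Sketch of the two measurements described above, assuming an OpenAI-compatible
# server (e.g. a vLLM or SGLang deployment) at BASE_URL. Endpoint, model name,
# and counts are placeholders, not values from the talk.
import concurrent.futures
import time

import requests

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL = "my-model"                                  # placeholder model name

def time_to_first_token(prompt, max_tokens=128):
    """Latency probe: one streaming request, timed until the first chunk arrives."""
    start = time.perf_counter()
    with requests.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt,
              "max_tokens": max_tokens, "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line ~ first generated token
                return time.perf_counter() - start
    return None

def max_throughput(prompts, max_tokens=128, concurrency=64):
    """Throughput probe: flood the server with concurrent requests and
    divide total generated tokens by wall-clock time."""
    def one(prompt):
        resp = requests.post(
            BASE_URL,
            json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
            timeout=600,
        )
        return resp.json()["usage"]["completion_tokens"]

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(one, prompts))
    return tokens / (time.perf_counter() - start)  # tokens per second
```

Sending many concurrent requests saturates the server and yields an aggregate tokens-per-second figure (the throughput number), while single streaming requests isolate the best-case latency (time to first token).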