KV cache

X @Polyhedra
Polyhedra · 2025-09-25 12:00
6/ Currently working on Gemma3 quantization, focusing on:
- Learning the new model architecture
- Adding KV cache support (which accelerates inference)
- Implementing quantization support for some new operators

Full operator support will require 1+ additional day, plus more time for accuracy testing. Stay tuned for more updates 🔥 ...
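Not from the thread, but a minimal sketch of what KV caching buys during autoregressive decoding: without a cache, every new token re-projects keys and values for the entire prefix; with one, each step projects only the newest token and appends it. This is hypothetical toy code (single attention head, random weights, NumPy), not Gemma3 or Polyhedra's implementation.

```python
# Toy single-head attention decode step illustrating KV caching.
# Hypothetical demo code (random weights), not Gemma3 internals.
import numpy as np

d = 8                                            # toy hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """Project only the newest token, append its K/V to the cache,
    then attend over the whole cached sequence."""
    q = x_new @ Wq                               # query for the new token, (d,)
    k_cache = np.vstack([k_cache, x_new @ Wk])   # cache grows by one row -> (t, d)
    v_cache = np.vstack([v_cache, x_new @ Wv])
    scores = k_cache @ q / np.sqrt(d)            # (t,) attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax over cached positions
    return w @ v_cache, k_cache, v_cache         # output, updated caches

# Decode 5 tokens: each step projects one token instead of
# re-projecting the whole prefix from scratch.
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for _ in range(5):
    x = rng.standard_normal(d)
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(k_cache.shape)                             # (5, 8): one cached K row per token
```

The trade is the one the next post quantifies: the per-step compute drops, but the cached K/V rows have to live in memory for the whole generation.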
X @Avi Chawla
Avi Chawla · 2025-07-27 06:31
That said, KV cache also takes a lot of memory. Llama3-70B has:
- total layers = 80
- hidden size = 8k
- max output size = 4k

Here:
- Every token takes up ~2.5 MB in KV cache.
- 4k tokens will take up 10.5 GB.

More users → more memory. I'll cover KV optimization soon. https://t.co/VjnyLa6aLa ...
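A quick check of that arithmetic, under the assumptions the quoted numbers imply: an FP16 cache (2 bytes per element) and one full K plus one full V vector per layer. (The real Llama3-70B uses grouped-query attention with 8 KV heads, which shrinks the cache roughly 8× relative to this estimate.)

```python
# Reproducing the tweet's back-of-envelope numbers under the assumptions
# above (FP16, full K+V per layer, no grouped-query attention).
layers = 80                  # total layers
hidden = 8192                # hidden size ("8k")
bytes_per_elem = 2           # FP16
kv_factor = 2                # one K and one V vector per layer

per_token = kv_factor * layers * hidden * bytes_per_elem
print(per_token / 2**20)     # 2.5 -> ~2.5 MiB per token, matching the tweet

tokens = 4096                # max output size ("4k")
print(per_token * tokens / 2**30)   # 10.0 -> ~10 GiB (~10.7 GB), near the quoted 10.5 GB
```

The "more users → more memory" point falls out directly: each concurrent 4k-token sequence pins its own ~10 GiB of cache, so serving capacity is bounded by KV memory, not just compute.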