Avi Chawla
X @Avi Chawla
Avi Chawla· 2026-02-10 06:30
Learn how LLM inference actually works under the hood.

vLLM has 100k+ lines of code. Mini-SGLang does the same core job in 5,000. It's a compact codebase that serves as both a capable inference engine and a transparent reference for researchers and devs. Something you can actually finish reading over a weekend.

Here's what makes it special:
↳ Clean, type-annotated code you can actually read
↳ Radix cache to reuse KV cache across shared prefixes
↳ Chunked prefill for long contexts without memory blowup
↳ Tensor par ...
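The prefix-reuse idea behind a radix cache can be sketched in a few lines. This is a toy per-token trie, not mini-SGLang's actual implementation or API (a real radix tree compresses token runs into edges, and all names here are hypothetical):

```python
# Toy sketch of prefix-based KV-cache reuse: a per-token trie that
# remembers which KV slot each cached token occupies, so a new request
# sharing a prompt prefix can skip recomputing attention for it.

class PrefixCacheNode:
    def __init__(self):
        self.children = {}   # token id -> child node
        self.kv_slot = None  # where this token's KV entry lives

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()
        self.next_slot = 0

    def match_prefix(self, tokens):
        """Return (num_cached_tokens, kv_slots) for the longest cached prefix."""
        node, slots = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            slots.append(node.kv_slot)
        return len(slots), slots

    def insert(self, tokens):
        """Record KV slots for a token sequence, reusing any cached prefix."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                child = PrefixCacheNode()
                child.kv_slot = self.next_slot  # allocate a new KV slot
                self.next_slot += 1
                node.children[t] = child
            node = node.children[t]

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # first request fills the cache
hit, _ = cache.match_prefix([1, 2, 3, 9])  # second request shares a 3-token prefix
```

In a real engine, those cached slots are what let prefill skip recomputation for shared system prompts or few-shot examples across requests.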
Avi Chawla· 2026-02-09 06:30
This hybrid search stack I mentioned in the post is actually implemented in this open-source context retrieval layer for agents.

GitHub repo: https://t.co/iU6P0KoaRf ...
Avi Chawla· 2026-02-09 06:30
Vector search is not always the answer.

A 30-year-old algorithm with zero training, zero embeddings, and zero fine-tuning still powers Elasticsearch, OpenSearch, and most production search systems today. It's called BM25, and it's worth understanding why it refuses to die.

Let's say you're searching for "transformer attention mechanism" in a library of ML papers. BM25 scores documents using three core ideas:

1) Word rarity matters more than word frequency
Every paper contains "the" and "is" so those words carry n ...
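The textbook Okapi BM25 formula fits in one function: rare words weigh more (IDF), repeated terms saturate, and long documents get normalized. A minimal sketch (the corpus and tokenization are placeholders):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, docs, k1=1.5, b=0.75):
    """Textbook Okapi BM25 score of one document for a query.

    docs: the tokenized corpus, used for IDF and average document length.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarity: rare words weigh more
        f = tf[term]
        # term-frequency saturation (k1) + document-length normalization (b)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

docs = [
    "transformer attention mechanism for translation".split(),
    "the the the convolutional network".split(),
]
q = "transformer attention mechanism".split()
scores = [bm25_score(q, d, docs) for d in docs]  # doc 0 outranks doc 1
```

Note there is no training step anywhere: the score is computed entirely from corpus statistics, which is exactly why BM25 needs zero embeddings and zero fine-tuning.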
Avi Chawla· 2026-02-08 09:16
15-16) Set num_workers and pin_memory in DataLoader.

PyTorch's DataLoader has two terrible default settings. Update them according to your config.

Speedup is shown in the image below 👇 https://t.co/BBuk5RSqLS ...
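A minimal sketch of those two settings, assuming PyTorch is installed (the dataset, batch size, and worker count here are placeholders to tune for your setup):

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 256 samples of 8 features with binary labels.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    # Default is 0: all data loading happens in the main process.
    num_workers=min(4, os.cpu_count() or 1),
    # Default is False: page-locked host buffers speed up host-to-GPU copies,
    # so only enable when a GPU is actually present.
    pin_memory=torch.cuda.is_available(),
)

n_batches = sum(1 for _ in loader)  # 256 / 32 = 8 batches
```

The right `num_workers` depends on CPU count and per-sample preprocessing cost, so benchmark a few values rather than copying a fixed number.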
Avi Chawla· 2026-02-08 09:15
14) Use momentum

In gradient descent, every parameter update solely depends on the current gradient. This leads to unwanted oscillations during optimization. Momentum reduces this by adding a weighted average of previous gradient updates to the update rule.

Check this 👇 https://t.co/77X9rwRyOF ...
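The update rule above can be written in a few lines. A minimal sketch on a 1-D quadratic f(w) = w², whose gradient is 2w (learning rate and decay factor are illustrative defaults):

```python
# Plain gradient descent: each step uses only the current gradient.
def gd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w^2 is 2w
    return w

# Momentum (heavy ball): keep an exponentially decayed running average
# of past gradients and step along it, which damps oscillation and
# accelerates movement along consistently downhill directions.
def gd_momentum(w, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + 2 * w     # weighted average of past gradients
        w -= lr * v
    return w
```

Both converge toward the minimum at 0 here; the payoff of momentum shows up on ill-conditioned losses, where plain gradient descent zig-zags across the narrow direction of a ravine.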
Avi Chawla· 2026-02-08 09:15
I have been training neural networks for 10 years now. Here are 16 ways I actively use to optimize model training:

(detailed explanation ...🧵) https://t.co/5HyMgEOIks ...
Avi Chawla· 2026-02-07 06:32
Generative vs. discriminative models in ML: (a popular ML interview question) https://t.co/fVVwZBkCVR ...
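The short version of the distinction: a generative model learns the joint P(x, y) and classifies via Bayes' rule, while a discriminative model learns P(y | x) or the decision boundary directly. A toy 1-D sketch of both (data and hyperparameters are placeholders):

```python
import math

# Toy 1-D data: class 0 clustered near 0, class 1 clustered near 4.
xs = [0.1, -0.2, 0.3, 3.8, 4.1, 4.3]
ys = [0, 0, 0, 1, 1, 1]

# Generative: model P(x | y) as a Gaussian per class plus the prior P(y),
# then classify with Bayes' rule: argmax_y P(x | y) P(y).
def fit_generative(xs, ys):
    params = {}
    for c in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == c]
        mu = sum(pts) / len(pts)
        var = sum((x - mu) ** 2 for x in pts) / len(pts) + 1e-6
        params[c] = (mu, var, len(pts) / len(xs))
    return params

def predict_generative(params, x):
    def log_joint(c):
        mu, var, prior = params[c]
        return math.log(prior) - 0.5 * (x - mu) ** 2 / var - 0.5 * math.log(var)
    return max(params, key=log_joint)

# Discriminative: skip modeling x entirely; learn P(y | x) directly
# with logistic regression trained by gradient descent.
def fit_discriminative(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict_discriminative(wb, x):
    w, b = wb
    return int(w * x + b > 0)
```

Both give the same predictions on this toy data; they differ in what they can do (a generative model can also sample new x) and in how they behave when the class-conditional assumptions are wrong.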
Avi Chawla· 2026-02-02 19:15
RT Avi Chawla (@_avichawla)

Your embedding stack forces a 100% re-index just to change models. And most teams treat that as unavoidable.

Imagine you built a RAG pipeline with a large embedding model for high retrieval quality, and it ships to production. Six months later, your application traffic and your embedding model costs are soaring while your pipeline struggles to scale. You want to switch to a model that prioritizes cost and latency in order to meet this new demand. But your existing embeddings live in o ...
Avi Chawla· 2026-02-02 11:47
If you found it insightful, reshare it with your network.

Find me → @_avichawla

Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs. https://t.co/lNSHKvmczq

Avi Chawla (@_avichawla): Your embedding stack forces a 100% re-index just to change models. And most teams treat that as unavoidable. Imagine you built a RAG pipeline with a large embedding model for high retrieval quality, and it ships to production. Six months later, your application traffic and https://t.co/EtZ05xrK81 ...