Avi Chawla
X @Avi Chawla
Avi Chawla· 2025-12-10 19:56
Performance Improvement
- The challenge is to accelerate a GPT model's token generation, currently 100 tokens in 42 seconds, aiming for a 5x improvement [1]

Interview Scenario
- The scenario is an AI Engineer interview at OpenAI, highlighting the importance of understanding optimization techniques beyond simply allocating more GPUs [1]
X @Avi Chawla
Avi Chawla· 2025-12-10 12:17
Model Performance
- The model currently generates 100 tokens in 42 seconds [1]
- The goal is to achieve a 5x speed improvement in token generation [1]

Optimization Strategies
- Simply allocating more GPUs is an insufficient solution for optimizing model speed [1]
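A quick back-of-the-envelope check of the numbers in the post, to make the 5x target concrete (the figures come from the post; the arithmetic is just illustration):

```python
# Baseline from the post: 100 tokens generated in 42 seconds.
tokens = 100
seconds = 42.0

baseline_tps = tokens / seconds   # ~2.38 tokens/sec
target_tps = baseline_tps * 5     # ~11.9 tokens/sec for a 5x speedup
target_seconds = seconds / 5      # 8.4 s for the same 100 tokens

print(f"baseline:  {baseline_tps:.2f} tok/s")
print(f"5x target: {target_tps:.2f} tok/s ({target_seconds:.1f} s per 100 tokens)")
```

So a 5x improvement means pushing roughly 2.4 tok/s up to roughly 12 tok/s, which is where techniques like KV caching (below in the feed) come in.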
X @Avi Chawla
Avi Chawla· 2025-12-10 06:42
Key Concepts
- KV caching accelerates inference by pre-computing the prompt's KV cache before token generation [1]
- This pre-computation explains the longer time-to-first-token (TTFT) observed in models like ChatGPT [1]

Performance Bottleneck
- Time-to-first-token (TTFT) is a significant performance metric in inference [1]
- Improving TTFT is an area for further research and development [1]
X @Avi Chawla
Avi Chawla· 2025-12-10 06:42
This is called KV caching!

To reiterate, instead of redundantly computing KV vectors of all context tokens, cache them.

To generate a token:
- Generate the QKV vector for the token generated one step before.
- Get all other KV vectors from the cache.
- Compute attention.

Check this 👇 https://t.co/TvwvdoXJ6m ...
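The decode loop described above can be sketched in a few lines of NumPy. This is a toy single-head example with made-up dimensions and random weights, not any particular model's implementation; the point is only that each step projects just the newest token and appends its K/V pair to a growing cache instead of recomputing K/V for the whole context:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (hypothetical)

# Fixed random matrices standing in for trained Q/K/V projection weights.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention for a single query over all cached keys/values."""
    scores = K @ q / np.sqrt(d)          # (n,) similarity of q to each key
    w = np.exp(scores - scores.max())    # numerically stable softmax
    w /= w.sum()
    return w @ V                         # weighted sum of cached values

# The KV cache grows by one (k, v) pair per generated token.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def step(x):
    """One decode step: project only the newest token, reuse cached K/V."""
    global K_cache, V_cache
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k])    # cache instead of recomputing
    V_cache = np.vstack([V_cache, v])
    return attend(q, K_cache, V_cache)

for t in range(5):
    out = step(rng.normal(size=d))

print(K_cache.shape)  # (5, 8): one cached KV pair per generated token
```

Without the cache, step t would redo t projections through Wk and Wv; with it, each step does constant projection work, which is exactly the speedup the post describes.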
X @Avi Chawla
Avi Chawla· 2025-12-10 06:41
Performance Bottleneck
- The initial response suggests a simple solution of allocating more GPUs, but it misses deeper optimization opportunities [1]
- The model generates 100 tokens in 42 seconds, implying a need for significant speed improvement [1]

Missed Optimization Opportunities
- The response lacks exploration of algorithmic optimizations or model architecture improvements [1]
- The response doesn't consider potential software or hardware bottlenecks beyond GPU allocation [1]
X @Avi Chawla
Avi Chawla· 2025-12-09 19:31
RT Avi Chawla (@_avichawla): AWS did it again!

They have introduced a novel way for developers to build Agents.

Today, when you build an Agent, you start with a simple goal, then end up juggling prompts, routing logic, error handling, tool orchestration, and fallback flows. One unexpected user input and the whole thing collapses.

Strands Agents framework by AWS approaches Agent building differently. It takes a model-driven approach that lets the LLM decide how to plan, choose tools, and adapt to edge cases on its ...
X @Avi Chawla
Avi Chawla· 2025-12-09 13:00
If you found it insightful, reshare it with your network.

Find me → @_avichawla

Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs. https://t.co/np057bqlC3

Avi Chawla (@_avichawla): AWS did it again! They have introduced a novel way for developers to build Agents. Today, when you build an Agent, you start with a simple goal, then end up juggling prompts, routing logic, error handling, tool orchestration, and fallback flows. One unexpected user input and https://t.co/KPS3aKAer9 ...
X @Avi Chawla
Avi Chawla· 2025-12-09 06:31
GitHub repo: https://t.co/twd4zJVi0U ...
X @Avi Chawla
Avi Chawla· 2025-12-09 06:31
AWS did it again!

They have introduced a novel way for developers to build Agents.

Today, when you build an Agent, you start with a simple goal, then end up juggling prompts, routing logic, error handling, tool orchestration, and fallback flows. One unexpected user input and the whole thing collapses.

Strands Agents framework by AWS approaches Agent building differently. It takes a model-driven approach that lets the LLM decide how to plan, choose tools, and adapt to edge cases on its own. You provide the capabil ...
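The "model-driven" idea the post describes can be sketched as a minimal loop where the model, not hand-written routing logic, decides which tool to call. This is an illustrative toy, not the actual Strands Agents API: the tool, the `mock_llm` stand-in, and all names here are hypothetical.

```python
def calculator(expression: str) -> str:
    """Example tool the agent can choose to invoke (toy; hypothetical)."""
    return str(eval(expression))  # never eval untrusted input in real code

TOOLS = {"calculator": calculator}

def mock_llm(prompt: str, tools: dict) -> dict:
    """Stand-in for the LLM: it decides whether to call a tool or answer.
    In a real framework this decision comes from the model itself."""
    if any(ch.isdigit() for ch in prompt):
        return {"action": "tool", "name": "calculator", "input": prompt}
    return {"action": "final", "output": "I need more information."}

def run_agent(goal: str) -> str:
    """The loop only executes what the model planned; there is no
    hand-written routing, error-handling, or fallback flow here."""
    decision = mock_llm(goal, TOOLS)
    if decision["action"] == "tool":
        result = TOOLS[decision["name"]](decision["input"])
        return f"Tool result: {result}"
    return decision["output"]

print(run_agent("2 + 3 * 4"))  # Tool result: 14
```

The contrast with the "juggling prompts and routing logic" approach is that capabilities are declared once (the `TOOLS` dict) and the model's decision drives control flow.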
X @Avi Chawla
Avi Chawla· 2025-12-08 19:06
Educational Resources
- Stanford's CS336 provides a video guide to Karpathy's nanochat, covering essential topics for Frontier AI Labs preparation [1]

Key AI Concepts
- The curriculum includes Tokenization, Resource Accounting, Pretraining, Finetuning (SFT/RLHF), Key Architectures, GPUs, Kernels, Triton, Parallelism, Scaling Laws, Inference, Evaluation, and Alignment [1]