Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA
Your model works! It aces the evals! It even passes the vibe check! All that’s required is inference, right? Oops, you’ve just stepped into a minefield: -Not low-latency enough? Choppy experience. Users churn from your app. -Not cheap enough? You’re losing money on every query. -Not high enough output quality? Your system can’t be used for that application. A model and the inference system around it form a “token factory” associated with a Pareto frontier— a curve representing the best possible trade-offs b ...