NVIDIA Rubin CPX Accelerates Inference for Million‑Token Context AI
NVIDIA·2025-09-09 15:19
AI has truly crossed the chasm from back-office proofs of concept to full mainstream production. And the way that AI is getting deployed and used is through inference. So when we think about inference, it's actually two workloads. There's the context processing, which we call prefill, and then there's the generation of tokens, which we call decode. In order to serve those two disparate processes more efficiently, the industry has developed disaggregated serving. It's a method by which you can basically separate ...
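
To make the prefill/decode split concrete, here is a minimal toy sketch in Python. It is not NVIDIA's implementation and it calls no real serving API. The names `KVCache`, `prefill`, `decode`, and `serve` are hypothetical, and the "model" is a placeholder arithmetic rule. The sketch assumes only what the talk describes: one phase consumes the whole context at once, the other generates tokens one at a time, and the two can be run as separate steps with state handed between them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Hypothetical stand-in for the per-request state handed from prefill to decode."""
    tokens: List[int] = field(default_factory=list)


def prefill(prompt: List[int]) -> KVCache:
    # Context processing: the whole prompt is consumed in a single pass.
    return KVCache(tokens=list(prompt))


def decode(cache: KVCache, max_new_tokens: int) -> List[int]:
    # Token generation: one token per step, each step reading the cached context.
    generated = []
    for _ in range(max_new_tokens):
        # Placeholder "model": a deterministic rule, not a real network.
        next_token = (sum(cache.tokens) + len(cache.tokens)) % 100
        cache.tokens.append(next_token)
        generated.append(next_token)
    return generated


def serve(prompt: List[int], max_new_tokens: int) -> List[int]:
    # In a disaggregated setup these two calls would run on separate
    # worker pools; here they are simply separate functions.
    cache = prefill(prompt)
    return decode(cache, max_new_tokens)


if __name__ == "__main__":
    print(serve([1, 2, 3], 2))
```

Because the only thing that passes between the two functions is the cache object, each phase could in principle be placed on hardware suited to it, which is the idea behind serving them separately.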