NVIDIA Rubin CPX Accelerates Inference for Million-Token Context AI
NVIDIA · 2025-09-09 15:19

AI Inference Landscape

- AI has moved into mainstream production, driven primarily by inference [1]
- Inference comprises two distinct workloads: context processing (prefill) and token generation (decode) [2]
- Disaggregated serving improves efficiency by separating the prefill and decode phases; a minimal code sketch of this split appears at the end of this summary [2]

Emerging Advanced Use Cases

- Advanced use cases require input sequence lengths of millions of tokens, necessitating specialized infrastructure [3] (a rough estimate of why attention cost explodes at this scale follows the sketch below)

Rubin CPX Processor

- The Rubin CPX processor delivers 30 petaflops of NVFP4 AI compute [4]
- It features 128 GB of cost-effective GDDR7 memory and 3x faster attention for context processing [4]
- Combined with NVIDIA Dynamo software, NVFP4 precision, and the Rubin NVL144 rack architecture, it offers unprecedented performance and cost efficiency [4]
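The prefill/decode split described above is easier to see in code. The sketch below is a minimal, illustrative Python version of the two phases; `toy_model_step`, `KVCache`, and all token values are hypothetical stand-ins, not NVIDIA's implementation. In a real disaggregated deployment, such as one orchestrated by NVIDIA Dynamo, the two phases run on separate hardware pools and the KV cache is transferred between them.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache handed off from prefill to decode."""
    tokens: list[int] = field(default_factory=list)


def toy_model_step(cache: KVCache) -> int:
    """Hypothetical stand-in for one model forward pass."""
    return sum(cache.tokens) % 50_000  # fake "next token" id


def prefill(prompt_tokens: list[int]) -> tuple[int, KVCache]:
    """Compute-bound phase: ingest the whole prompt in parallel,
    build the KV cache, and emit the first output token."""
    cache = KVCache(tokens=list(prompt_tokens))
    return toy_model_step(cache), cache


def decode(first_token: int, cache: KVCache, max_new_tokens: int) -> list[int]:
    """Memory-bandwidth-bound phase: generate one token at a time,
    re-reading the growing KV cache on every step."""
    output = [first_token]
    for _ in range(max_new_tokens - 1):
        cache.tokens.append(output[-1])
        output.append(toy_model_step(cache))
    return output


# In a disaggregated deployment, prefill() and decode() would run on
# separate hardware pools tuned for each phase, with the KV cache
# transferred between them; here they simply run in sequence.
token, kv = prefill([101, 2009, 2003, 102])
print(decode(token, kv, max_new_tokens=8))
```

The point of the split is that the two loops stress hardware differently: prefill is one large, parallel, compute-bound pass over the context, while decode is a sequential, memory-bandwidth-bound loop over the KV cache, which is what motivates a context-optimized part like Rubin CPX for the first phase.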
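Why million-token contexts call for dedicated context-processing hardware comes down to attention's quadratic cost in sequence length. The back-of-the-envelope Python below uses the standard rough estimate of ~4·n²·d attention-matmul FLOPs per layer; `d_model = 4096` and `n_layers = 32` are hypothetical model parameters chosen only for illustration.

```python
def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> float:
    """Rough per-forward-pass cost of the attention matmuls
    (QK^T plus attention-times-V): ~4 * n^2 * d per layer."""
    return 4.0 * seq_len**2 * d_model * n_layers


short = attention_flops(8_000)      # typical chat prompt
long_ = attention_flops(1_000_000)  # million-token context
print(f"8K-token prefill: {short:.2e} attention FLOPs")
print(f"1M-token prefill: {long_:.2e} attention FLOPs")
print(f"ratio: {long_ / short:,.0f}x")  # (1_000_000 / 8_000)**2 = 15,625x
```

Going from an 8K-token prompt to a 1M-token prompt multiplies prefill attention work by roughly (1,000,000 / 8,000)² ≈ 15,625x, which is the scaling behind the summary's claim that million-token use cases need specialized infrastructure [3].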