NVIDIA Springs a Surprise New GPU, Blowing Past the Current Flagship

Core Insights
- NVIDIA has launched the Rubin CPX, a dedicated GPU that raises AI inference efficiency by splitting computation into separate context and generation stages, achieving up to 6.5 times the efficiency of current flagship systems [1][3]

Group 1: Product Overview
- The Rubin CPX is purpose-built for long-context workloads, aiming to double the efficiency of current AI inference, particularly in applications that require extensive context windows such as programming and video generation [1]
- The next-generation flagship AI server, the NVIDIA Vera Rubin NVL144 CPX, will integrate 36 Vera CPUs, 144 Rubin GPUs, and 144 Rubin CPX GPUs [1]

Group 2: Performance Metrics
- The upcoming flagship rack will deliver 8 exaFLOPs of NVFP4 compute, 7.5 times that of the GB300 NVL72, alongside 100 TB of high-speed memory and 1.7 PB/s of memory bandwidth [5]
- NVIDIA projects that a $100 million deployment of the new chips will generate $5 billion in revenue for customers [5]

Group 3: Technical Innovation
- NVIDIA's use of two different GPUs for AI inference is a pioneering move, splitting the workload into context and generation phases, which have fundamentally different infrastructure requirements [6]
- The context phase is compute-bound, requiring high throughput to analyze large volumes of input, while the generation phase is memory-bandwidth-bound, depending on fast memory transfers and high-bandwidth interconnects [8] (a minimal sketch illustrating this split follows Group 4)

Group 4: Target Applications
- The Rubin CPX is optimized for long-context performance, capable of handling "millions of tokens" with 30 petaFLOPs of NVFP4 compute and 128 GB of GDDR7 memory [10]
- Approximately 20% of AI applications may experience delays waiting for the first token, underscoring the need for faster processing in tasks such as decoding large codebases or long runs of video frames [10]
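To make the split described in Group 3 concrete, below is a minimal, self-contained Python sketch of disaggregated inference under toy assumptions; it is not NVIDIA's implementation or API, and the names (ContextWorker, GenerationWorker, prefill, decode) and all shapes are hypothetical. The prefill step is one large matrix multiply over the entire prompt, so arithmetic throughput dominates; each decode step re-reads the growing KV cache to produce a single token, so memory bandwidth dominates. That asymmetry is why the two phases reward different hardware.

```python
# Toy sketch of disaggregated inference: a compute-bound "context" (prefill)
# phase hands a KV cache to a bandwidth-bound "generation" (decode) phase.
# All names and shapes are hypothetical illustrations, not NVIDIA's API.
import numpy as np

D_MODEL = 64  # toy model width


class ContextWorker:
    """Prefill: one big matmul over the whole prompt -> compute-bound."""

    def __init__(self, rng: np.random.Generator):
        self.w_kv = rng.standard_normal((D_MODEL, 2 * D_MODEL))

    def prefill(self, prompt_embeddings: np.ndarray) -> np.ndarray:
        # (seq_len, D_MODEL) @ (D_MODEL, 2*D_MODEL): throughput-limited work
        # over the full context window; the result is the KV cache to hand off.
        return prompt_embeddings @ self.w_kv


class GenerationWorker:
    """Decode: per-token steps that mostly stream the cache -> bandwidth-bound."""

    def __init__(self, rng: np.random.Generator):
        self.w_step = rng.standard_normal((2 * D_MODEL, 2 * D_MODEL))

    def decode(self, kv_cache: np.ndarray, n_tokens: int) -> list[int]:
        tokens = []
        for _ in range(n_tokens):
            # Each step touches the entire cache (an attention-like reduction),
            # so bytes moved per FLOP is high: memory bandwidth dominates.
            ctx = kv_cache.mean(axis=0)
            nxt = np.tanh(ctx @ self.w_step)        # small per-token compute
            kv_cache = np.vstack([kv_cache, nxt])   # cache grows every step
            tokens.append(int(nxt.argmax()))        # toy "token id"
        return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = rng.standard_normal((1024, D_MODEL))  # a long prompt, e.g. a codebase

    kv = ContextWorker(rng).prefill(prompt)          # would run on a CPX-class part
    out = GenerationWorker(rng).decode(kv, n_tokens=8)  # would run on an HBM-heavy GPU
    print(out)
```

In a real disaggregated deployment, the KV cache hand-off between the context and generation pools crosses an interconnect, which is one reason the generation side is paired with high-speed memory and high-bandwidth links rather than raw compute.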