
NVIDIA Unveils Next-Gen Rubin CPX GPU: Multi-Million Token Inference, A True Beast


At Tuesday’s AI Infrastructure Summit, NVIDIA introduced the Rubin CPX (Rubin Context GPUs), a new GPU purpose-built for long-context inference workloads that exceed 1 million tokens.

For developers and creators, this breakthrough means more powerful performance across demanding tasks such as software development, long-form video generation, and research applications.

In software engineering, for example, AI systems need to reason across entire repositories—understanding project-level structures to provide meaningful assistance. Similarly, in long video generation or scientific research, models must sustain coherence and memory across millions of tokens. With the release of Rubin CPX, these bottlenecks are finally being addressed.



A New Era of AI Infrastructure: Vera Rubin NVL144 CPX Platform

The Rubin CPX GPU will integrate seamlessly with NVIDIA Vera CPUs and Rubin GPUs, powering the new NVIDIA Vera Rubin NVL144 CPX platform. According to NVIDIA, this MGX-based system delivers 8 exaflops of NVFP4 compute, 100 TB of fast memory, and 1.7 PB/s of memory bandwidth in a single rack.

To support existing customers, NVIDIA will also offer Rubin CPX compute trays for upgrading current Vera Rubin NVL144 systems.

NVIDIA Vera Rubin NVL144 CPX rack and compute trays with Rubin CPX, Rubin GPUs, and Vera CPUs.


Jensen Huang: “A Leap Forward for AI Computing”

NVIDIA founder and CEO Jensen Huang described the launch as a defining moment:

“The Vera Rubin platform marks another leap forward at the frontier of AI computing. Not only does it introduce the next-generation Rubin GPUs, but also a new class of processors: CPX. Just as RTX transformed graphics and physical AI, Rubin CPX is the first CUDA GPU designed for large-scale context, enabling inference across millions of tokens in one pass.”

The industry response was immediate, with many hailing the chip as a game-changer for creators, developers, and researchers alike.



Rubin CPX: Breaking Through the Context Barrier

Large language models are evolving into intelligent agents capable of multi-step reasoning, persistent memory, and long-context comprehension. But these advances push infrastructure to its limits across compute, storage, and networking—demanding a fundamental rethinking of inference scaling.

Decoupled Inference Architecture

NVIDIA’s SMART framework provides the blueprint.

Under this framework, inference workloads split into two distinct phases:

  1. Context processing — compute-intensive, requiring massive throughput to ingest and analyze data for the first token.
  2. Content generation — memory bandwidth-bound, dependent on NVLink interconnects for sustained per-token performance.


By decoupling these phases, NVIDIA enables precise optimization of compute vs. memory resources—boosting throughput, cutting latency, and maximizing utilization.
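
To make the split concrete, here is a minimal sketch of the two phases using PyTorch and Hugging Face Transformers. The model, prompt, and token count are placeholder assumptions; the point is only that phase 1 runs one dense pass over the whole context to build the KV cache, while phase 2 reuses that cache one token at a time.

```python
# Minimal prefill/decode sketch (model and prompt are illustrative stand-ins).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM follows the same two phases
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Summarize the repository layout:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Phase 1 -- context processing (prefill): one compute-heavy pass over the
    # entire prompt produces the KV cache and the first output token.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Phase 2 -- content generation (decode): each step feeds a single token
    # and reuses the cache, so the loop is bound by memory bandwidth, not FLOPs.
    generated = [next_id]
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```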

The orchestration layer is NVIDIA Dynamo, a modular, open-source inference framework that already set new records in MLPerf benchmarks through decoupled inference on the GB200 NVL72.
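
Dynamo's actual APIs are beyond the scope of this article, but the orchestration idea can be sketched generically: route each request's prefill to a compute-optimized worker pool, then hand the resulting KV-cache reference to a bandwidth-optimized pool for decoding. The class and method names below (ContextWorker, GenWorker, Dispatcher) are hypothetical stand-ins, not Dynamo code.

```python
# Hypothetical orchestration sketch; names are illustrative, not Dynamo's API.
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVHandle:
    """Reference to a prefilled KV cache, handed from phase 1 to phase 2."""
    request_id: str
    context_tokens: int


class ContextWorker:
    """Compute-optimized worker (e.g. a Rubin CPX-class GPU) that runs prefill."""
    def prefill(self, request_id: str, prompt_tokens: int) -> KVHandle:
        # One compute-heavy pass over the full prompt builds the KV cache.
        return KVHandle(request_id, prompt_tokens)


class GenWorker:
    """Bandwidth-optimized worker (e.g. an HBM-heavy Rubin GPU) that decodes."""
    def decode(self, kv: KVHandle, max_new_tokens: int) -> str:
        # Token-by-token loop that repeatedly re-reads weights and the KV cache.
        return (f"{kv.request_id}: {max_new_tokens} tokens generated over a "
                f"{kv.context_tokens}-token context")


class Dispatcher:
    """Routes each request's prefill and decode to the appropriate worker pool."""
    def __init__(self, context_pool, gen_pool):
        self._ctx = cycle(context_pool)
        self._gen = cycle(gen_pool)

    def serve(self, request_id: str, prompt_tokens: int, max_new_tokens: int) -> str:
        kv = next(self._ctx).prefill(request_id, prompt_tokens)   # phase 1
        return next(self._gen).decode(kv, max_new_tokens)         # phase 2


dispatcher = Dispatcher([ContextWorker() for _ in range(4)],
                        [GenWorker() for _ in range(2)])
print(dispatcher.serve("req-1", prompt_tokens=1_000_000, max_new_tokens=256))
```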


Rubin CPX: Purpose-Built for Long-Context Inference

Against this backdrop, the Rubin CPX GPU arrives as a specialized solution for long-context workloads. NVIDIA says a single Rubin CPX packs up to 30 petaflops of NVFP4 compute, 128 GB of GDDR7 memory, built-in video encoders and decoders, and attention processing roughly three times faster than GB300 NVL72 systems.

These enhancements ensure faster, more stable performance when handling multi-million-token sequences.
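
How lopsided are the two phases at this scale? A rough, assumption-heavy estimate (the model size, context length, and precision below are illustrative, not Rubin CPX specifications) puts first-token prefill work at a sizable fraction of an exaFLOP, while every decoded token re-reads terabytes of weights and KV cache:

```python
# Back-of-envelope estimate; all figures below are assumptions for illustration.
params = 70e9              # assumed 70B-parameter model
ctx_tokens = 1_000_000     # the multi-million-token contexts the article describes
bytes_per_value = 1        # assumed FP8-style storage for weights and KV cache
layers, hidden = 80, 8192  # typical 70B-class shapes (assumption, no GQA)

# Prefill (context processing): roughly 2 FLOPs per parameter per token,
# so the cost of the first token grows with context length -> compute-bound.
prefill_flops = 2 * params * ctx_tokens
print(f"prefill: ~{prefill_flops / 1e18:.2f} exaFLOPs before the first token")

# Decode (content generation): each new token re-reads the weights plus the
# key/value cache for the whole context -> memory-bandwidth-bound.
kv_cache_bytes = 2 * layers * ctx_tokens * hidden * bytes_per_value
bytes_per_token = params * bytes_per_value + kv_cache_bytes
print(f"decode: ~{bytes_per_token / 1e12:.2f} TB read per generated token")
```

Even with these crude numbers, the asymmetry is clear: context processing wants dense compute, while generation wants memory bandwidth, which is exactly the split the CPX and Rubin GPU pairing is designed around.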



Industry Applause: From Code to Creativity

Early reactions echo the use cases NVIDIA highlighted at launch: coding assistants that reason over entire repositories and long-form video tools that must stay coherent across hours of footage both stand to gain from multi-million-token context windows.


Availability

The NVIDIA Rubin CPX GPU is expected to launch in late 2026.

Until then, the industry eagerly awaits what is being hailed as the “beast” of context inference computing.