Grace Blackwell, explained: 128GB of unified memory

The NVIDIA GB10 Grace Blackwell superchip gets introduced with a headline figure: roughly a petaflop of AI performance. It's a real number and it's impressive. But if you only look at the FLOPs, you'll miss the part of the design that decides what you can actually load onto the machine. For large language model inference, the gating resource is almost never raw compute — it's memory. And the GB10's memory architecture is the genuinely unusual thing about it.

What a Grace Blackwell superchip actually is

The GB10 is not a graphics card. It's a superchip: two distinct processors fused onto a single package and engineered to behave as one system. On one side sits the Grace CPU — a 20-core Arm processor built for general-purpose work, orchestration, and feeding data. On the other sits the Blackwell GPU, NVIDIA's current-generation accelerator with 5th-generation Tensor Cores, where the matrix math of inference happens.

Pairing a strong CPU directly with a GPU isn't new on its own. What makes Grace Blackwell different is how the two halves are joined, and what they share once they're joined.

NVLink-C2C and one coherent memory space

The two dies are connected by NVLink-C2C, a chip-to-chip interconnect that is coherent. Coherence is the key word. It means the Grace CPU and the Blackwell GPU address the same 128GB of unified LPDDR5X memory, with the hardware keeping both processors' views of that memory consistent. There is one pool of memory, and both processors see it as one pool.

To appreciate why that matters, look at how a conventional system works. A discrete GPU has its own dedicated VRAM, separate from the host's system RAM, with a PCIe bus in between. Any data the GPU needs has to be copied from host memory across PCIe into device memory, and results copied back. For LLM inference this is a recurring tax: model weights have to be staged into VRAM before anything runs, prompt tokens and results cross the bus on every request, and whatever doesn't fit in VRAM shuttles back and forth, at which point PCIe bandwidth becomes a wall you keep hitting.
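To put a rough number on that wall, here is a minimal back-of-the-envelope sketch. It assumes one particular offloading scheme, in which layers that don't fit in VRAM are streamed from host RAM for every generated token. The 32 GB/s link rate (roughly the theoretical one-way figure for a PCIe 4.0 x16 slot) and the 10 GB payload are illustrative assumptions, not measurements of any specific system.

```python
# Back-of-the-envelope cost of staging data across a host-to-device link.
# All figures are illustrative assumptions, not measurements.

GB = 1e9  # decimal gigabyte, matching how the article quotes sizes


def transfer_seconds(payload_bytes: float, link_bytes_per_s: float) -> float:
    """Idealized time to push a payload across a link at full bandwidth."""
    return payload_bytes / link_bytes_per_s


# Assumed: ~32 GB/s, roughly the theoretical one-way rate of PCIe 4.0 x16.
ASSUMED_PCIE_BW = 32 * GB

# Hypothetical payload: 10 GB of layers that did not fit in VRAM and are
# streamed from host RAM once per generated token.
offloaded_bytes = 10 * GB

per_token = transfer_seconds(offloaded_bytes, ASSUMED_PCIE_BW)
print(f"copy time per token: {per_token:.3f} s")
print(f"upper bound on speed: {1 / per_token:.1f} tokens/s")
```

Real links fall short of their theoretical rate, so under these assumptions the true ceiling would be lower still. The point is only that the bus, not the GPU, sets the pace once data has to cross it repeatedly.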

What disappears when memory is unified

On the GB10, that host-to-device copy step largely goes away. The CPU can prepare data and the GPU can consume it from the same address, without a round trip across PCIe. Three practical consequences follow:

First, you stop paying the copy tax — less wasted bandwidth and lower latency on the data movement that used to bottleneck a pipeline. Second, the KV cache — the per-token attention state that grows with context length — has a large, shared pool to live in, so long contexts don't immediately blow your memory budget. Third, large models can simply stay resident: the weights load once into unified memory and remain there, available to both processors.

Unified vs. shared vs. discrete memory

Discrete: the GPU has its own separate VRAM; everything must be copied to it over PCIe.

Shared: the GPU borrows a slice of ordinary system RAM, which is cheap but slow and not built for GPU bandwidth.

Unified (the GB10): one fast memory pool that both the CPU and GPU address coherently as first-class citizens, with no copying and no second-class slice.

It's the difference between two people emailing files back and forth and two people editing the same document.

Why 128GB unified beats a 24GB discrete card

Here's where the architecture stops being abstract. A typical high-end consumer GPU ships with 24GB to 32GB of VRAM. That's plenty for a small model, but it puts a hard ceiling on size: the model's weights plus its KV cache have to fit inside that VRAM, full stop. A 70B-parameter model, even quantized to 4-bit, needs roughly 35GB just for weights (70 billion parameters at half a byte each), before any context. That doesn't fit on a 24GB card, or even a 32GB one, so you're forced to either run a smaller model or offload layers to system RAM and accept a steep speed penalty.
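The 35GB figure is simple arithmetic, sketched below. The helper name `weight_gb` is hypothetical, and the calculation counts weights only: real 4-bit formats also store scaling metadata, so actual files tend to run somewhat larger.

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only footprint in decimal GB (ignores quantization metadata)."""
    total_bits = params_billion * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


weights = weight_gb(70, 4)  # 70B parameters at 4 bits each
print(f"70B @ 4-bit: {weights:.0f} GB of weights")

# Compare against the memory sizes discussed in the text.
fits = {capacity: weights <= capacity for capacity in (24, 32, 128)}
for capacity, ok in fits.items():
    verdict = "fits" if ok else "does not fit"
    print(f"{capacity:>3} GB: {verdict} (before any KV cache)")
```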

With 128GB of unified memory, a 70B model in 4-bit fits comfortably, with tens of gigabytes of headroom left over for a generous KV cache and long context windows. The constraint isn't "will it fit" — it's "how much context do you want." That's a categorically different class of machine for local inference, and it's why the GB10 can run 70B-plus models locally at all. When even one machine isn't enough, two GB10 units can be linked over ConnectX to pool capacity for still-larger models.
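How much context that headroom buys depends on the model's shape. The sketch below assumes a Llama-style 70B layout (80 layers, 8 key/value heads of dimension 128) with a 16-bit KV cache, and a hypothetical 40 GB set aside for the cache. None of these figures come from the GB10 itself, so treat them as placeholders to swap for your own model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: one key and one value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed model shape (Llama-style 70B) and a 16-bit cache.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)

# Hypothetical budget: 40 GB of the leftover unified memory for KV cache.
kv_budget_bytes = 40 * 10**9
max_tokens = kv_budget_bytes // per_token

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"tokens that fit in 40 GB: {max_tokens:,}")
print(f"cache for a 32,768-token context: {per_token * 32768 / 1e9:.1f} GB")
```

Under these assumptions a single long context costs on the order of ten gigabytes, which is why the cache competes directly with the weights on a 24GB card and barely registers against 128GB.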

128 GB: Unified memory

NVLink-C2C: Coherent CPU–GPU link

~1 PFLOP: FP4 AI performance

FP4 and what low precision buys you

The other half of the Blackwell story is precision. The GB10's 5th-generation Tensor Cores support FP4 — a 4-bit floating-point format — and that's where the headline ~1 PFLOP figure comes from. Lower precision is a deliberate trade: you spend a little numerical accuracy to gain a lot of throughput. Each value takes fewer bits, so more of them move per cycle and more fit in memory at once.
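To make the accuracy trade concrete, here is a simplified sketch of rounding values onto a 4-bit floating-point grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, and it uses a single scale factor for the whole list. Production FP4 schemes use finer-grained block scaling, so this is an illustration of the idea rather than how the GB10's Tensor Cores or any particular library quantize.

```python
import math

# Magnitudes representable in an E2M1 4-bit float (sign handled separately).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(values):
    """Round values to the nearest FP4 grid point under one shared scale.

    Returns (scale, dequantized_values). Simplified: one scale per list.
    """
    peak = max(abs(v) for v in values)
    scale = peak / 6.0 if peak else 1.0  # map the largest value onto 6.0
    out = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        out.append(math.copysign(nearest * scale, v))
    return scale, out


original = [0.3, -1.2, 2.6, -3.0, 0.7]  # hypothetical values
scale, quantized = quantize_fp4(original)

for before, after in zip(original, quantized):
    print(f"{before:+.2f} -> {after:+.2f}  (error {abs(before - after):.2f})")
```

Each value now needs only 4 bits plus a share of the scale factor, at the cost of the rounding error visible in the last column.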

For inference, that translates into two things people actually feel. Throughput goes up — more tokens generated per second, because the hardware is built to chew through these low-precision operations. And energy per token goes down — fewer bits moved and multiplied means less work and less power for the same output. Combined with unified memory, FP4 is what lets a single desktop-class machine serve a 70B model at a usable speed without a rack and a cooling bill behind it.

So when you see "a petaflop of AI performance," read it correctly: it's the FP4 ceiling, and it only matters because the memory architecture lets you keep a serious model and its context resident long enough to put that compute to work. The FLOPs are the engine. The 128GB of coherent, unified memory is the road that makes the engine worth having.
