Benchmarks·Apr 23, 2026·7 min read

Running 70B models locally: real-world throughput on the GB10

What it's actually like to serve Llama 3.3 70B from a single Grace Blackwell — the context room, the tokens per second, and where a desktop-class GB10 quietly shines against a rack of data-center GPUs.

The first question people ask about the GB10 is some version of "can it even run a 70B model?" The short answer is yes, comfortably, with room to spare. The longer and more honest answer — the one this post is about — is what running one feels like once it's loaded: how fast tokens come back, how much context you can keep in play, and which workloads make a single Grace Blackwell the right tool versus the wrong one.

Why a 70B model fits comfortably

The GB10 ships with 128 GB of unified LPDDR5X memory shared between the CPU and GPU. That unified pool is the headline feature for local inference, because it changes the math on what fits.

A 70-billion-parameter model in full 16-bit precision wants roughly 140 GB just for weights, which is more than the 128 GB pool can hold. But almost nobody serves 70B models at FP16 for interactive use anymore. In 4-bit quantization, those same weights compress to around 40 GB. Load that into a 128 GB pool and you've used less than a third of your memory on the model itself.
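The arithmetic behind those two figures is short enough to check yourself. This is a minimal sketch, assuming weights dominate the footprint; the function name is illustrative, and the 4.5 bits-per-weight figure is an assumption standing in for the scales and other metadata that real 4-bit formats carry on top of the raw 4 bits.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: parameters x bits / 8."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16_gb = weight_footprint_gb(70, 16)    # full 16-bit precision
raw_4bit_gb = weight_footprint_gb(70, 4)  # 4 bits with no quantization metadata
# Assumed effective rate for a typical 4-bit format, metadata included.
effective_4bit_gb = weight_footprint_gb(70, 4.5)

print(f"FP16: {fp16_gb:.0f} GB")
print(f"4-bit (raw): {raw_4bit_gb:.0f} GB")
print(f"4-bit (assumed 4.5 bpw): {effective_4bit_gb:.1f} GB")
```

The raw 4-bit number comes out at 35 GB; the "around 40 GB" in the text is what you get once per-block quantization overhead is counted, and the exact figure varies by format.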

The rest is where the good news compounds. The remaining 80-plus GB (128 GB minus the weights, less whatever the OS and runtime claim) is available for the KV cache, the per-token attention state that grows with your context window. That's what lets you push toward very long contexts without running out of memory. A 24 GB consumer card can't hold a 4-bit 70B model at all without offloading layers to system RAM or dropping to a far more aggressive quantization; on the GB10 the same model leaves headroom measured in tens of gigabytes for cache and long documents.
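To see how much of that headroom a long context actually consumes, the KV cache can be sized from the model's shape. The sketch below assumes the published Llama 3 70B configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and a 16-bit cache; the function name is illustrative, and a given runtime may allocate more or quantize the cache to use less.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` positions in a single sequence."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(128 * 1024) / 1024**3

print(f"per token: {per_token_kib:.0f} KiB")
print(f"128K-token context: {full_context_gib:.0f} GiB")
```

Under those assumptions each token costs 320 KiB, so a full 128K-token context needs about 40 GiB of cache, which fits alongside the 4-bit weights in a 128 GB pool.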

~40 GB: 70B in 4-bit weights
128 GB: Unified memory
128K: Context room to work with

What throughput actually feels like

Here's where we stay honest. The GB10 is a desktop-class part built around high-capacity LPDDR5X, not the high-bandwidth HBM you find in data-center accelerators. For single-stream LLM decoding — one user, generating one token at a time — performance is bound almost entirely by how fast you can stream the model's weights through memory on every token. That makes the GB10 a memory-bandwidth-bound machine for this workload, and it sets realistic expectations.

For a 4-bit 70B model at batch=1 interactive use, you should think in the rough neighborhood of a handful to low tens of tokens per second — comfortably faster than you read, fast enough that a chat or coding session feels responsive, but not the triple-digit throughput a fleet of HBM GPUs delivers under heavy batching. Drop down to a smaller model — an 8B or a 14B — and the numbers climb substantially, because there's far less weight to move per token.

We're deliberately not quoting a single precise benchmark figure here, because the honest number depends on your exact quantization, context length, and sampler settings. The directional truth is what matters: smaller models and short prompts feel snappy, the 70B feels steady and usable, and you are paying for capacity and privacy, not for raw peak throughput.
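Since the honest number is the one from your own settings, it helps to measure it the same way each time. Here is a minimal sketch that times any iterable of streamed tokens; `measure_stream` and its `clock` parameter are hypothetical names for illustration, not part of any serving library, and the token stream is whatever your runtime's streaming call yields.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float]:
    """Return (time_to_first_token_s, decode_tokens_per_s) for a token stream."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    # Exclude the first token from the decode rate: its latency is
    # dominated by prompt processing, not by steady-state generation.
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate
```

Reporting time to first token separately from the decode rate keeps long prompts from dragging down a tokens-per-second figure that is really about generation.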

Why inference is memory-bandwidth-bound

To generate each new token, the model has to read all of its active weights from memory at least once. With a 40 GB model, that's 40 GB of reads per token, so your ceiling is set by memory bandwidth, not raw compute: divide bandwidth by the bytes read per token and you get an upper bound on tokens per second. HBM trades capacity for enormous bandwidth; LPDDR5X trades bandwidth for capacity. The GB10 picks capacity, which is exactly why it can hold a 70B model with long context in the first place.
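That ceiling is a one-line division. The sketch below is a back-of-envelope bound, not a benchmark: the 273 GB/s bandwidth is an assumption taken from the published spec for this class of hardware rather than a figure from this post, the 4.5 GB size for an 8B model is likewise assumed, and the function name is illustrative. Real decode rates land below the bound because of KV cache reads, kernel overhead, and imperfect bandwidth utilization.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on batch=1 decode speed: each token streams all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumption, not from this post: substitute your hardware's measured bandwidth.
ASSUMED_BANDWIDTH_GB_S = 273.0

# Weight sizes in GB; the 8B figure is an assumed 4-bit footprint.
for name, weights_gb in [("70B @ 4-bit", 40.0), ("8B @ 4-bit", 4.5)]:
    ceiling = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, weights_gb)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

With those inputs the 70B bound comes out just under 7 tokens per second, and the smaller model's bound is roughly nine times higher, which is the "far less weight to move per token" effect in numbers.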

Where the GB10 shines — and where it doesn't

Once you frame it as a capacity-and-privacy machine rather than a throughput monster, the right use cases get obvious.

The sweet spot

Single-user, low-concurrency, private inference is the GB10's home turf. Concretely, that means:

Where a data-center GPU still wins is the mirror image: massive concurrent batch serving. If you're fanning out thousands of simultaneous requests, HBM bandwidth and multi-GPU scale-out will leave a single GB10 far behind. That's a different business — a public API serving the world — and it's not what one desktop is for.

Practical notes before you start

A few things worth knowing when you actually load a model:

The cleanest way to find out whether a 70B model at this throughput fits your workflow is to try it on real hardware before you commit to buying one. You can rent a full GB10 on GB10 Studio for about $1 an hour, load Llama 3.3 70B or anything else, and run your own prompts against your own context — no shared cluster, no rate limiter, just one Grace Blackwell that's yours for the session.
