Running 70B models locally: real-world throughput on the GB10
What it's actually like to serve Llama 3.3 70B from a single Grace Blackwell — the context room, the tokens per second, and where a desktop-class GB10 quietly shines against a rack of data-center GPUs.
The first question people ask about the GB10 is some version of "can it even run a 70B model?" The short answer is yes, comfortably, with room to spare. The longer and more honest answer — the one this post is about — is what running one feels like once it's loaded: how fast tokens come back, how much context you can keep in play, and which workloads make a single Grace Blackwell the right tool versus the wrong one.
Why a 70B model fits comfortably
The GB10 ships with 128 GB of unified LPDDR5X memory shared between the CPU and GPU. That unified pool is the headline feature for local inference, because it changes the math on what fits.
A 70-billion-parameter model in full 16-bit precision wants roughly 140 GB just for weights — too much. But almost nobody serves 70B models at FP16 for interactive use anymore. In 4-bit quantization, those same weights compress to around 40 GB. Load that into a 128 GB pool and you've used less than a third of your memory on the model itself.
The rest is where the good news compounds. The leftover ~80 GB is free for the KV cache — the per-token attention state that grows with your context window. That's what lets you push toward very long contexts without the model falling over. On a 24 GB consumer card, a 4-bit 70B model barely fits with a tiny context; on the GB10 the same model leaves you with headroom measured in tens of gigabytes for cache and long documents.
What throughput actually feels like
Here's where we stay honest. The GB10 is a desktop-class part built around high-capacity LPDDR5X, not the high-bandwidth HBM you find in data-center accelerators. For single-stream LLM decoding — one user, generating one token at a time — performance is bound almost entirely by how fast you can stream the model's weights through memory on every token. That makes the GB10 a memory-bandwidth-bound machine for this workload, and it sets realistic expectations.
For a 4-bit 70B model at batch=1 interactive use, you should think in the rough neighborhood of a handful to low tens of tokens per second — comfortably faster than you read, fast enough that a chat or coding session feels responsive, but not the triple-digit throughput a fleet of HBM GPUs delivers under heavy batching. Drop down to a smaller model — an 8B or a 14B — and the numbers climb substantially, because there's far less weight to move per token.
We're deliberately not quoting a single precise benchmark figure here, because the honest number depends on your exact quantization, context length, and sampler settings. The directional truth is what matters: smaller models and short prompts feel snappy, the 70B feels steady and usable, and you are paying for capacity and privacy, not for raw peak throughput.
To generate each new token, the model has to read all of its active weights from memory at least once. With a 40 GB model, that's 40 GB of reads per token — so your ceiling is set by memory bandwidth, not raw compute. HBM trades capacity for enormous bandwidth; LPDDR5X trades bandwidth for capacity. The GB10 picks capacity, which is exactly why it can hold a 70B model with long context in the first place.
Where the GB10 shines — and where it doesn't
Once you frame it as a capacity-and-privacy machine rather than a throughput monster, the right use cases get obvious.
The sweet spot
Single-user, low-concurrency, private inference is the GB10's home turf. Concretely, that means:
- Coding assistants and agentic loops. Long, iterative back-and-forth where one developer drives the model hard for hours.
batch=1is the normal case, not the degenerate one. - Long-context work. Feeding whole codebases, contracts, or research papers into a 70B model and keeping that context resident, thanks to all that free KV-cache headroom.
- Private and air-gapped inference. Prompts that legally or competitively cannot leave your control. The model runs on one machine you can point to.
Where a data-center GPU still wins is the mirror image: massive concurrent batch serving. If you're fanning out thousands of simultaneous requests, HBM bandwidth and multi-GPU scale-out will leave a single GB10 far behind. That's a different business — a public API serving the world — and it's not what one desktop is for.
Practical notes before you start
A few things worth knowing when you actually load a model:
- Quantization is your main dial. 4-bit (Q4) is the pragmatic default for 70B — it roughly halves memory versus 8-bit with minimal quality loss for most tasks. Go to Q5 or Q6 if you have the headroom and want a touch more fidelity; the GB10 gives you room to experiment.
- Context length is a tradeoff, not a free lunch. Every token of context you keep resident consumes KV cache and adds latency to the first token. The GB10 lets you push context far — just don't reserve 128K tokens of room you'll never use.
- It speaks OpenAI. On GB10 Studio every slot exposes an OpenAI-compatible API, so the model you load is a drop-in for any tooling that already talks to
/v1/chat/completions.
The cleanest way to find out whether a 70B model at this throughput fits your workflow is to try it on real hardware before you commit to buying one. You can rent a full GB10 on GB10 Studio for about $1 an hour, load Llama 3.3 70B or anything else, and run your own prompts against your own context — no shared cluster, no rate limiter, just one Grace Blackwell that's yours for the session.
See it for yourself.
Spin up a private Grace Blackwell, load a 70B model, and run your own workload by the minute.
Spin up a 70B session