GB10 vs. the Cloud: The Real Economics of Owning Your Inference

Hosted inference APIs are one of the great deals in software. You pay a few fractions of a cent per thousand tokens, you never touch a GPU, and you scale from one request to a million without thinking about it. For most teams, most of the time, that's exactly right. This post is not an argument that the cloud is bad.

It's an argument that per-token pricing has a shape — and that once your workload changes shape, so does the cheaper option. The question isn't “cloud or hardware,” it's “which one matches how I actually use inference.”

How per-token pricing scales

The thing that makes hosted APIs feel cheap is also what makes them expensive at scale: you pay for every single token, forever. There's no ceiling. A penny per thousand tokens is invisible when you're processing a few thousand requests a day. It is a budget line item when you're processing a few million.

Consider a steady production workload — a document pipeline, an agent that runs all day, a RAG system serving internal users. Suppose it processes, say, 50 million tokens a day across prompts and completions. At a typical hosted price, that's a meaningful monthly bill, and it grows linearly with every new user and every longer context window. Your costs scale with your success, which is fine until the curve gets steep. Per-token pricing rewards you for being small and quiet, and penalizes you for being busy.

What a GB10 actually costs to run

A GB10 has a completely different cost structure, because the expensive part is fixed. The Grace Blackwell GB10 is a real desktop-class machine: 128GB of unified memory, roughly one petaFLOP of FP4 compute, enough to run 70B-parameter models comfortably. If you own one, your cost is the capital outlay amortized over the machine's useful life, plus electricity — and the electricity is genuinely low, because a GB10 runs off a standard wall outlet, not a datacenter power feed.

Amortize the hardware over a few years and add the power bill, and the marginal cost of a token approaches zero. You're not paying per token; you're paying for the box, and then you run it as hard as you like. The trade is that idle time is wasted money — a GB10 sitting dark overnight is still depreciating.

That's exactly the gap renting closes. On GB10 Studio you rent a private Grace Blackwell from about $1.00/hour, billed per minute, behind an OpenAI-compatible API. You get the flat, predictable cost structure of owning hardware without the capital outlay or the idle-time penalty — you pay for the hours you actually run, and nothing when you don't.

$1.00/hrTo rent a private GB10

$0Egress fees

100%Your data stays yours

The costs you stop paying

Token price isn't the whole bill. Hosted inference comes with a set of quieter costs that don't show up on the per-token line item but absolutely show up in practice.

Egress. Moving data in and out of a cloud has a price. Run a high-throughput pipeline and the bandwidth bill becomes its own line item. On a GB10 session, there are no egress fees — period.
Rate limits and queueing. Shared endpoints throttle you. When you need throughput, you're competing with every other tenant for the same capacity, and your latency becomes someone else's problem. A private GB10 has no queue and no neighbor.
Data leaving your environment. Every prompt you send to a hosted API is your data on someone else's hardware. For regulated, proprietary, or sensitive workloads, that's a compliance and risk cost even when it's not a dollar cost. On a private instance, the prompts never leave the machine you rented.

The honest caveat

None of this makes the cloud wrong. If your traffic is spiky, low-volume, or unpredictable, a per-token API is almost certainly cheaper and simpler — you pay nothing when you're idle and you never manage capacity. The break-even only tilts toward hardware when your usage is steady and substantial.

A simple break-even

Here's the framing that cuts through it. Per-token pricing is cheapest when your utilization is low, because you pay only for what you use. Flat hourly hardware is cheapest when your utilization is high, because the cost stops climbing. The crossover is roughly a question of how many hours a day you're actually running inference.

If you spin up inference for a few minutes here and there, on-demand rental or a hosted API wins — don't pay for idle hours. If you're running inference for more than a few hours a day, steadily, the flat-rate math takes over: renting a GB10 by the hour, or owning one outright if you can keep it busy, beats paying per token. And if your workload is also private or high-volume, the hidden costs push the line even further in hardware's favor.

That's the whole calculus. Cloud APIs are a brilliant default for bursty, casual, exploratory work. The moment inference becomes a steady, serious part of what you do, owning or renting real hardware stops being the expensive option and becomes the cheap one — and you get privacy and predictable bills as the bonus.

GB10 vs. the cloud: the real economics of owning your inference

How per-token pricing scales

What a GB10 actually costs to run

The costs you stop paying

A simple break-even

Run inference more than a few hours a day?