Making Tokens, Pt. 4: A City's Worth of Power

Part four of the Making Tokens series. By now we have AI accelerator chips, finished and packaged (Part 3). The next step is putting tens of thousands of them in a single building and pointing them at a mathematical optimization for several months. The economics of doing this at the scale required for frontier AI are not subtle.

The unit of compute

A frontier training cluster in 2026 is built around NVIDIA's Hopper- and Blackwell-generation GPUs, the H100 and B200 (or, increasingly, alternative accelerators: Google's TPU v5p, AWS Trainium, AMD's MI300X). For concreteness let me use the H100 as the reference unit.

A single H100 SXM has the following profile:

Sustained power draw: ~700 W under training load
FP16 throughput: ~989 TFLOP/s peak on dense tensor-core math (roughly 35-50% of peak is achievable in practice on large training runs)
Memory: 80 GB HBM3 at ~3.35 TB/s bandwidth
Price: ~$25,000-40,000 per unit at the OEM level (more in practice once you account for board and system integration)

The H100s are not used as individual cards. They are arranged in 8-GPU server nodes built around the HGX baseboard (NVIDIA's reference design). Each node draws roughly 8-10 kW including the GPUs, CPUs, networking, and local power conversion losses; the eight GPUs alone account for 5.6 kW of that.

Inside a node, the GPUs talk to each other over NVLink (900 GB/s per GPU). Nodes are connected to other nodes via InfiniBand (typically 400 Gbps per link with several links per node). The networking fabric is non-trivial: a 1000-GPU pod typically requires a multi-tier fat-tree topology with NVIDIA Quantum-2 switches.
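To get a feel for why the fabric goes multi-tier, here is a rough sketch of the textbook capacity formula for a non-blocking fat-tree built from identical switches. The 64-port radix and the one-network-port-per-GPU layout are assumptions for illustration; real deployments vary in oversubscription and rail design.

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat-tree of identical switches.

    Textbook result: 2 * (radix / 2) ** tiers, i.e. radix endpoints for a
    single switch, radix^2 / 2 for two tiers, radix^3 / 4 for three.
    """
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be even and >= 2, tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(endpoints: int, radix: int) -> int:
    """Smallest number of switch tiers whose capacity covers `endpoints`."""
    tiers = 1
    while fat_tree_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


if __name__ == "__main__":
    RADIX = 64  # assumed ports per switch (hypothetical value for this sketch)
    for gpus in (1_000, 25_000):
        # Assumption: one fabric port per GPU.
        print(gpus, "GPUs ->", tiers_needed(gpus, RADIX), "tiers")
```

Under those assumptions a 1,000-GPU pod already overflows a single switch and needs two tiers, and a 25,000-GPU cluster needs three.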

Cluster scale

For frontier model training in 2025-2026, the cluster sizes are:

Small / research scale: 1,000-4,000 GPUs (most academic and small lab work)
Production training, mid-tier: 8,000-16,000 GPUs (typical "good" model)
Frontier: 25,000-100,000 GPUs (GPT-4 class and successors)
The new builds being announced: 100,000-300,000 GPUs in single sites (Stargate, xAI's Memphis cluster, others)

A 100,000-GPU cluster has the following raw power and water profile:

Direct GPU power: 100,000 × 700 W = 70 MW
Total IT load (GPUs + CPUs + networking + storage): roughly 100-110 MW
Total facility load (IT load × PUE of ~1.3-1.4 for modern liquid-cooled designs): roughly 130-150 MW
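The three lines above are a chain of multiplications, which can be sketched as follows. The 1.5× multiplier from GPU power to total IT load and the PUE of 1.35 are assumptions chosen to land inside the ranges quoted above, not measured values.

```python
def cluster_power_mw(gpus: int,
                     gpu_watts: float = 700.0,
                     it_overhead: float = 1.5,
                     pue: float = 1.35) -> tuple[float, float, float]:
    """Return (gpu_mw, it_mw, facility_mw) for a cluster.

    it_overhead: assumed ratio of total IT load to GPU-only load
                 (CPUs, networking, storage).
    pue:         assumed power usage effectiveness of the facility.
    """
    gpu_mw = gpus * gpu_watts / 1e6      # watts -> megawatts
    it_mw = gpu_mw * it_overhead
    facility_mw = it_mw * pue
    return gpu_mw, it_mw, facility_mw


if __name__ == "__main__":
    gpu_mw, it_mw, facility_mw = cluster_power_mw(100_000)
    print(f"GPU: {gpu_mw:.0f} MW, IT: {it_mw:.0f} MW, "
          f"facility: {facility_mw:.0f} MW")
```

Changing `it_overhead` or `pue` shows how sensitive the facility number is to the non-GPU parts of the build.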

For scale: 150 MW is roughly the power consumption of a city of 100,000 people in the US, or about 15% of the output of a typical 1 GW nuclear reactor. It is being dedicated, in many of these new facilities, to a single training run.

The water side

The cooling side is the second-order story that doesn't get nearly as much airtime as the power story.

Modern AI training clusters have moved aggressively toward direct-to-chip liquid cooling. The reason is thermal density: air cooling struggles to move 700 W or more out of each GPU package at reasonable temperatures once dozens of them share a rack. You need water (or some dielectric coolant) flowing through cold plates bolted directly to the GPU.

The water from the cold plates is rejected through a heat exchanger to a chilled water loop, which is in turn cooled by either:

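Cooling towers that reject heat via evaporation. This is the cheap, common option. Water consumption for cooling towers runs roughly 0.4-2 gallons per kWh of heat rejected; the low end is set by the latent heat of water, the high end by blowdown, drift, and hot-climate operation.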
Dry coolers that reject heat by air convection (no water consumption). These are less efficient and require more electricity, so they're typically used only where water is scarce or expensive.
Closed-loop chillers with refrigerant. Energy-hungry but no water loss.

For a 150 MW cluster running on cooling towers at typical PUE, total water consumption works out to roughly 1.5-7 million gallons per day. Annualized, that's 0.5-2.5 billion gallons per year per facility.
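A rough sketch of the arithmetic behind the daily figure: heat rejected per day times water consumed per kWh. The two rates passed in below (0.4 and 2.0 gallons per kWh) are assumptions picked because they bracket the 1.5-7 million gallons per day quoted above; the lower one sits near the physical floor implied by water's latent heat of vaporization, which the first function estimates.

```python
LATENT_HEAT_KJ_PER_KG = 2260.0   # approximate latent heat of vaporization of water
KJ_PER_KWH = 3600.0
LITERS_PER_GALLON = 3.785        # US gallon; 1 kg of water is about 1 liter


def evaporation_floor_gal_per_kwh() -> float:
    """Gallons evaporated per kWh if all heat leaves as latent heat."""
    kg_per_kwh = KJ_PER_KWH / LATENT_HEAT_KJ_PER_KG
    return kg_per_kwh / LITERS_PER_GALLON


def daily_water_gallons(heat_mw: float, gal_per_kwh: float) -> float:
    """Water consumed per day for a continuous heat load."""
    kwh_per_day = heat_mw * 1000.0 * 24.0
    return kwh_per_day * gal_per_kwh


if __name__ == "__main__":
    print(f"physical floor: {evaporation_floor_gal_per_kwh():.2f} gal/kWh")
    for rate in (0.4, 2.0):  # assumed low and high consumption rates
        mgal = daily_water_gallons(150.0, rate) / 1e6
        print(f"{rate} gal/kWh -> {mgal:.2f} million gal/day")
```

Multiplying the daily result by 365 gives the annualized range.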

This is a real number that local water authorities are starting to negotiate hard around. It is one of the reasons new training builds are being sited near reliable water supplies (the Pacific Northwest's hydro corridor, parts of west Texas with new water rights, Iceland with its geothermal-and-cold-air combination). It is also one of the reasons local opposition to new datacenter builds is becoming a real political force.

The energy bill

A 150 MW cluster running continuously consumes:

Per year: 150 MW × 8,760 hours = 1.314 TWh (terawatt-hours)
Cost at $0.05/kWh (cheap industrial power): ~$66 million per year in electricity alone
Cost at $0.08/kWh (a more typical industrial rate): ~$105 million per year
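The same arithmetic as a small sketch, so other loads and rates can be plugged in. It assumes a constant load for all 8,760 hours of the year, which is a simplification.

```python
HOURS_PER_YEAR = 8_760


def annual_energy_kwh(load_mw: float) -> float:
    """Energy used in a year by a constant load, in kWh."""
    return load_mw * 1000.0 * HOURS_PER_YEAR


def annual_energy_cost_usd(load_mw: float, usd_per_kwh: float) -> float:
    """Yearly electricity bill for a constant load at a flat rate."""
    return annual_energy_kwh(load_mw) * usd_per_kwh


if __name__ == "__main__":
    print(f"{annual_energy_kwh(150.0) / 1e9:.3f} TWh per year")
    for rate in (0.05, 0.08):
        cost_m = annual_energy_cost_usd(150.0, rate) / 1e6
        print(f"${rate:.2f}/kWh -> ${cost_m:.1f}M per year")
```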

This is the operating power cost. Add facility lease or amortized building costs, network and storage, and salaries for the operations team, and the all-in non-chip cost of running a frontier cluster is on the order of $100-200 million per year.

The capex for the chips themselves is the much bigger number:

100,000 H100s × $30,000 = $3 billion in raw chip cost, before networking, storage, building, and integration
Total facility build cost for a greenfield 150 MW AI campus: typically quoted at $8-15 billion including land, building, power, cooling, networking, and the chips

The chips depreciate fast (a 3-4 year useful life is the typical assumption, possibly shorter as chip generations turn over faster). So the depreciation charge alone, on a fully built cluster, can run to several billion dollars per year.
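A minimal sketch of where a figure like that comes from, using straight-line depreciation. Applying the chips' 3-4 year life to the whole build is an assumption that gives an upper bound, since buildings and power infrastructure normally depreciate more slowly.

```python
def straight_line_depreciation(capex_usd: float, life_years: float) -> float:
    """Annual depreciation charge under a straight-line schedule."""
    if life_years <= 0:
        raise ValueError("life_years must be positive")
    return capex_usd / life_years


if __name__ == "__main__":
    chips = 100_000 * 30_000           # raw chip cost from the text
    for life in (3, 4):
        per_year = straight_line_depreciation(chips, life) / 1e9
        print(f"chips only, {life}-year life: ${per_year:.2f}B per year")

    # Upper bound: whole build written off on the chip schedule (assumption).
    for build in (8e9, 15e9):
        per_year = straight_line_depreciation(build, 4) / 1e9
        print(f"${build / 1e9:.0f}B build, 4-year life: ${per_year:.2f}B per year")
```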

The cost of a training run

What does this all mean per training run? For a frontier-scale pretraining of a model:

Compute consumed: on the order of 10²⁵-10²⁶ FLOPs for current frontier models
Wall-clock time on a 25,000-GPU cluster: roughly 2-4 months
Energy consumed: 100,000 GPU-months (for example, 25,000 GPUs for 4 months) × 700 W × 730 hours/month = roughly 51 GWh at the GPUs alone; ~$3-5M just in electricity, more once facility overhead is included
Cluster amortization for that time: roughly $200-500M depending on cluster pricing assumptions
Total cost of a frontier pretraining run: $500M to $1B+ in compute, before model engineers' salaries and data costs
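The wall-clock and energy lines above can be sanity-checked with a short sketch. The 40% utilization figure is an assumption (achieved throughput as a fraction of peak), and the 5×10²⁵ FLOP budget is just one hypothetical point inside the range given above.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float,
                  gpus: int,
                  peak_flops_per_gpu: float = 989e12,
                  utilization: float = 0.4) -> float:
    """Wall-clock days to execute total_flops on a cluster.

    utilization is the assumed fraction of peak throughput actually achieved.
    """
    sustained = gpus * peak_flops_per_gpu * utilization  # FLOP/s
    return total_flops / sustained / SECONDS_PER_DAY


def gpu_energy_gwh(gpu_months: float,
                   gpu_watts: float = 700.0,
                   hours_per_month: float = 730.0) -> float:
    """GPU-only energy in GWh for a given number of GPU-months."""
    watt_hours = gpu_months * gpu_watts * hours_per_month
    return watt_hours / 1e9


if __name__ == "__main__":
    days = training_days(5e25, 25_000)   # hypothetical compute budget
    print(f"5e25 FLOPs on 25,000 GPUs: about {days:.0f} days")
    print(f"100,000 GPU-months: {gpu_energy_gwh(100_000):.1f} GWh")
```

Halving the utilization doubles the wall-clock time, which is why achieved throughput matters as much as the GPU count.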

For perspective: a frontier training run costs more than most pharma Phase III trials. It costs more than even the most expensive Hollywood films. It produces an artifact (the trained weights) that occupies less than a terabyte on disk.

Where these things actually go

Siting decisions for new training clusters are driven by, in roughly this order:

Power availability. Not the rate, but whether the local grid can actually deliver the megawatts. Most US grids cannot rapidly stand up 150 MW of new load without years of transmission upgrades. The siting maps of the major hyperscalers cluster around regions with excess grid capacity (PJM in the east, ERCOT in Texas, the PNW with hydro overhang).
Power cost. Industrial rates below $0.05/kWh are increasingly hard to find in the US. Texas (wind, gas), the Pacific Northwest (hydro), Iceland (geothermal + hydro), and Quebec (hydro) all remain relatively cheap.
Water access. Per the cooling math above. New builds increasingly assume closed-loop / dry-cooled designs in water-stressed regions, accepting a slight efficiency hit.
Latency to user base. Less critical for training (where you're not serving real-time queries), more critical for inference. Training clusters can be anywhere; inference clusters need to be near population centers.
Permitting and politics. Increasingly a real constraint. Several recent proposed builds have been blocked or delayed by local opposition.

The result is that new training clusters in 2025-2026 are concentrated in: west Texas (Stargate, multiple Oracle/MSFT/AWS builds), Memphis, Tennessee (xAI's cluster), the Pacific Northwest (Microsoft and Google build-outs), Phoenix metro (TSMC ecosystem plus several hyperscaler builds), Northern Virginia (the existing datacenter capital), Iceland and Norway (small but growing for European loads), and scattered Asian sites (Japan, Korea, Singapore).

What's left

At the end of this stage, we have a trained model: a set of weights, a few hundred gigabytes, sitting on object storage somewhere, ready to be deployed. All of the upstream investment, every kg of polysilicon, every wafer, every chip, every megawatt, every gallon of cooling water, has been turned into a single artifact.

The user, in Part 5, now types a query. The model runs one forward pass. One token comes out. We finally get to talk about what it actually costs.