2026-06-01

Anthropic Energy Efficiency Index

Anthropic Inference Energy Efficiency Index (tokens/MWh): A Framework Anchored on the Colossus 1 Deal

TL;DR

Anthropic's May 2026 deal to rent the entire ~300 MW Colossus 1 facility from SpaceX/SpaceXAI for ~$1.25 B/month (~$15 B/year, ~$45 B through May 2029) is best modeled as a dedicated **inference** workload, giving a clean anchor for a tokens/MWh index. Our base-case estimate puts the Colossus 1 site at roughly **300–500 million tokens per facility-MWh**, and a fleet-wide blended index of **~1.0–1.3 billion tokens/MWh** once newer Trainium and TPU capacity is included.
The dominant uncertainty is not chip efficiency (well-bounded by SemiAnalysis InferenceMAX, MLPerf, and academic Joules-per-token data) but token volume, since Anthropic does not publicly disclose inference tokens served; the framework therefore presents low/mid/high scenarios that span roughly a 3–5× range.
The index is most useful as a relative tracking metric over time for Anthropic itself (gen-on-gen hardware refreshes, model-mix shifts toward Haiku/Sonnet, software optimization) rather than a peer-comparison number — the Colossus 1 anchor in particular is a Hopper-heavy, gas-turbine-powered facility that drags fleet efficiency below what newer Trainium3 and Ironwood TPU capacity can achieve.

Key Findings

Colossus 1 is confirmed: 300+ MW, ~220,000 NVIDIA GPUs (H100/H200/GB200), inference-only for Claude. xAI's official announcement and Anthropic's blog both confirm the 300+ MW / 220,000+ GPU scale; SemiAnalysis pegs the mix at "roughly 200,000 H100/H200s and ~30,000 GB200 NVL72." Tom Brown (Anthropic's chief compute officer) publicly stated the capacity will be used for Claude inference, not training.
The financial structure is rental, not capacity-purchase. SpaceX's S-1 (filed May 20, 2026) discloses a $1.25 B/month payment through May 2029, with a discounted rate for the first two months and a 90-day mutual termination clause. The arrangement is best modeled as a 36-month operating lease of compute capacity, with extension to Colossus 2 GB200 nodes announced for June 2026.
Colossus 1 is a meaningful but minority share of Anthropic's total compute footprint by 2026. Anthropic has publicly committed to:
- Up to 5 GW with AWS (Trainium2/3/4, with ~1 GW of combined Trn2/Trn3 online by end-2026 via Project Rainier in Indiana — ~500,000 Trainium2 chips at the start, scaling past 1 M)
- 1 GW+ with Google Cloud (up to 1 million TPUs, primarily Ironwood/v7) coming online through 2026, with an additional multi-GW deal with Google + Broadcom from 2027
- $30 B Azure commitment + 1 GW of NVIDIA GB200/Vera Rubin (announced Nov 2025)
- $50 B Fluidstack-built custom data centers in Texas and New York coming online throughout 2026
- 300 MW Colossus 1 (May 2026) + Colossus 2 GB200 capacity ramping June 2026
So Colossus 1 is roughly 5–10 % of Anthropic's planned 2026 inference-relevant capacity.
Per-megawatt token throughput is well-bounded by independent benchmarks. SemiAnalysis InferenceMAX (Oct 2025) measured an HGX H100 at ~900,000 tokens/s/MW on gpt-oss 120B (MoE, FP4) and an HGX B200 at ~2.8 M tokens/s/MW — about 3× more efficient. For dense models like Llama-3 70B in FP8, throughput is roughly 30–50 % of the MoE FP4 figure. Translating this into tokens/MWh and applying realistic utilization and PUE gives a defensible range for Anthropic.
Colossus 1's PUE is the biggest unknown in the denominator. xAI/SpaceX has not published a PUE. The facility uses on-site methane gas turbines and rapid-build cooling (a mix of open-loop towers, air-cooled chillers, and RDHx-plus-DLC per SemiAnalysis), which suggests a facility PUE in the 1.30–1.50 range — meaningfully worse than AWS's disclosed 1.15 global average or Google's 1.09 fleet average. (If accounted for on a "source-energy" basis, including gas-to-electricity conversion losses, the effective primary-energy PUE rises further.)

Details

1. The Colossus Deal — Verified Facts

Item	Value	Source
Counterparty	SpaceX / SpaceXAI (xAI was folded into SpaceX in Feb 2026)	CNBC, TechCrunch, xAI press release
Facility	Colossus 1, Memphis, Tennessee	xAI announcement
Headline cost	~$1.25 B/month, ~$15 B/year, ~$45 B total	SpaceX S-1 (May 20, 2026); confirmed by Anthropic spokesperson
Duration	Through May 2029 (~36 months), 90-day termination either side	SpaceX S-1
Power capacity	"more than 300 megawatts" / "over 300 megawatts"	Anthropic blog, xAI press release
Accelerator count	"over 220,000 NVIDIA GPUs"	Anthropic blog
Accelerator mix	~150,000 H100 + ~50,000 H200 + ~30,000 GB200 (best public reconstruction; SemiAnalysis: "roughly 200,000 H100/H200s and ~30,000 GB200 NVL72")	SemiAnalysis (Sept 16, 2025); industry reporting
Use case	Inference only (Claude); Anthropic chief compute officer Tom Brown: "In the next few days we'll be ramping up Claude inference on Colossus"	Tom Brown (X, May 2026); Tom's Hardware
Online date	Capacity available "within the month" of announcement (May 2026); discounted rate for May/June ramp	Anthropic; SpaceX S-1
Extension	Colossus 2 GB200 capacity ramping June 2026	Tom Brown (X, May 20, 2026)

Why xAI and Anthropic both wanted this deal: Colossus 1's mixed-generation chip inventory (Hopper + a small Blackwell layer) makes it suboptimal for synchronous training of Grok 5–class models, but well-suited for inference, which is memory-bandwidth-bound and tolerates heterogeneity. Anthropic was capacity-constrained at the consumer tier (Claude Code rate limits had to be doubled the same day), and SpaceX needed cash flow against xAI's $6.4 B operating loss on $3.2 B in revenue in 2025 (a widening gap from its $1.56 B loss on $2.62 B in revenue the year before), per SpaceX's S-1. The deal was announced May 6–7, 2026 at Anthropic's "Code w/ Claude" conference.

2. Anthropic's Broader Compute Footprint (as of May 2026)

Partnership	Power capacity committed	Hardware	Status
AWS (Project Rainier + 5 GW deal)	Up to 5 GW lifetime; ~1 GW of Trainium2 + Trainium3 online by end-2026	Trainium2/3/4, Graviton	~500,000 Trn2 already operational in Indiana (Oct 2025), scaling past 1 M chips
Google Cloud / Broadcom	1 GW+ online in 2026; additional multi-GW from 2027	TPU v7 "Ironwood" (up to 1 M chips)	Operational and expanding
Microsoft Azure + NVIDIA	$30 B Azure spend + 1 GW NVIDIA GB200/Vera Rubin	Hopper/Blackwell/Rubin GPUs	Ramping 2026
Fluidstack (custom build)	$50 B build-out, TX + NY	Likely NVIDIA GB200/B200/H200	Sites coming online throughout 2026
SpaceX/Colossus 1	300 MW	H100/H200/GB200	Inference, live May–June 2026
Approximate 2026 inference-relevant total	~3–4 GW IT power online by year-end	Heterogeneous	Mixed

Anthropic has explicitly stated AWS remains the primary training and cloud provider; Google TPU and the Colossus rental are heavily tilted toward inference and serving. This is important for the index — it means inference's share of Anthropic's total compute is probably 50–65 % of facility power, not the ~30 % inference share that was typical for frontier labs in 2023–2024.

3. Token Volume — The Hardest Number to Pin Down

Anthropic does not publish tokens-served metrics. The Anthropic Economic Index reports describe usage shares, task categories, and collaboration modes, but not absolute throughput. The best available triangulation:

Revenue-derived estimate. Anthropic disclosed an annualized run-rate revenue of $30 B in April 2026 (up from ~$9 B at end of 2025). Working backward from the blended API price (mix of Haiku 4.5 at $1/$5, Sonnet 4.6 at $3/$15, Opus 4.7 at $5/$25 per MTok input/output), and assuming 60 % of revenue is token-priced (the rest being subscription seats, with their own implicit token allowances) at an effective blended rate of ~$6–8 per million tokens (blending input and output, including the 90 % prompt-caching and 50 % batch discounts that reduce effective realized price): derived tokens served ≈ 2.5–4.0 trillion tokens/day (900T–1,500T tokens/year).
Peer comparison sanity check. On Alphabet's Q2 2025 earnings call (July 23, 2025), Sundar Pichai stated, "At I/O in May, we announced that we processed 480 trillion monthly tokens across our surfaces. Since then we have doubled that number, now processing over 980 trillion monthly tokens." At Google I/O 2026 (May 19, 2026), Pichai further disclosed: "In March we were processing half a trillion tokens a day internally across our AI developer tools, and we've been doubling every few weeks. Now, we're processing more than three trillion tokens a day." On Microsoft's Q3 FY2025 earnings (April 30, 2025), Microsoft reported processing "over 100 trillion tokens this quarter (+5x YoY)." Anthropic at 30–40 % of Google's ARR but with a more inference-heavy product (Claude Code reached over $2.5 B in run-rate revenue by February 2026 after hitting $1 B within six months of launch, per Bloomberg/VentureBeat) would land at roughly 2–5T tokens/day — consistent with the revenue-derived range.
Model mix. Claude Code (heavy Sonnet 4.6 use with Opus 4.7 escalation) is the dominant API workload. Anthropic Economic Index data shows coding tasks dominate API traffic; this skews the mix toward Sonnet-class models (mid-sized dense Transformers ~70B-class equivalent in inference footprint).

Bottom line for the index: A defensible point estimate is ~3 trillion tokens/day Anthropic-wide, ~1,100 trillion tokens/year, with a low/mid/high band of 1.5T / 3T / 5T tokens/day.

4. Energy Efficiency Benchmarks for Inference

Hardware	Reported efficiency anchor	Source
NVIDIA H100	~0.39 J/token on Llama-3 70B FP8 (best-case, vLLM); ~900K tokens/s/MW on gpt-oss 120B FP4 (MoE)	John Snow Labs, SemiAnalysis InferenceMAX v1 (Oct 2025)
NVIDIA H200	~1.83–2.14× H100 throughput on long-context Llama; same 700W TDP → ~1.8× tokens/watt	Medium benchmark (Trifonova, 2026)
NVIDIA B200	2.8M tokens/s/MW on gpt-oss 120B FP4; 10× MoE throughput per MW vs Hopper; 4× per-GPU vs H200 on Llama-3.3 70B	SemiAnalysis InferenceMAX v1; NVIDIA
AWS Trainium2	~500 W per chip; ~3× more energy-efficient than Trn1	SemiAnalysis; AWS
AWS Trainium3	"5× higher output tokens per MW vs Trainium2 at same latency"; 40 % better perf/watt vs Trn2	AWS re:Invent 2025 (Matt Garman keynote)
Google TPU v7 Ironwood	2× perf/watt vs Trillium (v6); ~30× vs original 2018 Cloud TPU; ~600 W TDP per chip	Google Cloud blog (April 2025)
Industry literature	0.4–4 J/token range for dense 70B–class models on modern stacks; super-linear scaling with model size	TokenPowerBench (arXiv 2512.03024); ML.ENERGY Benchmark

Datacenter PUE assumptions:

AWS: 1.15 global average (2024 disclosure), best site 1.04
Google: 1.09 fleet (2025)
Hyperscale industry average: 1.15–1.25
Colossus 1: No public disclosure; estimated 1.30–1.50 based on on-site gas turbines, rapid-build infrastructure, mixed cooling (RDHx + DLC + open-loop towers per SemiAnalysis), and a Memphis climate that is hotter and more humid than the Pacific Northwest where AWS's best sites sit.

5. Index Methodology

Formula:

Inference Energy Efficiency Index = (Total inference tokens served, input + output unweighted) ÷ (Total facility-level inference electricity consumed, MWh)

Denominator construction:

MWh_inference = Σ_site (IT power_s [MW] × hours_s × utilization_s × PUE_s)

By design, the denominator captures facility electricity, not just IT load: PUE is applied as a multiplier on chip-level draw to reflect cooling, power-conversion, and lighting overhead. Training MWh are excluded (Project Rainier and a substantial portion of the Google TPU deal are flagged as training-primary and removed from the denominator).
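
As a minimal sketch of the denominator formula, the snippet below computes facility MWh for one site and sums across sites. The `Site` class and `inference_mwh` function are illustrative names, not part of any published tooling; the Colossus 1 inputs are the framework's base-case assumptions (300 MW, 65 % utilization, PUE 1.40).

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8_760


@dataclass(frozen=True)
class Site:
    """One inference site (hypothetical helper type for illustration)."""
    name: str
    it_power_mw: float            # inference IT power, MW
    utilization: float            # fraction of nameplate power drawn under load
    pue: float                    # facility power usage effectiveness
    hours: float = HOURS_PER_YEAR

    def facility_mwh(self) -> float:
        # IT power x hours x utilization x PUE, as in the formula above
        return self.it_power_mw * self.hours * self.utilization * self.pue


def inference_mwh(sites) -> float:
    """Sum facility-level inference MWh over all sites."""
    return sum(site.facility_mwh() for site in sites)


colossus_1 = Site("Colossus 1", it_power_mw=300, utilization=0.65, pue=1.40)
print(f"{colossus_1.facility_mwh():,.0f} MWh/year")  # about 2.39 M MWh/year
```

Keeping PUE and utilization as per-site fields makes it straightforward to rerun the low and high scenarios by swapping inputs.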

Numerator construction:

Tokens_served = Σ_model (requests_m × (input_tokens_m + output_tokens_m))

Input and output are summed without weighting. In practice the numerator must be estimated via revenue-derived methods until Anthropic discloses it directly.
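
The numerator formula above can be sketched as a plain sum over models. The traffic figures here are invented placeholders chosen only to make the arithmetic easy to follow; they are not Anthropic data, and `tokens_served` is a hypothetical helper name.

```python
def tokens_served(traffic: dict) -> int:
    """Unweighted token count.

    `traffic` maps a model name to a tuple of
    (requests, average input tokens, average output tokens).
    """
    return sum(
        requests * (input_tokens + output_tokens)
        for requests, input_tokens, output_tokens in traffic.values()
    )


# Hypothetical daily traffic, for illustration only.
daily_traffic = {
    "haiku": (1_000_000, 700, 300),
    "sonnet": (2_000_000, 7_000, 3_000),
    "opus": (100_000, 14_000, 6_000),
}

print(f"{tokens_served(daily_traffic):,} tokens/day")  # 23,000,000,000 tokens/day
```

Because input and output tokens carry equal weight, a long-context request with a short completion counts the same as a short prompt with a long completion of equal total length.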

Required explicit assumptions:

Utilization rate (fraction of nameplate power actually consumed under serving load): base 65 %, low 50 %, high 80 %. Inference is bursty; rate limits and off-peak troughs depress average utilization well below that of training jobs.
PUE per site: Colossus 1 base 1.40 (low 1.30 / high 1.50); AWS Indiana base 1.15; Google TPU sites base 1.10; Fluidstack TX/NY base 1.25.
Inference-share of capacity: Colossus 100 % inference; Project Rainier 30 % inference / 70 % training; Google TPU 60 % inference / 40 % training; Microsoft Azure 50/50; Fluidstack 70 % inference (Anthropic's stated optimization for "serving" workloads).
Token mix: 65 % Sonnet 4.6, 25 % Haiku 4.5, 10 % Opus 4.7 by token count (reflecting price-driven routing).
Input/output split: 70/30 input/output by tokens (typical for coding workloads with long context + short generation), but both are summed with equal weight in the index.

6. Applying the Index — Colossus Anchor + Anthropic Fleet

6a. Colossus 1 alone (clean anchor case)

Variable	Low	Base	High
IT power, MW	280	300	320
Utilization	50 %	65 %	80 %
PUE	1.30	1.40	1.50
Hours/year	8,760	8,760	8,760
Facility MWh/year	1.59 M	2.39 M	3.36 M
Effective tokens/sec/MW (chip-level, dense Sonnet-class on Hopper-heavy mix)	250,000	400,000	600,000
Tokens/year on Colossus	0.99 × 10¹⁵	2.21 × 10¹⁵	4.41 × 10¹⁵
Index: tokens/MWh	~620 M	~924 M	~1.31 B
Equivalent Joules/token	5.8 J	3.9 J	2.7 J

Cross-check: SemiAnalysis InferenceMAX puts H100 at ~900K tokens/s/MW on gpt-oss 120B FP4 — but that's MoE FP4 best-case. For dense Sonnet-class FP8 inference on a Hopper-heavy mix at realistic production batch sizes and Memphis-climate PUE, 300–500 M tokens/MWh (after dividing the chip-level figure by PUE 1.4 and 65 % utilization) is the most defensible base-case anchor.

Base-case Colossus 1 tokens/MWh ≈ 300–500 million. The 620 M–1.3 B range in the table above reflects benchmark-derived throughput per facility MWh; the headline figure for the index applies a further discount of roughly 2–3× for SLA-bound batch sizes and production traffic, which brings it into the 300–500 M range.
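
The Joules-per-token row in the table is a unit conversion of the index (1 MWh = 3.6 × 10⁹ J). A short sketch, with function names chosen here for illustration:

```python
JOULES_PER_MWH = 3.6e9  # 1 MWh = 3,600 MJ


def joules_per_token(tokens_per_mwh: float) -> float:
    """Convert an index value in tokens/MWh to energy per token in joules."""
    return JOULES_PER_MWH / tokens_per_mwh


def tokens_per_mwh(joules_per_token_value: float) -> float:
    """Inverse conversion: joules per token back to tokens/MWh."""
    return JOULES_PER_MWH / joules_per_token_value


for label, index in [("low", 620e6), ("base", 924e6), ("high", 1.31e9)]:
    print(f"{label}: {joules_per_token(index):.1f} J/token")
# low: 5.8 J/token
# base: 3.9 J/token
# high: 2.7 J/token
```

Applied to the 300–500 M headline range, the same conversion gives roughly 12 J/token at 300 M and 7.2 J/token at 500 M.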

6b. Anthropic fleet-wide (May 2026 snapshot, inference only)

Site	Inference IT MW	PUE	Eff. tokens/s/MW	Annual inference MWh	Annual tokens (×10¹⁵)
Colossus 1 (Hopper-heavy + small GB200)	300	1.40	400K	2.39 M	2.21
Project Rainier (Trn2/Trn3, 30 % inference)	~300 (of ~1 GW)	1.15	500K	1.96 M	2.13
Google TPU Ironwood (60 % inference)	~600 (of ~1 GW)	1.10	700K	3.76 M	4.96
Azure NVIDIA GB200 (50 % inference)	~150	1.20	800K	1.08 M	1.41
Fluidstack TX/NY (ramping)	~200 (partial-year)	1.25	700K	1.10 M	1.09
Colossus 2 GB200 add-on (ramping June '26)	~100 (partial)	1.40	1.5M	0.61 M	1.18
Fleet total (annualized at end-2026 capacity)	~1,650 MW	~1.20 (weighted)	~660K (weighted)	~10.9 M	~13.0
Fleet-wide index: tokens/MWh					~1.19 B

In the base case, Anthropic's fleet-wide inference tokens/MWh ≈ 1.0–1.3 billion — about 2–3× better than Colossus alone, because Trainium3, Ironwood TPU, and GB200/Blackwell capacity carry much better tokens/MWh than the Hopper-heavy Colossus 1 anchor.
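
The fleet totals in the table can be reproduced by summing the per-site rows. This sketch copies the table's MW, PUE, MWh, and token figures as given; it does not rederive them, and the variable names are illustrative.

```python
# (site, inference IT MW, PUE, annual facility MWh, annual tokens)
FLEET = [
    ("Colossus 1",        300, 1.40, 2.39e6, 2.21e15),
    ("Project Rainier",   300, 1.15, 1.96e6, 2.13e15),
    ("Google Ironwood",   600, 1.10, 3.76e6, 4.96e15),
    ("Azure GB200",       150, 1.20, 1.08e6, 1.41e15),
    ("Fluidstack TX/NY",  200, 1.25, 1.10e6, 1.09e15),
    ("Colossus 2 add-on", 100, 1.40, 0.61e6, 1.18e15),
]

total_mw = sum(row[1] for row in FLEET)
total_mwh = sum(row[3] for row in FLEET)
total_tokens = sum(row[4] for row in FLEET)

# Fleet index is total tokens over total facility MWh, not a mean of site indices.
fleet_index = total_tokens / total_mwh
# Capacity-weighted PUE, weighting each site by its inference IT MW.
weighted_pue = sum(row[1] * row[2] for row in FLEET) / total_mw

print(f"Inference capacity: {total_mw} MW")
print(f"Facility energy:    {total_mwh / 1e6:.1f} M MWh")
print(f"Fleet index:        {fleet_index / 1e9:.2f} B tokens/MWh")
print(f"Weighted PUE:       {weighted_pue:.2f}")
```

Dividing summed tokens by summed MWh weights each site by its energy use, so large sites such as the Google TPU capacity dominate the blended figure.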

6c. Sensitivity (one-variable swings on the fleet number)

Variable	Swing	Fleet index change
Token volume (1.5T vs 5T tokens/day)	±50 %	Index moves linearly with tokens served at fixed capacity; no effect on the per-MWh ratio if numerator and capacity scale together, in which case only absolute tokens/year change
Utilization (50 % → 80 %)	+60 %	Index unchanged in tokens/MWh (both numerator and MWh scale) — but absolute energy spend rises 60 %
PUE (Colossus 1.30 → 1.50)	+15 % MWh at Colossus	Fleet index −2 %
Chip mix shift (1 GW Hopper-replaced-by-Blackwell)	+3× on that GW	Fleet index +25–35 %
Model-mix shift to Haiku (Haiku is ~3× cheaper per token & roughly 2–2.5× cheaper energetically)	Haiku share 25 % → 50 %	Fleet index +15–20 %

The single biggest lever to improve the index is hardware refresh from Hopper to Blackwell/Trainium3/Ironwood (~3× per-MW gain on MoE-FP4 workloads, ~2× on dense FP8), not model-mix optimization or PUE improvements.
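
As a rough check on the PUE row, the sketch below rescales only Colossus 1's facility MWh for a different PUE and recomputes the fleet index, holding tokens and all other sites fixed. Constants are taken from the fleet table in 6b; the function name is an illustrative choice.

```python
BASE_COLOSSUS_PUE = 1.40
COLOSSUS_MWH = 2.39e6  # base-case facility MWh at PUE 1.40
OTHER_SITES_MWH = 1.96e6 + 3.76e6 + 1.08e6 + 1.10e6 + 0.61e6
FLEET_TOKENS = 12.98e15  # sum of the per-site token rows


def fleet_index(colossus_pue: float) -> float:
    """Fleet tokens/MWh when only the Colossus 1 PUE changes."""
    colossus_mwh = COLOSSUS_MWH * colossus_pue / BASE_COLOSSUS_PUE
    return FLEET_TOKENS / (OTHER_SITES_MWH + colossus_mwh)


base = fleet_index(BASE_COLOSSUS_PUE)
for pue in (1.30, 1.50):
    change = fleet_index(pue) / base - 1
    print(f"Colossus 1 PUE {pue:.2f}: fleet index {change:+.1%}")
```

Under these inputs the fleet index moves by roughly 1.5 % either side of the base case, the same order of magnitude as the table entry, because Colossus 1 accounts for only about a fifth of fleet MWh.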

7. What This Number Means

A tokens/MWh in the 0.3–1.3 billion range translates to roughly 3–12 Joules per token at the facility level for Anthropic in mid-2026. That is consistent with:

Best-case academic benchmarks (~0.39 J/token on Llama-3 70B FP8 on H100 at the GPU level; multiplying by PUE 1.4 and 1/utilization ≈ 1.5 gives ~0.8 J/token facility-level under ideal conditions; real production is messier by 3–10×)
SemiAnalysis InferenceMAX figures of 900K tokens/s/MW (H100 MoE FP4) ≈ 1.1 J/token IT-level → 1.5–1.7 J/token facility under ideal conditions

So Anthropic's realized tokens/MWh is roughly 3–10× worse than the marketing best-case chip benchmarks, which is exactly what one would expect given real-world utilization, batch-size constraints from latency SLAs, long context windows (which hurt energy per token super-linearly), and Memphis PUE overhead.
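
The benchmark conversions in this section can be sketched as follows. A throughput of N tokens/s per MW means one token costs 10⁶ W ÷ N joules at the IT level; PUE and utilization are then layered on in the same way as in the text. Function names are illustrative.

```python
WATTS_PER_MW = 1_000_000


def it_joules_per_token(tokens_per_s_per_mw: float) -> float:
    """IT-level energy per token implied by a tokens/s/MW throughput figure."""
    return WATTS_PER_MW / tokens_per_s_per_mw


def facility_joules_per_token(it_joules: float, pue: float,
                              utilization: float = 1.0) -> float:
    """Scale chip- or IT-level J/token by PUE and by 1/utilization."""
    return it_joules * pue / utilization


h100_moe = it_joules_per_token(900_000)  # InferenceMAX H100, MoE FP4
print(f"H100 MoE FP4, IT level:       {h100_moe:.2f} J/token")
print(f"H100 MoE FP4, PUE 1.4:        {facility_joules_per_token(h100_moe, 1.4):.2f} J/token")
print(f"Llama-3 70B FP8, facility:    {facility_joules_per_token(0.39, 1.4, 0.65):.2f} J/token")
```

Dividing by utilization here follows the text's treatment of the GPU-level benchmark, where idle and under-loaded capacity is charged to the tokens that are actually served.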

Recommendations

For Anthropic (or an analyst tracking Anthropic) to make this index actionable:

Stage 1 (now, with public data only): Publish the fleet-wide tokens/MWh annually with a low/mid/high band. Use the Colossus deal as the named anchor, since it is the cleanest inference-only, single-site disclosure. Update quarterly as new capacity comes online and as model-mix shifts.
Stage 2 (within 6 months — requires modest internal data): Disclose at minimum:
- Total inference tokens served per quarter (input + output, unweighted, per the framework here)
- PUE for each major site (Anthropic-controlled Fluidstack builds; ask AWS, Google, SpaceX for site PUE under NDA)
- Inference-share of capacity by partner
This collapses the low/mid/high band by roughly 3×.
Stage 3 (12–18 months): Add per-model breakdowns (Haiku/Sonnet/Opus) and input-vs-output decomposition. Move from facility-MWh to delivered-MWh (i.e., subtract retired/idle racks). Add a marginal tokens/MWh metric for newly added capacity, which is the right number to use for capacity-planning decisions.

Decision thresholds (what would change the recommendation):

If Anthropic discloses tokens-served officially → drop the revenue-derived numerator and recompute, expecting a ±30 % revision.
If Colossus 1 is confirmed to have PUE > 1.5 (plausible given its on-site gas turbines and rapid-build cooling) → downgrade Colossus's share contribution by ~10 %.
If the 90-day exit clause is exercised before May 2029 (plausible if Anthropic's Project Rainier and Google Ironwood capacity ramp faster than expected) → re-anchor the index on Project Rainier's Indiana cluster, which would improve the headline number by roughly 2× because of Trainium3 efficiency and AWS's 1.15 PUE.

Caveats

The numerator is the weakest link. No public Anthropic disclosure of tokens served exists. Our revenue-derived estimate uses a blended price assumption that could be off by 50 % in either direction, especially given the discount stack (90 % prompt-caching discount, 50 % batch discount, caching adoption rate, subscription bundling).
InferenceMAX figures are "all-in utility MW" but not full facility PUE. SemiAnalysis benchmarks include server-level overhead but not data-center cooling. We have layered PUE on top, but if InferenceMAX already includes some cooling allocation, our denominator could be slightly double-counted (effect: ~5–10 % overstatement of MWh, understatement of the index).
Colossus 1 PUE is not public. Our 1.30–1.50 range is a reasoned estimate based on facility design and climate; the true value could be outside this range. The on-site gas-turbine power generation is also a separate issue from facility PUE — if one cares about primary energy per token rather than delivered electricity per token (e.g., for sustainability reporting), the gas-turbine generation loss (~55–60 %) roughly doubles the energy denominator.
Heterogeneity across sites is glossed over. Real Colossus 1 has a mix of H100, H200, and a small GB200 layer with very different per-watt characteristics; we used a single weighted average. A more sophisticated version would model each accelerator family separately.
Inference workloads vary enormously in energy per token. Long-context Claude Code requests (50 k+ input tokens with extended thinking) cost super-linearly more energy than short Haiku classification calls. The single index hides a wide internal distribution; per-model breakdowns are a top expansion priority.
The framework excludes training by design, but Anthropic's reported compute allocation suggests training is still ~35–50 % of total electricity. A future expansion should produce a "total compute" tokens/MWh that includes training-amortized tokens (training tokens / amortization horizon), which is the right number for true sustainability accounting.
The index is not directly comparable to OpenAI, Google, or Meta without normalizing for model architecture, latency SLAs, and context-window distribution. Peer comparison is a Stage-3 addition, not a launch feature.
All forward-looking capacity numbers (5 GW AWS, 1 M TPUs, etc.) are commitments, not deployed capacity as of May 2026. The fleet-wide index will move materially as 2026 progresses.

Why the index matters anyway, despite uncertainty: Even at ±50 % precision, tokens/MWh is the right unit-economics metric for AI inference. It is the AI-era analog of "miles per gallon": it lets management track whether the gen-on-gen hardware refresh and software optimization are actually being passed through to a per-token energy improvement (Jensen Huang's claim of 90 % more tokens per GPU year-over-year through software alone should show up here), it lets sustainability teams report a defensible number to investors and regulators, and — most importantly — it lets capacity planners decide whether the next gigawatt should go to more Colossus-class Hopper inference or to Trainium3/Ironwood/Blackwell-class facilities where each MWh delivers 2–3× more output.