Anthropic Energy Efficiency Index
Anthropic Inference Energy Efficiency Index (tokens/MWh): A Framework Anchored on the Colossus 1 Deal
TL;DR
- Anthropic's May 2026 deal to rent the entire ~300 MW Colossus 1 facility from SpaceX/SpaceXAI for ~$1.25 B/month (~$15 B/year, ~$45 B through May 2029) is best modeled as a dedicated inference workload, giving a clean anchor for a tokens/MWh index. Our base-case estimate puts the Colossus 1 site at roughly **300–500 million tokens per facility-MWh**, and a fleet-wide blended index of **~1.0–1.3 billion tokens/MWh** once newer Trainium and TPU capacity is included.
- The dominant uncertainty is not chip efficiency (well-bounded by SemiAnalysis InferenceMAX, MLPerf, and academic Joules-per-token data) but token volume, since Anthropic does not publicly disclose inference tokens served; the framework therefore presents low/mid/high scenarios that span roughly a 3–5× range.
- The index is most useful as a relative tracking metric over time for Anthropic itself (gen-on-gen hardware refreshes, model-mix shifts toward Haiku/Sonnet, software optimization) rather than a peer-comparison number — the Colossus 1 anchor in particular is a Hopper-heavy, gas-turbine-powered facility that drags fleet efficiency below what newer Trainium3 and Ironwood TPU capacity can achieve.
Key Findings
- Colossus 1 is confirmed: 300+ MW, ~220,000 NVIDIA GPUs (H100/H200/GB200), inference-only for Claude. xAI's official announcement and Anthropic's blog both confirm the 300+ MW / 220,000+ GPU scale; SemiAnalysis pegs the mix at "roughly 200,000 H100/H200s and ~30,000 GB200 NVL72." Tom Brown (Anthropic's chief compute officer) publicly stated the capacity will be used for Claude inference, not training.
- The financial structure is rental, not capacity-purchase. SpaceX's S-1 (filed May 20, 2026) discloses a $1.25 B/month payment through May 2029, with a discounted rate for the first two months and a 90-day mutual termination clause. The arrangement is best modeled as a 36-month operating lease of compute capacity, with extension to Colossus 2 GB200 nodes announced for June 2026.
- Colossus 1 is a meaningful but minority share of Anthropic's total compute footprint by 2026. Anthropic has publicly committed to:
- Up to 5 GW with AWS (Trainium2/3/4, with ~1 GW of combined Trn2/Trn3 online by end-2026 via Project Rainier in Indiana — ~500,000 Trainium2 chips at the start, scaling past 1 M)
- 1 GW+ with Google Cloud (up to 1 million TPUs, primarily Ironwood/v7) coming online through 2026, with an additional multi-GW deal with Google + Broadcom from 2027
- $30 B Azure commitment + 1 GW of NVIDIA GB200/Vera Rubin (announced Nov 2025)
- $50 B Fluidstack-built custom data centers in Texas and New York coming online throughout 2026
- 300 MW Colossus 1 (May 2026) + Colossus 2 GB200 capacity ramping June 2026
So Colossus 1 is roughly 5–10 % of Anthropic's planned 2026 inference-relevant capacity.
- Per-megawatt token throughput is well-bounded by independent benchmarks. SemiAnalysis InferenceMAX (Oct 2025) measured an HGX H100 at ~900,000 tokens/s/MW on gpt-oss 120B (MoE, FP4) and an HGX B200 at ~2.8 M tokens/s/MW — about 3× more efficient. For dense models like Llama-3 70B in FP8, throughput is roughly 30–50 % of the MoE FP4 figure. Translating this into tokens/MWh and applying realistic utilization and PUE gives a defensible range for Anthropic.
- Colossus 1's PUE is the biggest unknown in the denominator. xAI/SpaceX has not published a PUE. The facility uses on-site methane gas turbines and rapid-build cooling (a mix of open-loop towers, air-cooled chillers, and RDHx-plus-DLC per SemiAnalysis), which suggests a facility PUE in the 1.30–1.50 range — meaningfully worse than AWS's disclosed 1.15 global average or Google's 1.09 fleet average. (If accounted for on a "source-energy" basis, including gas-to-electricity conversion losses, the effective primary-energy PUE rises further.)
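The unit translation mentioned in the throughput bullet above (tokens/s/MW to tokens/MWh, and to Joules per token) is simple enough to sketch. The snippet below is an illustrative sketch rather than part of any benchmark tooling: the function names are made up here, and the results are IT-level figures before any PUE or utilization adjustment.

```python
SECONDS_PER_HOUR = 3_600

def tokens_per_mwh(tokens_per_s_per_mw):
    """A rate of X tokens/s per MW, held for one hour, yields X * 3600 tokens per MWh."""
    return tokens_per_s_per_mw * SECONDS_PER_HOUR

def joules_per_token(tokens_per_s_per_mw):
    """1 MW is 1e6 J/s, so energy per token is 1e6 / (tokens/s/MW)."""
    return 1e6 / tokens_per_s_per_mw

# InferenceMAX figures quoted in the text (gpt-oss 120B, MoE, FP4).
benchmarks = {"HGX H100": 900_000, "HGX B200": 2_800_000}

for name, rate in benchmarks.items():
    print(f"{name}: {tokens_per_mwh(rate):.3g} tokens/MWh (IT-level), "
          f"{joules_per_token(rate):.2f} J/token")
```

Dividing the tokens/MWh result by a site PUE moves it from IT-level to facility-level energy.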
Details
1. The Colossus Deal — Verified Facts
| Item | Value | Source |
|---|---|---|
| Counterparty | SpaceX / SpaceXAI (xAI was folded into SpaceX in Feb 2026) | CNBC, TechCrunch, xAI press release |
| Facility | Colossus 1, Memphis, Tennessee | xAI announcement |
| Headline cost | ~$1.25 B/month, ~$15 B/year, ~$45 B total | SpaceX S-1 (May 20, 2026); confirmed by Anthropic spokesperson |
| Duration | Through May 2029 (~36 months), 90-day termination either side | SpaceX S-1 |
| Power capacity | "more than 300 megawatts" / "over 300 megawatts" | Anthropic blog, xAI press release |
| Accelerator count | "over 220,000 NVIDIA GPUs" | Anthropic blog |
| Accelerator mix | ~150,000 H100 + ~50,000 H200 + ~30,000 GB200 (best public reconstruction; SemiAnalysis: "roughly 200,000 H100/H200s and ~30,000 GB200 NVL72") | SemiAnalysis (Sept 16, 2025); industry reporting |
| Use case | Inference only (Claude); Anthropic chief compute officer Tom Brown: "In the next few days we'll be ramping up Claude inference on Colossus" | Tom Brown (X, May 2026); Tom's Hardware |
| Online date | Capacity available "within the month" of announcement (May 2026); discounted rate for May/June ramp | Anthropic; SpaceX S-1 |
| Extension | Colossus 2 GB200 capacity ramping June 2026 | Tom Brown (X, May 20, 2026) |
Why xAI and Anthropic both wanted this deal: Colossus 1's mixed-generation chip inventory (Hopper + a small Blackwell layer) makes it suboptimal for synchronous training of Grok 5–class models, but well-suited for inference, which is memory-bandwidth-bound and tolerates heterogeneity. Anthropic was capacity-constrained at the consumer tier (Claude Code rate limits had to be doubled the same day), and SpaceX needed cash flow against xAI's $6.4 B operating loss on $3.2 B in revenue in 2025 (a widening gap from its $1.56 B loss on $2.62 B in revenue the year before), per SpaceX's S-1. The deal was announced May 6–7, 2026 at Anthropic's "Code w/ Claude" conference.
2. Anthropic's Broader Compute Footprint (as of May 2026)
| Partnership | Power capacity committed | Hardware | Status |
|---|---|---|---|
| AWS (Project Rainier + 5 GW deal) | Up to 5 GW lifetime; ~1 GW of Trainium2 + Trainium3 online by end-2026 | Trainium2/3/4, Graviton | ~500,000 Trn2 already operational in Indiana (Oct 2025), scaling past 1 M chips |
| Google Cloud / Broadcom | 1 GW+ online in 2026; additional multi-GW from 2027 | TPU v7 "Ironwood" (up to 1 M chips) | Operational and expanding |
| Microsoft Azure + NVIDIA | $30 B Azure spend + 1 GW NVIDIA GB200/Vera Rubin | Hopper/Blackwell/Rubin GPUs | Ramping 2026 |
| Fluidstack (custom build) | $50 B build-out, TX + NY | Likely NVIDIA GB200/B200/H200 | Sites coming online throughout 2026 |
| SpaceX/Colossus 1 | 300 MW | H100/H200/GB200 | Inference, live May–June 2026 |
| Approximate 2026 inference-relevant total | ~3–4 GW IT power online by year-end | Heterogeneous | Mixed |
Anthropic has explicitly stated AWS remains the primary training and cloud provider; Google TPU and the Colossus rental are heavily tilted toward inference and serving. This is important for the index — it means inference's share of Anthropic's total compute is probably 50–65 % of facility power, not the ~30 % inference share that was typical for frontier labs in 2023–2024.
3. Token Volume — The Hardest Number to Pin Down
Anthropic does not publish tokens-served metrics. The Anthropic Economic Index reports describe usage shares, task categories, and collaboration modes, but not absolute throughput. The best available triangulation:
- Revenue-derived estimate. Anthropic disclosed an annualized run-rate revenue of ~$30 B in April 2026 (up from ~$9 B at end of 2025). Working backward from the blended API price (mix of Haiku 4.5 at $1/$5, Sonnet 4.6 at $3/$15, Opus 4.7 at $5/$25 per MTok input/output), and assuming 60 % of revenue is token-priced (the rest being subscription seats, with their own implicit token allowances) at an effective blended rate of ~$6–8 per million tokens (blending input and output, including the 90 % prompt-caching and 50 % batch discounts that reduce effective realized price): derived tokens served ≈ 2.5–4.0 trillion tokens/day (900T–1,500T tokens/year).
- Peer comparison sanity check. On Alphabet's Q2 2025 earnings call (July 23, 2025), Sundar Pichai stated, "At I/O in May, we announced that we processed 480 trillion monthly tokens across our surfaces. Since then we have doubled that number, now processing over 980 trillion monthly tokens." At Google I/O 2026 (May 19, 2026), Pichai further disclosed: "In March we were processing half a trillion tokens a day internally across our AI developer tools, and we've been doubling every few weeks. Now, we're processing more than three trillion tokens a day." On Microsoft's Q3 FY2025 earnings (April 30, 2025), Microsoft reported processing "over 100 trillion tokens this quarter (+5x YoY)." Anthropic at 30–40 % of Google's ARR but with a more inference-heavy product (Claude Code reached over $2.5 B in run-rate revenue by February 2026 after hitting $1 B within six months of launch, per Bloomberg/VentureBeat) would land at roughly 2–5T tokens/day, consistent with the revenue-derived range.
- Model mix. Claude Code (heavy Sonnet 4.6 use with Opus 4.7 escalation) is the dominant API workload. Anthropic Economic Index data shows coding tasks dominate API traffic; this skews the mix toward Sonnet-class models (mid-sized dense Transformers ~70B-class equivalent in inference footprint).
Bottom line for the index: A defensible point estimate is ~3 trillion tokens/day Anthropic-wide, ~1,100 trillion tokens/year, with a low/mid/high band of 1.5T / 3T / 5T tokens/day.
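The revenue-derived triangulation above can be sanity-checked with a few lines of arithmetic. The sketch below is an assumption-laden restatement, not Anthropic data: it takes the ~$30 B run-rate, the 60 % token-priced share, and the ~$6–8 per MTok effective rate at face value, and it also back-solves the realized price that the quoted 2.5–4.0T tokens/day range would imply. The function names are illustrative.

```python
DAYS_PER_YEAR = 365

def tokens_per_day(arr_usd, token_priced_share, usd_per_mtok):
    """Tokens/day implied by token-priced revenue at a blended $/MTok rate."""
    mtok_per_year = arr_usd * token_priced_share / usd_per_mtok
    return mtok_per_year * 1e6 / DAYS_PER_YEAR

def implied_usd_per_mtok(arr_usd, token_priced_share, tokens_day):
    """Blended $/MTok that would reconcile the revenue with a tokens/day figure."""
    mtok_per_year = tokens_day * DAYS_PER_YEAR / 1e6
    return arr_usd * token_priced_share / mtok_per_year

ARR_USD = 30e9              # annualized run-rate quoted in the text
TOKEN_PRICED_SHARE = 0.60   # assumption stated in the text

for price in (6.0, 8.0):
    t = tokens_per_day(ARR_USD, TOKEN_PRICED_SHARE, price)
    print(f"${price:.0f}/MTok -> {t / 1e12:.1f}T tokens/day")

for t in (2.5e12, 4.0e12):
    p = implied_usd_per_mtok(ARR_USD, TOKEN_PRICED_SHARE, t)
    print(f"{t / 1e12:.1f}T tokens/day -> ${p:.1f}/MTok implied")
```

Taken literally, the stated inputs yield roughly 6–8 trillion tokens/day, while the 2.5–4.0T range corresponds to roughly $12–20 per MTok on the same revenue base. One of the inputs (share, price, or revenue base) therefore needs revisiting before the numerator is treated as tighter than an order-of-magnitude estimate.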
4. Energy Efficiency Benchmarks for Inference
| Hardware | Reported efficiency anchor | Source |
|---|---|---|
| NVIDIA H100 | ~0.39 J/token on Llama-3 70B FP8 (best-case, vLLM); ~900K tokens/s/MW on gpt-oss 120B FP4 (MoE) | John Snow Labs, SemiAnalysis InferenceMAX v1 (Oct 2025) |
| NVIDIA H200 | ~1.83–2.14× H100 throughput on long-context Llama; same 700W TDP → ~1.8× tokens/watt | Medium benchmark (Trifonova, 2026) |
| NVIDIA B200 | 2.8M tokens/s/MW on gpt-oss 120B FP4; 10× MoE throughput per MW vs Hopper; 4× per-GPU vs H200 on Llama-3.3 70B | SemiAnalysis InferenceMAX v1; NVIDIA |
| AWS Trainium2 | ~500 W per chip; ~3× more energy-efficient than Trn1 | SemiAnalysis; AWS |
| AWS Trainium3 | "5× higher output tokens per MW vs Trainium2 at same latency"; 40 % better perf/watt vs Trn2 | AWS re:Invent 2025 (Matt Garman keynote) |
| Google TPU v7 Ironwood | 2× perf/watt vs Trillium (v6); ~30× vs original 2018 Cloud TPU; ~600 W TDP per chip | Google Cloud blog (April 2025) |
| Industry literature | 0.4–4 J/token range for dense 70B–class models on modern stacks; super-linear scaling with model size | TokenPowerBench (arXiv 2512.03024); ML.ENERGY Benchmark |
Datacenter PUE assumptions:
- AWS: 1.15 global average (2024 disclosure), best site 1.04
- Google: 1.09 fleet (2025)
- Hyperscale industry average: 1.15–1.25
- Colossus 1: No public disclosure; estimated 1.30–1.50 (base 1.40) based on on-site gas turbines, rapid-build infrastructure, mixed cooling (RDHx + DLC + open-loop towers per SemiAnalysis), and a Memphis climate that is hotter and more humid than the Pacific Northwest where AWS's best sites sit.
5. Index Methodology
Formula:
Inference Energy Efficiency Index = (Total inference tokens served, input + output unweighted) ÷ (Total facility-level inference electricity consumed, MWh)
Denominator construction:
MWh_inference = Σ_site (IT power_s [MW] × hours_s × utilization_s × PUE_s)
By design, the denominator captures facility electricity, not just IT load: PUE is applied as a multiplier on IT-level draw to reflect cooling, power-conversion, and lighting overhead. Training MWh are excluded: Project Rainier and a substantial portion of the Google TPU deal are flagged as training-primary, so only their inference share enters the denominator.
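The denominator formula above maps directly onto a small data structure. The following is a minimal sketch, assuming the base-case utilization and PUE values used elsewhere in this framework; the `Site` class and its field names are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8_760

@dataclass
class Site:
    name: str
    inference_it_mw: float   # IT power attributed to inference, MW
    utilization: float       # fraction of nameplate power drawn under load
    pue: float               # facility energy / IT energy
    hours: float = HOURS_PER_YEAR

    def facility_mwh(self):
        """IT power x hours x utilization x PUE, as in the formula above."""
        return self.inference_it_mw * self.hours * self.utilization * self.pue

def inference_mwh(sites):
    """Sum facility-level inference MWh across sites."""
    return sum(site.facility_mwh() for site in sites)

# Two illustrative rows using base-case assumptions from this document.
sites = [
    Site("Colossus 1", 300, 0.65, 1.40),
    Site("Project Rainier (inference share)", 300, 0.65, 1.15),
]

for site in sites:
    print(f"{site.name}: {site.facility_mwh() / 1e6:.2f} M MWh/year")
print(f"Total: {inference_mwh(sites) / 1e6:.2f} M MWh/year")
```

Training capacity never appears here because only the inference share of each site's IT power is passed in.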
Numerator construction:
Tokens_served = Σ_model (requests_m × (input_tokens_m + output_tokens_m))
Input and output are summed without weighting, by design. In practice the numerator must be estimated via revenue-derived methods until Anthropic discloses it directly.
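If per-model traffic data were available, the numerator formula could be computed as sketched below. All traffic figures here are hypothetical placeholders chosen for easy arithmetic, and the sketch reads `input_tokens_m` and `output_tokens_m` as per-request averages, which is an interpretation of the formula rather than something the source states.

```python
from typing import NamedTuple

class ModelTraffic(NamedTuple):
    requests: float        # requests served in the period
    input_tokens: float    # mean input tokens per request (assumed meaning)
    output_tokens: float   # mean output tokens per request (assumed meaning)

def tokens_served(traffic):
    """Sum requests x (input + output) over models, with no weighting."""
    return sum(t.requests * (t.input_tokens + t.output_tokens)
               for t in traffic.values())

# Hypothetical traffic, not Anthropic data.
traffic = {
    "small-model": ModelTraffic(requests=1_000_000, input_tokens=700, output_tokens=300),
    "mid-model": ModelTraffic(requests=500_000, input_tokens=7_000, output_tokens=3_000),
}

print(f"{tokens_served(traffic):,} tokens served")
```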
Required explicit assumptions:
- Utilization rate (fraction of nameplate power actually consumed under serving load): base 65 %, low 50 %, high 80 %. Inference is bursty; rate limits and off-peak troughs depress average utilization well below training jobs.
- PUE per site: Colossus 1 base 1.40 (low 1.30 / high 1.50); AWS Indiana base 1.15; Google TPU sites base 1.10; Fluidstack TX/NY base 1.25.
- Inference-share of capacity: Colossus 100 % inference; Project Rainier 30 % inference / 70 % training; Google TPU 60 % inference / 40 % training; Microsoft Azure 50/50; Fluidstack 70 % inference (Anthropic's stated optimization for "serving" workloads).
- Token mix: 65 % Sonnet 4.6, 25 % Haiku 4.5, 10 % Opus 4.7 by token count (reflecting price-driven routing).
- Input/output split: 70/30 input/output by tokens (typical for coding workloads with long context + short generation); both are summed equally in the numerator.
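The token-mix and input/output assumptions above can be combined with the list prices quoted in section 3 to get a blended list price per million tokens. This is a rough sketch under two loudly stated assumptions: the same 70/30 split applies to every model, and no caching, batch, or subscription discounts are applied. The variable names are illustrative.

```python
# (input $/MTok, output $/MTok) list prices as quoted in section 3.
LIST_PRICE = {
    "Haiku 4.5": (1.0, 5.0),
    "Sonnet 4.6": (3.0, 15.0),
    "Opus 4.7": (5.0, 25.0),
}
# Share of total tokens by model, from the assumptions list above.
TOKEN_MIX = {"Sonnet 4.6": 0.65, "Haiku 4.5": 0.25, "Opus 4.7": 0.10}
INPUT_SHARE = 0.70  # assumed identical for every model

def blended_list_price(prices=LIST_PRICE, mix=TOKEN_MIX, input_share=INPUT_SHARE):
    """Token-weighted average list price in $/MTok, before any discounts."""
    total = 0.0
    for model, share in mix.items():
        price_in, price_out = prices[model]
        per_mtok = input_share * price_in + (1 - input_share) * price_out
        total += share * per_mtok
    return total

print(f"Blended list price: ${blended_list_price():.2f} per MTok")
```

Under these assumptions the blend comes out near $5.9 per MTok at list price, which is a useful reference point when judging the effective realized rate used for the numerator.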
6. Applying the Index — Colossus Anchor + Anthropic Fleet
6a. Colossus 1 alone (clean anchor case)
| Variable | Low | Base | High |
|---|---|---|---|
| IT power, MW | 280 | 300 | 320 |
| Utilization | 50 % | 65 % | 80 % |
| PUE | 1.30 | 1.40 | 1.50 |
| Hours/year | 8,760 | 8,760 | 8,760 |
| Facility MWh/year | 1.59 M | 2.39 M | 3.36 M |
| Effective tokens/sec/MW (chip-level, dense Sonnet-class on Hopper-heavy mix) | 250,000 | 400,000 | 600,000 |
| Tokens/year on Colossus | 0.99 × 10¹⁵ | 2.21 × 10¹⁵ | 4.41 × 10¹⁵ |
| Index: tokens/MWh | ~620 M | ~924 M | ~1.31 B |
| Equivalent Joules/token | 5.8 J | 3.9 J | 2.7 J |
Cross-check: SemiAnalysis InferenceMAX puts H100 at ~900K tokens/s/MW on gpt-oss 120B FP4, but that is a MoE FP4 best case. The table's ~620 M–1.31 B tokens/MWh row already divides by facility MWh (PUE included), yet it takes the effective tokens/s/MW inputs at face value. Utilization scales tokens and MWh alike, so it drops out of the ratio.
Base-case Colossus 1 tokens/MWh ≈ 300–500 million: the headline figure sits roughly 2–3× below the table's ~924 M base case, reflecting a further derating for dense Sonnet-class FP8 serving at SLA-bound production batch sizes on a Hopper-heavy mix, and it is the most defensible base-case anchor.
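The Colossus 1 scenario table can be approximately reproduced from its own inputs. The sketch below is a reconstruction, not the author's worksheet: the 0.90 availability factor is an assumption inferred from the fact that the table's tokens/year figures are about 0.9× the straight product of throughput, IT power, utilization, and seconds per year.

```python
HOURS_PER_YEAR = 8_760
SECONDS_PER_HOUR = 3_600
JOULES_PER_MWH = 3.6e9
# Assumption: ~90 % availability, inferred from the table, not stated in it.
AVAILABILITY = 0.90

SCENARIOS = {
    "low":  {"it_mw": 280, "util": 0.50, "pue": 1.30, "tok_s_mw": 250_000},
    "base": {"it_mw": 300, "util": 0.65, "pue": 1.40, "tok_s_mw": 400_000},
    "high": {"it_mw": 320, "util": 0.80, "pue": 1.50, "tok_s_mw": 600_000},
}

def evaluate(it_mw, util, pue, tok_s_mw):
    """Return (facility MWh/yr, tokens/yr, tokens per facility-MWh, J/token)."""
    facility_mwh = it_mw * util * pue * HOURS_PER_YEAR
    tokens = (tok_s_mw * it_mw * util
              * HOURS_PER_YEAR * SECONDS_PER_HOUR * AVAILABILITY)
    index = tokens / facility_mwh
    return facility_mwh, tokens, index, JOULES_PER_MWH / index

for name, params in SCENARIOS.items():
    mwh, tokens, index, jpt = evaluate(**params)
    print(f"{name:>4}: {mwh / 1e6:.2f} M MWh, {tokens / 1e15:.2f}e15 tokens, "
          f"{index / 1e6:.0f} M tokens/MWh, {jpt:.1f} J/token")
```

With these inputs the rows match the table to within a few percent. Because IT power and utilization appear in both tokens and MWh, the index reduces to throughput × 3,600 × availability ÷ PUE.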
6b. Anthropic fleet-wide (May 2026 snapshot, inference only)
| Site | Inference IT MW | PUE | Eff. tokens/s/MW | Annual inference MWh | Annual tokens (×10¹⁵) |
|---|---|---|---|---|---|
| Colossus 1 (Hopper-heavy + small GB200) | 300 | 1.40 | 400K | 2.39 M | 2.21 |
| Project Rainier (Trn2/Trn3, 30 % inference) | ~300 (of ~1 GW) | 1.15 | 500K | 1.96 M | 2.13 |
| Google TPU Ironwood (60 % inference) | ~600 (of ~1 GW) | 1.10 | 700K | 3.76 M | 4.96 |
| Azure NVIDIA GB200 (50 % inference) | ~150 | 1.20 | 800K | 1.08 M | 1.41 |
| Fluidstack TX/NY (ramping) | ~200 (partial-year) | 1.25 | 700K | 1.10 M | 1.09 |
| Colossus 2 GB200 add-on (ramping June '26) | ~100 (partial) | 1.40 | 1.5M | 0.61 M | 1.18 |
| Fleet total (annualized at end-2026 capacity) | ~1,650 MW | ~1.20 (weighted) | ~660K (weighted) | ~10.9 M | ~13.0 |
| Fleet-wide index: tokens/MWh | ~1.19 B |
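In the base case, Anthropic's fleet-wide inference tokens/MWh ≈ 1.0–1.3 billion, about 2–3× better than the 300–500 M Colossus 1 headline, because Trainium3, Ironwood TPU, and GB200/Blackwell capacity carry much better tokens/MWh than the Hopper-heavy Colossus 1 anchor. (The Colossus 1 row in the table carries the undiscounted 6a base case, 2.21 × 10¹⁵ tokens on 2.39 M MWh, or ~924 M tokens/MWh.)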
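The fleet-wide figure is just the ratio of the two column totals in the table above. The sketch below takes the per-site MWh and token values exactly as printed in the table, without re-deriving them, and computes the per-site and fleet indices; the tuple layout is an illustrative choice.

```python
# (site, annual inference MWh in millions, annual tokens in units of 1e15),
# copied from the fleet table as given.
FLEET = [
    ("Colossus 1", 2.39, 2.21),
    ("Project Rainier", 1.96, 2.13),
    ("Google TPU Ironwood", 3.76, 4.96),
    ("Azure NVIDIA GB200", 1.08, 1.41),
    ("Fluidstack TX/NY", 1.10, 1.09),
    ("Colossus 2 GB200 add-on", 0.61, 1.18),
]

total_mwh = sum(mwh for _, mwh, _ in FLEET) * 1e6
total_tokens = sum(tok for _, _, tok in FLEET) * 1e15
fleet_index = total_tokens / total_mwh

for site, mwh, tok in FLEET:
    print(f"{site}: {tok * 1e15 / (mwh * 1e6) / 1e9:.2f} B tokens/MWh")
print(f"Fleet: {fleet_index / 1e9:.2f} B tokens/MWh")
```

Because the index is energy-weighted, large sites such as the Google TPU capacity move the fleet number far more than small, highly efficient additions.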
6c. Sensitivity (one-variable swings on the fleet number)
| Variable | Swing | Fleet index change |
|---|---|---|
| Token volume (1.5T vs 5T tokens/day) | −50 % / +67 % vs the 3T base | Index moves linearly with volume if capacity is held fixed; no effect on the per-MWh ratio if capacity scales with volume (only absolute tokens/year changes) |
| Utilization (50 % → 80 %) | +60 % | Index unchanged in tokens/MWh (both numerator and MWh scale) — but absolute energy spend rises 60 % |
| PUE (Colossus 1.30 → 1.50) | +15 % MWh at Colossus | Fleet index ≈ −3 % across the full swing (≈ −1.5 % from the 1.40 base) |
| Chip mix shift (1 GW Hopper-replaced-by-Blackwell) | +3× on that GW | Fleet index +25–35 % |
| Model-mix shift to Haiku (Haiku is ~3× cheaper per token & roughly 2–2.5× cheaper energetically) | Haiku share 25 % → 50 % | Fleet index +15–20 % |
The single biggest lever to improve the index is hardware refresh from Hopper to Blackwell/Trainium3/Ironwood (~3× per-MW gain on MoE-FP4 workloads, ~2× on dense FP8), not model-mix optimization or PUE improvements.
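To illustrate why PUE is a small lever at the fleet level, the following sketch varies only the Colossus 1 PUE while holding fleet tokens and every other site's MWh fixed at the table totals. That "everything else fixed" framing is an assumption of this example, and the constant names are illustrative.

```python
HOURS_PER_YEAR = 8_760

FLEET_TOKENS = 13.0e15       # annual tokens, fleet table total
FLEET_MWH_BASE = 10.9e6      # annual inference MWh, fleet table total
COLOSSUS_MWH_BASE = 2.39e6   # Colossus 1 row at PUE 1.40
OTHER_SITES_MWH = FLEET_MWH_BASE - COLOSSUS_MWH_BASE

# Colossus 1 IT-level energy: 300 MW at 65 % utilization for a full year.
COLOSSUS_IT_MWH = 300 * 0.65 * HOURS_PER_YEAR

def fleet_index(colossus_pue):
    """Fleet tokens/MWh when only the Colossus 1 PUE changes."""
    return FLEET_TOKENS / (OTHER_SITES_MWH + COLOSSUS_IT_MWH * colossus_pue)

low, high = fleet_index(1.30), fleet_index(1.50)
change = high / low - 1
print(f"PUE 1.30: {low / 1e9:.3f} B tokens/MWh")
print(f"PUE 1.50: {high / 1e9:.3f} B tokens/MWh")
print(f"Change across the swing: {change:.1%}")
```

A swing of a few percent is small next to the hardware-refresh lever, which is why the text ranks PUE last among the levers.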
7. What This Number Means
A tokens/MWh in the 0.3–1.3 billion range translates to roughly 3–12 Joules per token at the facility level for Anthropic in mid-2026 (1 MWh = 3.6 × 10⁹ J). That is consistent with:
- Best-case academic benchmarks (~0.39 J/token on Llama-3 70B FP8 on H100 at the GPU level — multiply by PUE 1.4 and 1/utilization ≈ 1.5 to get ~0.9 J/token facility-level under ideal conditions; real production is messier by 3–10×)
- SemiAnalysis InferenceMAX figures of 900K tokens/s/MW (H100 MoE FP4) ≈ 1.1 J/token IT-level → 1.5–1.7 J/token facility under ideal conditions
So Anthropic's realized tokens/MWh is roughly 3–10× worse than the marketing best-case chip benchmarks, which is exactly what one would expect given real-world utilization, batch-size constraints from latency SLAs, long context windows (which hurt energy per token super-linearly), and Memphis PUE overhead.
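The comparison above rests on two conversions: scaling a chip-level Joules-per-token figure up to facility level, and turning a tokens/MWh index into Joules per token. A minimal sketch follows, using the same PUE 1.4 and 65 % utilization adjustments as the text; the helper names are hypothetical.

```python
JOULES_PER_MWH = 3.6e9

def facility_j_per_token(chip_j_per_token, pue, utilization=1.0):
    """Scale chip-level J/token by PUE and by 1/utilization for idle overhead."""
    return chip_j_per_token * pue / utilization

def index_to_joules(tokens_per_mwh):
    """Convert a tokens/MWh index to Joules per token."""
    return JOULES_PER_MWH / tokens_per_mwh

# Best-case academic benchmark: 0.39 J/token on H100, Llama-3 70B FP8.
best_case = facility_j_per_token(0.39, pue=1.4, utilization=0.65)
# InferenceMAX H100 figure: 900K tokens/s/MW, i.e. 1e6 / 900_000 J/token.
inferencemax = facility_j_per_token(1e6 / 900_000, pue=1.4)

print(f"Best-case benchmark, facility level: {best_case:.2f} J/token")
print(f"InferenceMAX H100, facility level:   {inferencemax:.2f} J/token")
for index in (1.3e9, 0.3e9):
    print(f"{index / 1e9:.1f} B tokens/MWh -> {index_to_joules(index):.1f} J/token")
```

The index range of 0.3–1.3 billion tokens/MWh maps to about 12 J down to about 2.8 J per token, against roughly 0.8–1.6 J for the idealized benchmark cases.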
Recommendations
For Anthropic (or an analyst tracking Anthropic) to make this index actionable:
- Stage 1 (now, with public data only): Publish the fleet-wide tokens/MWh annually with a low/mid/high band. Use the Colossus deal as the named anchor, since it is the cleanest inference-only, single-site disclosure. Update quarterly as new capacity comes online and as model-mix shifts.
- Stage 2 (within 6 months — requires modest internal data): Disclose at minimum:
- Total inference tokens served per quarter (input + output, unweighted, per the framework here)
- PUE for each major site (Anthropic-controlled Fluidstack builds; ask AWS, Google, SpaceX for site PUE under NDA)
- Inference-share of capacity by partner
This collapses the low/mid/high band by roughly 3×.
- Stage 3 (12–18 months): Add per-model breakdowns (Haiku/Sonnet/Opus) and input-vs-output decomposition. Move from facility-MWh to delivered-MWh (i.e., subtract retired/idle racks). Add a marginal tokens/MWh metric for newly added capacity, which is the right number to use for capacity-planning decisions.
Decision thresholds (what would change the recommendation):
- If Anthropic discloses tokens-served officially → drop the revenue-derived numerator and recompute, expecting a ±30 % revision.
- If Colossus 1 is confirmed to have PUE > 1.5 (plausible given its on-site gas turbines and rapid-build cooling) → lower Colossus 1's tokens/MWh contribution by ~10 %.
- If the 90-day exit clause is exercised before May 2029 (plausible if Anthropic's Project Rainier and Google Ironwood capacity ramp faster than expected) → re-anchor the index on Project Rainier's Indiana cluster, which would improve the headline number by roughly 2× because of Trainium3 efficiency and AWS's 1.15 PUE.
Caveats
- The numerator is the weakest link. No public Anthropic disclosure of tokens served exists. Our revenue-derived estimate uses a blended price assumption that could be off by 50 % in either direction, especially given the discount stack (the 90 % prompt-caching discount, the 50 % batch discount, the caching adoption rate, and subscription bundling).
- InferenceMAX figures are "all-in utility MW" but not full facility PUE. SemiAnalysis benchmarks include server-level overhead but not data-center cooling. We have layered PUE on top, but if InferenceMAX already includes some cooling allocation, our denominator could be slightly double-counted (effect: ~5–10 % overstatement of MWh, understatement of the index).
- Colossus 1 PUE is not public. Our 1.30–1.50 range is a reasoned estimate based on facility design and climate; the true value could be outside this range. On-site gas-turbine generation is a separate issue from facility PUE: if one cares about primary energy per token rather than delivered electricity per token (e.g., for sustainability reporting), a generation loss of ~55–60 % means each delivered MWh requires roughly 2.2–2.5 MWh of fuel energy, more than doubling the energy denominator.
- Heterogeneity across sites is glossed over. Real Colossus 1 has a mix of H100, H200, and a small GB200 layer with very different per-watt characteristics; we used a single weighted average. A more sophisticated version would model each accelerator family separately.
- Inference workloads vary enormously in energy per token. Long-context Claude Code requests (50 k+ input tokens with extended thinking) cost super-linearly more energy than short Haiku classification calls. The single index hides a wide internal distribution; per-model breakdowns are a top expansion priority.
- The framework excludes training by design, but Anthropic's reported compute allocation suggests training is still ~35–50 % of total electricity. A future expansion should produce a "total compute" tokens/MWh that includes training-amortized tokens (training tokens / amortization horizon), which is the right number for true sustainability accounting.
- The index is not directly comparable to OpenAI, Google, or Meta without normalizing for model architecture, latency SLAs, and context-window distribution. Peer comparison is a Stage-3 addition, not a launch feature.
- All forward-looking capacity numbers (5 GW AWS, 1 M TPUs, etc.) are commitments, not deployed capacity as of May 2026. The fleet-wide index will move materially as 2026 progresses.
Why the index matters anyway, despite uncertainty: Even at ±50 % precision, tokens/MWh is the right unit-economics metric for AI inference. It is the AI-era analog of "miles per gallon": it lets management track whether the gen-on-gen hardware refresh and software optimization are actually being passed through to a per-token energy improvement (Jensen Huang's claim of 90 % more tokens per GPU year-over-year through software alone should show up here), it lets sustainability teams report a defensible number to investors and regulators, and — most importantly — it lets capacity planners decide whether the next gigawatt should go to more Colossus-class Hopper inference or to Trainium3/Ironwood/Blackwell-class facilities where each MWh delivers 2–3× more output.