---
title: The Headroom Argument: Why AI Efficiency Means More Compute, Not Less
date: 2026-05-10
description: Efficiency announcements look bearish for AI capex. The opposite is true. Three forces converge on more inference, and Boards should plan accordingly.
author:
  name: Mario Thomas
  email: mario@mariothomas.com
canonical: https://mariothomas.com/blog/headroom-argument-ai-efficiency/
---

Updated AI models arrive almost daily, alongside new architectures and efficiency techniques. The instinctive reading is that this is good news for the AI budget, and that the capex commitments hyperscalers are making will turn out to be over-sized for a market becoming dramatically more efficient. That reading is the wrong way around. This article examines why architectural efficiency releases demand rather than reducing it, what every prior era of computing tells us about where the inference market is heading, and how Boards should read efficiency news to fund the right opportunity rather than the wrong budget.

<!--more-->
{{< image3 src="headroom-argument-ai-efficiency" type="photo" alt="A solitary figure stands at the threshold of a vast, cathedral-scale data centre, dwarfed by towering server columns rising into a column of light at the apex — a visual reframe of the headroom that architectural efficiency creates rather than the brake some readers expect (Image generated by ChatGPT 5.4)" width="735" height="413">}}

{{< audio2 src="mp3/headroom-argument-ai-efficiency.mp3" >}}

On 5 May 2026, Subquadratic, a Miami-based research lab, [launched SubQ](https://subq.ai/introducing-subq): a 12-million-token context window on a sub-quadratic sparse-attention architecture that reportedly cuts attention compute by a factor of roughly **1,000** at full context. Read at face value, the news says inference is about to get orders of magnitude cheaper, that the unprecedented capital expenditure commitments hyperscalers have been making over the last twelve months will turn out to have been over-sized, and that the frontier needs fewer GPUs from here, not more.

The same week tells against that reading. The day after SubQ launched, [Anthropic](https://www.anthropic.com/news/higher-limits-spacex) announced a partnership at SpaceX's Colossus 1 facility adding **more than 300 MW of new capacity and over 220,000 NVIDIA GPUs**, with stated interest in eventually developing multiple gigawatts of orbital AI compute. The demand it is sized for is already on display: Mozilla disclosed that [Anthropic's Claude Mythos Preview](https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/) had identified more than **twelve times as many security vulnerabilities** in Firefox as Claude Opus 4.6 had found earlier in the year. Each new frontier model is hungrier than the last, and the supply commitments are being made now.

Both kinds of news end in the same place. A more efficient architecture at equal or better capability releases applications already wanting to exist; a frontier model orders of magnitude more capable does work earlier models could not. Whichever direction the market takes, the inference requirement goes up. For Boards setting AI strategy in the next twelve to twenty-four months, the question is not whether efficiency progress lets the organisation spend less on AI. It is whether the organisation is positioned to do more with what is now possible. The wrong reading is the more available one, and that is precisely why it is worth setting out plainly.

## The wrong reading

When news arrives that AI is becoming more efficient, or that frontier capability is advancing, the immediate instinct is to ask the cost question. Whether the right response is to shrink the AI budget in light of falling unit costs, or to grow it to fund a more compute-hungry tier, the question treats the news as a budget signal. It is not that the question is unimportant. It is second-order, asked first.

The first-order question is different. It asks what is newly possible, and what the organisation should do to capture it. Efficiency news brings within reach the workloads that cost or latency had previously held back. Capability news opens up work no model could attempt before. Both reshape funding, partnerships, and the timelines an AI strategy has to assume. The cost question, asked first, gets in the way of either.

The data already points to which one is load-bearing. [OpenRouter records](https://openrouter.ai/rankings) weekly token throughput across more than four hundred large language models growing approximately fourfold year-on-year, from around 5 trillion tokens per week in April 2025 to over 20 trillion in April 2026. Per-unit cost movements have not bent that trajectory; if anything, the curve has steepened through every efficiency announcement of the last two years.

Within the [Six Board Concerns](/blog/board-ai-governance-priorities/), this question sits across Strategic Alignment and Financial and Operational Impact. The wrong reading collapses it into the latter and loses the former. Boards scaling back AI ambition now because they read efficiency news as a signal of falling future demand are repeating the category error organisations made when cloud economics lowered compute unit costs. Demand subsequently grew far beyond what the unit-cost curve alone implied.

## The three forces

Three forces explain why demand grows faster than per-unit cost falls.

The first force is that frontier models continue to grow in parameter count, modality coverage, and reasoning depth. Reasoning models in particular consume far more inference per query than their predecessors, because the work happens at the response layer rather than at the prompt. Frontier capability remains tightly coupled to compute, and the coupling has not loosened.

The second force is the source of the news. Sparse attention, sub-quadratic scaling, mixture-of-experts, distillation, quantisation, and speculative decoding are all delivering compute reductions at different points along the inference pipeline. SubQ is the most recent instance of a multi-year research direction, not a stand-alone event: Mamba, RWKV, and DeepSeek Sparse Attention have been working adjacent territory for years, and efficiency progress has become continuous rather than a sequence of isolated breakthroughs.

The third force is the load-bearing one. The shift is not simply from expensive inference to cheaper inference, but from episodic inference to continuous inference. Persistent organisational memory, always-on copilots, autonomous agents, real-time multimodal monitoring, document-scale reasoning, ambient AI, and continuous machine-to-machine activity are not new ideas waiting to be invented. They are workloads that already want to exist, held below the viability line by cost, latency, or context-window constraints. Sparse attention does not create them. It releases them. The architectural shift from episodic to persistent inference, set out in [*The Inference Migration*](/blog/inference-migration/), is the substrate; efficiency progress is the accelerant.

The three forces are sometimes presented as being in tension. They are not. The second multiplies the addressable surface area of the first and third, and the result is a market expanding faster than the per-unit cost is falling. [Goldman Sachs](https://www.goldmansachs.com/what-we-do/investment-banking/insights/articles/powering-the-ai-era) put the same conclusion in plainer language: efficiency progress will not offset the rising need for compute as enterprise applications, cloud services, and agentic AI reach mass adoption.

The directional pattern is already in the data. Hyperscaler capex reached approximately **$800 million per day in 2024**, with cumulative investment expected to exceed **$1 trillion by 2027**. [Deloitte's *TMT Predictions 2026*](https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions.html) records that inference is on track to account for around **two-thirds of all AI computing power in 2026**. Data centre power demand is projected to rise **160% by 2030**, with AI workloads driving the surge. None of these curves bends downward in response to architectural efficiency. They bend upward because architectural efficiency unlocks demand the old price point was suppressing.

## The pattern is not new

Computing has done this before. The history of digital infrastructure is a sequence of efficiency gains absorbed by demand growth that outpaces them, often by orders of magnitude. The error is to equate falling cost-per-unit with falling aggregate demand. AI inference is behaving like every prior compute resource — networking, storage, bandwidth, streaming, mobile data. In each case the unit cost fell, the addressable surface expanded, and total spend grew. Reading the unit-cost curve while ignoring the surface curve is the recurring mistake this article is correcting.

In the mainframe era, each generation delivered cost-per-instruction improvements that the market converted into more workloads, not fewer mainframes. In the PC era, Moore's Law produced cost-per-transistor improvements at a steady pace. The result was more devices in more places, not the same number of devices at lower cost. In the cloud era, hyperscaler economics drove cost-per-CPU-hour down by orders of magnitude. The result was a market for computing that had not previously existed at all, and cloud spend has grown every year through every efficiency gain it has absorbed.

This is the dynamic [William Stanley Jevons](https://en.wikipedia.org/wiki/Jevons_paradox) observed in 1865, when more efficient steam engines increased total coal consumption rather than reducing it, because cheaper power expanded the range of economically viable applications. When the cost of using a resource falls, the economically rational quantity to use rises. In practice, demand often expands faster than the per-unit cost falls, because the addressable applications expand with it.
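The economics behind that sentence can be stated compactly. Under a constant-elasticity demand curve (a standard textbook simplification, not a claim about any specific AI market), total spend rises when unit cost falls whenever the price elasticity of demand exceeds one. A minimal sketch, assuming an illustrative elasticity of 1.2 and borrowing the 1,000-fold cost reduction from the SubQ example above:

```python
# Constant-elasticity demand: quantity Q(p) = k * p**(-eps),
# so total spend S(p) = p * Q(p) = k * p**(1 - eps).
# If eps > 1, cutting the unit price RAISES total spend (Jevons paradox).

def spend_multiplier(price_cut: float, eps: float) -> float:
    """Factor by which total spend changes when the unit price
    falls by `price_cut`x, under elasticity `eps` (assumed, illustrative)."""
    return price_cut ** (eps - 1)

# eps = 1.2 is an assumed elasticity; 1,000x mirrors the reported
# attention-compute reduction. Spend roughly quadruples despite
# each unit being a thousand times cheaper.
print(round(spend_multiplier(1000, 1.2), 2))  # ~3.98
# With inelastic demand (eps < 1), spend would instead fall:
print(round(spend_multiplier(1000, 0.8), 2))  # ~0.25
```

The sketch is not a forecast; it simply shows that the direction of aggregate spend hinges on elasticity, which is the quantity the budget-signal reading implicitly assumes to be below one.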

One reading of efficiency progress treats it as a commoditisation story. On this view, the hyperscalers are overbuilding, frontier capability is being commoditised, and the centre of gravity is shifting to lower-cost producers. The economics do not support that reading. Efficiency progress diffuses across the ecosystem, but frontier capability remains a separate investment thesis with its own demand curve. Both are needed. The [AI Sovereignty Trilemma](/blog/ai-sovereignty-board-trilemma/) frames why: economical compute, sovereign control, and frontier capability are three poles, not three labels for the same thing.

The point is not that compute will become a problem. The point is that the curve of demand is steeper than the curve of efficiency, and Boards should plan accordingly.

## The Board reading

Cost questions belong with the CFO and the procurement function. Capability questions belong on the Board agenda. For a Board with an AI strategy in market, news of either kind is a prompt to revisit the [AI Stages of Adoption](/toolkit/ai-stages-of-adoption/) stage the organisation is operating in. Capability that was prohibitive at Adopting may now be economic at Optimising. Capability that previously required a Transforming-stage commitment may be available without one. The risk runs the other way too: the surface area of newly viable applications widens faster than the procurement function can absorb unless governance keeps pace.

The capability question becomes more pressing as efficiency lowers the cost of probabilistic decision-making. [*The Reasoning Gap*](/blog/the-reasoning-gap/) set out the bind: the Data (Use and Access) Act 2025's four safeguards apply uniformly to systems whose ability to deliver them is anything but uniform. Efficiency progress widens the bind, because cheaper inference accelerates the deployment of probabilistic systems into the regulated decision space. The watchlist of newly viable applications is therefore also a watchlist of newly regulated decision logic. [Minimum Lovable Governance](/toolkit/minimum-lovable-governance/) answers the design question — explainability engineered at the architectural layer, not bolted on afterwards.

For UK and EU Boards facing energy costs at structural disadvantage to US peers, this is also the mechanism by which the AI Sovereignty Trilemma's economical-compute pole becomes more achievable. Efficiency progress is a strategic gain, not a procurement saving. The capex that suppliers are committing to is sized for demand they can already see, and that the rest of the market cannot yet. Treating efficiency news as licence to reduce capacity commitments, vendor relationships, or AI investment is the response that delivers under-resourcing rather than discipline.

The Boards treating the news as a budget signal will under-resource the opportunity. The Boards treating it as a capability signal will fund what their competitors have not yet realised is possible. That differential opens now, while the news is fresh and the strategic implications are still being worked through. The practical Board action follows directly from the diagnostic question: Boards should ask management to bring a revised watchlist of newly viable applications, and the governance capability needed to support them, to the next Board meeting.

## Headroom, not brake

Architectural efficiency announcements are not the brake on AI compute demand. They are the headroom that lets demand grow into the capex commitments already being made, and the catalyst for applications that were already wanting to exist. The investor reading and the director reading converge here: a capability-expansion event, not a cost-reduction event.

There will be more efficiency announcements over the next twelve months. There will be more capacity announcements alongside them. Both are signals of the same underlying trajectory, and they should be read together. Some Boards will read efficiency news as cost news and miss the implication. Others will read it as capability news and act on it.

The compute is going to be needed. The only question is who has already thought through what it is for.

{{< campaign "headroom-argument-ai-efficiency" "hello@mariothomas.com" "Hello" "Let's Continue the Conversation" "Thank you for reading about why architectural efficiency expands AI compute demand rather than reducing it. I'd welcome hearing about how your Board is reading the news — whether you are revisiting the AI Stages of Adoption stage your organisation is operating in as efficiency lowers prior viability barriers, building a watchlist of newly viable applications alongside the governance capability needed to support them, or sizing capacity commitments and vendor relationships against demand that keeps outpacing the unit-cost curve." "Thank you for submitting your details, here's what you provided:" "Click send to share your input with me." >}}