THE SWITCH THAT HASN'T BEEN FLIPPED
Why most token budgets are not going to age well
Let’s talk token budgets.
There is a fundamental miscalculation going on in budget conversations right now. The old methods for predicting costs won’t hold up. In this article I highlight a few of those miscalculations and propose a technical fix.
It’s on everyone’s minds right now. On May 14th, Microsoft cancelled thousands of internal Claude Code licenses after engineers used the tool so enthusiastically that the bill hit between $500 and $2,000 per engineer per month. Uber had the same experience at larger scale, burning through a $3.4 billion annual AI budget in four months.
I’ve been in a number of token budget conversations myself lately. Sometimes in the form of cutting back access to users. Other times in the form of laying off some people to afford tokens for others. Neither of these sits well, but cost vs function is a real problem businesses have to solve.
Then one night, while working on a side project, I caught myself doing something that would void all the careful work going into those budget calculations. I was instinctively switching between Opus and Sonnet. Opus for the part of the work that needed real thinking, Sonnet for routine drafting, and honestly I should be using Haiku more for the simple stuff but don’t. This process is nothing unusual. I do it all the time. If you’re a power user of any of these tools with a slight concern for budget, you probably do too.
But that behavior breaks the spreadsheet. Not because the math is wrong, but because the math assumes an average token usage per user and that tokens have a predictable cost. The moment any meaningful percentage of your team switches models based on budget considerations, the projections become a fiction. And if nobody switches at all (if everyone just uses the most powerful model for everything, the way Uber’s engineers did) the bill compounds until someone upstairs notices.
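To see how much the model mix moves the projection, here is a minimal sketch. The prices, token volume, and team size are all placeholders I made up for illustration, not real vendor pricing or anyone's actual usage; the function name `monthly_cost` is likewise hypothetical.

```python
# Illustrative prices in dollars per million tokens. These are placeholders,
# NOT real vendor pricing; substitute your own contract rates.
PRICE_PER_MTOK = {"frontier": 30.0, "mid": 6.0, "small": 1.5}


def monthly_cost(tokens_per_user: float, users: int, mix: dict[str, float]) -> float:
    """Projected monthly spend for a team, given each tier's share of tokens."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("model mix shares must sum to 1")
    # Blended price: each tier's price weighted by its share of the tokens.
    blended = sum(PRICE_PER_MTOK[tier] * share for tier, share in mix.items())
    return users * tokens_per_user / 1_000_000 * blended


# Same team, same token volume, three different switching habits.
USERS, TOKENS = 100, 50_000_000  # hypothetical: 50M tokens per user per month
all_frontier = monthly_cost(TOKENS, USERS, {"frontier": 1.0})
some_switching = monthly_cost(TOKENS, USERS, {"frontier": 0.5, "mid": 0.4, "small": 0.1})
heavy_switching = monthly_cost(TOKENS, USERS, {"frontier": 0.1, "mid": 0.5, "small": 0.4})

for label, cost in [
    ("all frontier", all_frontier),
    ("some switching", some_switching),
    ("heavy switching", heavy_switching),
]:
    print(f"{label:>16}: ${cost:,.0f}/month")
```

With these made-up inputs the same token volume costs anywhere from $33,000 to $150,000 a month depending only on the mix, which is the variable a per-user average hides.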
This struck me as interesting enough that I wanted to think about it. Then it struck me as interesting enough that I wanted to build something to prove a theory. This is a two-part article: the first is about the problem, the second is about what I built to test it.
WHAT I WAS ACTUALLY DOING
There isn’t a name for the manual switching most of us do, but there should be. It’s brokering: triaging tasks across a tiered stack of available compute based on capability, quality requirements, and cost. The choice is instinctive but still guesswork at this point, powered by months of noticing which model succeeds, wariness about costs, and some superstition and feeling about which model will give me the best output.
But this problem becomes interesting when you start crossing system boundaries. Switching between Opus and Sonnet inside Claude is one thing. At scale it matters, but you’re still routing within one provider’s pricing structure. The real economic shift starts when you switch between cloud services or even to a locally running model: going from one provider’s API to another’s, or from a metered cloud call to hardware you already own. That’s where the cost differential becomes structural rather than marginal.
The infrastructure to do this automatically already exists. The economics work. The hardware exists at every scale. But the thing that’s missing is one specific piece of software, an agent that does automatically what I was doing by hand. Call it a Compute Broker.
The Compute Broker. An agent that routes tasks to the most appropriate model based on two simultaneous criteria: the model’s capability to succeed at the specific task, and the current cost of the compute required to run it. It acts as a broker for compute resources, shopping the available stack and making the best match continuously as both task requirements and market prices shift.
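The two criteria in that definition can be sketched in a few lines. This is one possible reading, not a reference design: the `Model` class, the `route` function, the capability scores, and the prices are all hypothetical, and a real broker would have to estimate capability rather than look it up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capability: dict[str, float]  # task type -> estimated success score, 0..1
    cost_per_mtok: float          # current price, dollars per million tokens


def route(task_type: str, min_quality: float, models: list[Model]) -> Model:
    """Return the cheapest model expected to clear the quality bar for this task."""
    capable = [m for m in models if m.capability.get(task_type, 0.0) >= min_quality]
    if not capable:
        # Nothing clears the bar: fall back to the strongest option available.
        return max(models, key=lambda m: m.capability.get(task_type, 0.0))
    return min(capable, key=lambda m: m.cost_per_mtok)


# Hypothetical stack: names, scores, and prices are made up for illustration.
stack = [
    Model("frontier-cloud", {"architecture": 0.95, "drafting": 0.95}, 30.0),
    Model("mid-cloud", {"architecture": 0.75, "drafting": 0.90}, 6.0),
    Model("local", {"architecture": 0.40, "drafting": 0.80}, 0.2),
]

print(route("drafting", 0.85, stack).name)      # mid-cloud
print(route("architecture", 0.90, stack).name)  # frontier-cloud
```

Because `route` reads `cost_per_mtok` at call time, rebuilding the stack with fresh prices before each call is enough to make the match follow the market in this toy version.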
This doesn’t exist as a product yet. All the pieces are sitting on the table waiting for someone to assemble them.
THE COMPUTE SPECTRUM IS ALREADY POPULATED
Most budget conversations assume a stable architecture: cloud APIs in, invoices out. That model is reasonable if cloud is the only place compute lives.
The available compute now runs across a full spectrum, all of it shipping:
Personal node: A laptop or mini PC with a modern NPU runs capable local models around the clock for the cost of electricity. AMD’s Ryzen AI Halo, announced at CES 2026, raises the ceiling considerably: 200B parameter models, 128GB unified memory, LM Studio pre-optimized, landing Q2. This tier just became enterprise-serious.
Office scale: A small shared GPU server for a ten-person team pays for itself in three to five months versus equivalent cloud API spend. The hardware is commodity, the ROI case is documented, and every major server vendor is shipping.
Enterprise edge: Lenovo, Dell, Supermicro, and Cisco are all shipping purpose-built AI inferencing servers. Lenovo alone launched three at CES 2026, from full-sized LLM servers for manufacturing and healthcare to compact edge units for retail. This is a shipping product category with multiple established competitors.
Distributed edge: Span and Nvidia are installing mini data center nodes on residential homes and small businesses via a project called XFRA: 16 Blackwell GPUs per node, drawing on unused electrical capacity already in the building. First deployments late 2026. XFRA is one example of a broader pattern: variable-scale compute is appearing everywhere simultaneously.
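The office-scale payback claim in the list above comes down to simple arithmetic. Here is a sketch with invented inputs: the hardware price, per-person cloud spend, and running cost are assumptions chosen only to show the calculation, and `payback_months` is a hypothetical helper.

```python
import math


def payback_months(
    hardware_cost: float, monthly_cloud_spend: float, monthly_running_cost: float
) -> float:
    """Months until avoided cloud spend covers the up-front hardware cost."""
    monthly_saving = monthly_cloud_spend - monthly_running_cost
    if monthly_saving <= 0:
        return math.inf  # the server never pays for itself
    return hardware_cost / monthly_saving


# Hypothetical ten-person team: $300/month each in API spend, replaced by a
# $12,000 shared server that costs $100/month to run.
months = payback_months(
    hardware_cost=12_000,
    monthly_cloud_spend=10 * 300,
    monthly_running_cost=100,
)
print(f"{months:.1f} months")  # 4.1 months
```

The result is only as good as the assumption that the local server can absorb the whole workload; any share that still goes to the cloud stretches the payback period.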
Looking a little further into 2027, the question of where your organization’s AI workloads run is becoming a brokering question. It will no longer be a vendor question.
The Compute Spectrum, All of It Shipping Now:
Personal nodes: always-on mini PCs and AMD Ryzen AI Halo, one user, cost of electricity.
Office servers: shared GPU inference for small teams, ROI positive in months.
Enterprise edge racks: purpose-built inferencing from Lenovo, Dell, Supermicro, Cisco.
Distributed edge networks: coordinated neighborhood-scale infrastructure.
Hyperscaler cloud: frontier models, high-judgment tasks, planning-grade reasoning.
THE ORG CHART NO ONE HAS DRAWN
A logic is emerging for how work gets allocated across the spectrum, and it maps to something familiar. Better cloud models become the senior layer: architects, lead reviewers, high-judgment decision-makers handling work where being wrong has significant downstream cost. Local and edge models become the execution layer: volume, first passes, the high-frequency low-stakes work that consumes the bulk of most token budgets. Cloud doesn’t get replaced. It gets promoted.
This isn’t just a cost-cutting story. It’s a delegation story. The same way a senior engineer doesn’t write every line of code, a frontier cloud model shouldn’t be answering every routine query. The economics are already compelling. The capability gap that makes delegation risky today is closing fastest on the analytical tasks, which happens to be where the high-volume routing opportunity lives.
The new thing in this picture is the cost-awareness dimension. A real Compute Broker isn’t just matching tasks to capability tiers. It’s watching live pricing across a heterogeneous compute market and routing each task to the most appropriate and cost-effective source available at that moment. Spot pricing logic applied to AI inference.
DOES THIS IDEA HOLD ANY WEIGHT?
A quick note on context before the numbers, because the Compute Broker’s value proposition differs significantly depending on how you’re paying for AI.
Subscription users (Max, Pro, Team) manage allocation, not per-token cost. The switching behavior described above is real, but the savings are measured in headroom and quality, not dollars per call.
API users pay per token, directly. The prototype’s savings numbers apply here literally: less spend per equivalent workload.
Enterprise users are the most interesting case, and the most broken. The individual engineer may see a token count or usage dashboard, but the cost is externalized. It lands on a budget they don’t control, gets reconciled by a finance team they don’t talk to, and by the time anyone with authority notices, the damage is done. Uber’s engineers used Claude Code at 84-95% adoption rates, hitting $500 to $2,000 per engineer per month in API costs, burning through a $3.4 billion annual AI budget in four months. Nobody routed anything differently because nobody had the incentive or authority to act on what they saw. Microsoft cancelled thousands of Claude Code licenses for the same reason. The Compute Broker matters most for the enterprise user precisely because the routing decision is being made by someone without the budgetary context or incentive to make that call.
A technical solution to an organizational problem sounds like the wrong tool. But the Compute Broker doesn’t rely on changing anyone’s incentives. It moves the routing decision out of the engineer’s hands entirely and into the infrastructure layer, where it can be made consistently and automatically regardless of who’s paying attention to the dashboard.
To find out if this idea holds any merit, I built a simplified prototype. Python, mostly. Haiku as the classifier, dispatching tasks across five tiers: Haiku, Sonnet, Opus, a local Qwen 3.6 27B running on my M4 Max via LM Studio, and Kimi K2.6 hosted via OpenRouter. I ran a hundred prompts across analytical work, writing voice, code generation, current technical knowledge, and multi-step reasoning. Every prompt ran through the broker and through each tier as a baseline.
The headline finding: the broker landed at 54.9% savings versus an Opus baseline ($0.90 vs $2.00 across 100 prompts), while delivering quality I judged equivalent to Opus on the test set. Real numbers, not a projection.
The surprise was Kimi K2.6. I added it to the broker expecting a curiosity and it turned out to be a meaningful piece of the option space. On writing voice prompts where local Qwen fell down, Kimi produced output that matched or beat Sonnet at 37% of Sonnet’s per-token cost. The architecture I’d been thinking about as cloud-versus-local turned out to be a four-tier option space: local open-weight for analytical work, hosted open-weight for voice and prose, premium cloud for production code and current knowledge, frontier cloud for genuinely irreversible decisions. Across 100 prompts the broker split as follows: local handled 32%, Kimi 22%, Sonnet 22%, Haiku 21%, and Opus just 3%. All three Opus prompts were genuinely high-stakes distributed systems architecture decisions.
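The four-tier split above can be sketched as a lookup table behind a classifier. This is not the prototype's code. The keyword rules are a stand-in for the LLM classifier so the example runs offline, the category and tier labels are my own, and the "simple" to Haiku default is an assumption, since the split above doesn't say which prompts went to Haiku.

```python
from collections import Counter

# Task category -> tier, following the four-tier split described above.
# The "simple" -> haiku entry is an assumption, not something the article states.
ROUTES = {
    "analytical": "local-qwen",
    "writing_voice": "kimi",
    "production_code": "sonnet",
    "current_knowledge": "sonnet",
    "irreversible_decision": "opus",
    "simple": "haiku",
}

# Hypothetical keyword rules standing in for the LLM classifier.
KEYWORDS = [
    ("irreversible_decision", ("architecture", "migration plan")),
    ("production_code", ("implement", "refactor", "bug")),
    ("current_knowledge", ("latest", "release notes")),
    ("writing_voice", ("rewrite", "tone", "blog")),
    ("analytical", ("analyze", "compare", "summarize")),
]


def classify(prompt: str) -> str:
    """Stand-in classifier: first matching keyword rule wins, else 'simple'."""
    text = prompt.lower()
    for category, words in KEYWORDS:
        if any(word in text for word in words):
            return category
    return "simple"


def dispatch(prompt: str) -> str:
    """Return the tier a prompt would be sent to."""
    return ROUTES[classify(prompt)]


prompts = [
    "Compare these two quarterly reports",
    "Rewrite this intro in a warmer tone",
    "Implement retry logic for the upload client",
    "Choose the database architecture for the payments system",
    "What time zone is UTC+2?",
]
print(Counter(dispatch(p) for p in prompts))
```

The table is the cheap part. Everything interesting lives in `classify`, which in a real broker would be a model call with instructions about what to optimize for.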
The other thing the prototype taught me was that the classifier is the lever, not the broker’s plumbing. Same code, same model lineup, different routing criteria produced wildly different cost outcomes. The art is in what you tell the classifier to optimize for.
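A toy version of that effect, under loudly stated assumptions: two tiers, invented per-call prices, and an invented "stakes" score per task. The only thing that changes between the two runs is the threshold the classifier is told to escalate at.

```python
# Hypothetical price per call for two tiers; not real pricing.
COST_PER_CALL = {"frontier": 0.020, "local": 0.001}

# Hypothetical workload: each task carries a stakes score
# from 0 (trivial) to 1 (irreversible).
STAKES = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9]


def total_cost(stakes: list[float], escalate_at: float) -> float:
    """Cost of the workload when tasks at or above `escalate_at` go to the frontier tier."""
    return sum(
        COST_PER_CALL["frontier" if s >= escalate_at else "local"] for s in stakes
    )


cautious = total_cost(STAKES, escalate_at=0.3)  # 7 frontier calls, 3 local
frugal = total_cost(STAKES, escalate_at=0.8)    # 1 frontier call, 9 local

print(f"cautious: ${cautious:.3f}  frugal: ${frugal:.3f}")
```

Same tasks, same two models, and the bill differs by roughly a factor of five. Whether the frugal setting holds up on quality is a separate question the sketch does not answer.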
There’s more detail than belongs in this article. The full prototype findings, including some genuinely surprising results about which models got which factual questions wrong, are in a companion piece for anyone who wants to geek out on the data.
REMAINING HURDLES
The prototype Compute Broker works as a routing layer. It demonstrated a 54.9% cost reduction across 100 prompts. But as you start getting the benefit of using smarter models to oversee cheaper ones, you introduce two new problems, both of which risk undermining the economics of the Broker for tasks that need oversight.
The first hurdle is the handoff problem. A cloud model overseeing local execution needs to understand what the local model produced without re-reading the entire context. This is where models need to be able to “skim” each other’s output, and it is an area of very active study and development. Weavemind’s Weft (weavemind.ai) is one example: a programming language that compiles to a visual graph an oversight model can inspect directly. Not a transcript. A map. While Weft may not be the final solution, it’s an example of progress in solving this problem.
The second is context organization. There is a lot of progress here too, particularly in the “digital brain” pattern, which structures a project’s decisions and history into queryable knowledge bases so an oversight model queries a graph rather than reloading raw context. Practitioners report 30% to 60% token reductions when deployed correctly (moksoft.com/blog/ai-agents-token-optimization-obsidian-llm-second-brain).
Both are fundamentally “compression problems,” and compression problems have a predictable improvement curve. The prototype broker works now for simple requests and for testing the thesis, but a fully production-ready version will need to be more sophisticated than what I built in a couple of days.
WHERE THIS LEAVES THE BUDGET CONVERSATION
Back to where this started. The careful work going into converting story points to token usage isn’t wrong. It’s just resting on an assumption that’s already shifting underfoot, and shifting fastest in the direction of the practitioners who are figuring out manual switching on their own. The realized cost savings from even a prototype broker are larger than most planning processes account for, and the option space the broker navigates is growing faster than any annual cycle can model.
The useful action isn’t to refine your per-token projections. It’s to build flexibility into the plan, watch what your power users are actually doing, and assume the architecture under the calculation will look meaningfully different by the time the budget year arrives.
For now I’m going to keep building. The prototype keeps me in an interesting problem space. Maybe I’ll get it to a point where it’s useful for my own daily work. But the honest answer is that the real solution needs to come from someone with the infrastructure to deploy it at scale, either from Anthropic or OpenAI building it into their platforms directly, or from one of the agent orchestration teams incorporating cost-aware routing into their suite.
For those who want to geek out on the details of the experiment, stay tuned for part two.
Update — June 4, 2026
Two developments in the week since this article was published are worth noting.
The first is Factory Router, shipped June 1st by Factory.ai. It’s automatic model selection for enterprise coding workflows, routing tasks across frontier and efficient models to cut token spend 20-25% while maintaining 99% of Opus-level performance on their benchmarks. It’s in private research preview today. The gap this article described — the technology exists, the opinionated product does not — just closed, from exactly the direction predicted: an agent orchestration team incorporating cost-aware routing into their suite. Part two of this series covers the prototype I built to test this idea. The short version: it works, at higher savings than Factory Router’s current numbers, because the option space includes local models and hosted open-weight providers that Factory Router doesn’t yet route to.
The second is NVIDIA RTX Spark, announced at Computex 2026 this week. This isn’t a single device. It’s a full platform: a Windows on Arm architecture powered by NVIDIA’s RTX Spark Superchip delivering 1 petaflop of AI compute and 128GB of unified memory, with ASUS, Microsoft, and other OEMs already building on it. NVIDIA committed to at least two additional generations of Spark chips. The personal compute tier described in this article just became a platform war between AMD and NVIDIA, with multi-generational roadmaps from both. The compute spectrum is not a future prediction. It is a current product category.
Both developments in the same week. The switch is being flipped.
