Enterprise AI teams hit the token cost wall
AI coding tools have a token spending problem now
Enterprise AI spending has reached the boring, painful phase: finance wants receipts.
Uber reportedly burned through its entire 2026 AI coding budget by April. Microsoft pulled Claude Code licenses from developers only months after giving them access. A Priceline employee told TechCrunch that a routine Cursor renewal came back 4 to 5 times higher than expected.
That pattern is showing up across large companies. Per-token prices have fallen, but usage has grown faster. Better coding agents, longer context windows, background execution, auto-retries, tool calls, model routing, and “use the best model” mandates have turned token consumption into a serious line item.
The Linux Foundation is stepping into the mess with plans for the Tokenomics Foundation, a standards effort aimed at bringing FinOps-style discipline to AI token usage and billing. The group plans a formal launch in July and is expected to announce more members around FinOps X next week.
The timing fits. Companies spent 2025 pushing employees to adopt AI everywhere. In 2026, many are finding they don’t know who spent what, which models were used, whether the work was worth it, or whether the bill is accurate.
The AI bill has moved past model pricing
A year ago, a lot of AI procurement talk centered on unit prices: input tokens, output tokens, context caching, batch discounts, and the usual vendor spreadsheet games. Those still matter. They’re no longer the main problem.
Consumption behavior is.
Modern coding agents inspect repositories, read issue threads, generate plans, call tools, run tests, parse failures, rewrite patches, summarize logs, and sometimes loop until someone stops them. Each step burns tokens. Some steps call frontier models. Others route to cheaper models. A single “fix this bug” request can turn into a long chain of model calls, file reads, embeddings, retrieval, and generated output.
That’s why cheaper tokens haven’t necessarily produced cheaper AI operations. The price per token fell, but the number of tokens consumed per task grew faster.
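A back-of-the-envelope sketch of that effect, in Python. Every number here is a made-up assumption for illustration (the prices, step counts, and context sizes do not come from any vendor or from the reporting above); the point is only that total cost is price times volume, and volume can outrun a price cut.

```python
def request_cost(steps: int, tokens_per_step: int, price_per_million: float) -> float:
    """Dollar cost of one user request that triggers `steps` model calls."""
    return steps * tokens_per_step * price_per_million / 1_000_000


# Hypothetical "before": one chat-style completion at a higher unit price.
before = request_cost(steps=1, tokens_per_step=4_000, price_per_million=10.0)

# Hypothetical "after": the unit price is halved, but an agent makes 25 calls,
# each carrying a much larger context.
after = request_cost(steps=25, tokens_per_step=20_000, price_per_million=5.0)

print(f"before: ${before:.2f}  after: ${after:.2f}")
```

Under these assumed inputs the per-token price drops by half while the cost of a single request rises from cents to dollars.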
Alexander Embiricos, OpenAI’s head of enterprise, told TechCrunch that customer conversations have shifted away from “is the model good enough?” toward visibility, auditability, token controls, and model efficiency. That matches what many engineering leaders are seeing internally. The proof-of-concept phase was about capability. Production is about unit economics.
The cost curve steepened after late-2025 model releases such as Anthropic’s Claude Opus 4.5, OpenAI’s GPT-5.1, and Google’s Gemini 3 Pro improved agentic workflows. Better agents get used more. They also do more work per request, which can be useful, wasteful, or both.
One company reportedly ended up with a $500 million Claude bill after failing to set employee usage limits. That’s an outlier, but the mechanics are familiar: enthusiasm from leadership, loose access controls, weak team-level accounting, and no hard stop until the invoice lands.
ROI is still hard to prove
The spending would be easier to defend if the productivity story were clean. It isn’t.
Faros AI released a two-year study of 20,000 developers in April showing that output rose with AI use, but bugs and rewrites rose too. Jellyfish found that engineers who used the most tokens were about twice as productive as lighter AI users, while consuming 10 times as many tokens.
That’s an awkward trade. Twice the apparent productivity for 10 times the token spend might be a bargain if the work ships revenue-generating features faster. It may be a bad deal if the extra output creates review load, test failures, brittle code, or rework two sprints later.
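The arithmetic behind that trade can be written out. This is a simplified sketch that assumes output value scales linearly with measured productivity and ignores review and rework costs; the dollar figures are hypothetical, and only the 2x and 10x multipliers come from the Jellyfish finding above.

```python
def relative_cost_per_unit(output_multiplier: float, token_multiplier: float) -> float:
    """Token spend per unit of output, relative to a lighter-usage baseline of 1.0."""
    return token_multiplier / output_multiplier


def extra_spend_pays_off(
    value_of_baseline_output: float,
    baseline_token_cost: float,
    output_multiplier: float = 2.0,
    token_multiplier: float = 10.0,
) -> bool:
    """True when the added output is worth more than the added token spend."""
    extra_value = (output_multiplier - 1) * value_of_baseline_output
    extra_cost = (token_multiplier - 1) * baseline_token_cost
    return extra_value > extra_cost


print(relative_cost_per_unit(2.0, 10.0))       # 5.0
# Hypothetical monthly figures per engineer, in dollars.
print(extra_spend_pays_off(10_000, 200))       # True: 10,000 extra value vs 1,800 extra spend
print(extra_spend_pays_off(10_000, 2_000))     # False: 10,000 extra value vs 18,000 extra spend
```

Each unit of output costs five times as many tokens, so the heavy user only comes out ahead when the baseline output is worth more than nine times the baseline token bill.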
Nicholas Arcolano, head of research at Jellyfish, put the hard part plainly: whether extreme spend pays off depends on the business value of shipped code, and most companies still can’t measure that.
Many AI dashboards are too shallow for this. Token counts by user are useful, but they don’t answer the questions technical leaders actually care about:
- Did this AI-generated code reduce cycle time or push work into review?
- Did defect rates change after heavy agent adoption?
- Are senior engineers using AI for high-value acceleration, or are junior engineers generating noisy diffs?
- Which repositories, services, and teams have the highest cost per merged change?
- Are expensive models handling tasks smaller models could do?
- Are agents stuck in retry loops because tests, permissions, or tool integrations are broken?
A raw token ledger is the start of the audit trail. It’s not the answer.
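One of the questions above, cost per merged change by repository, can be sketched as a join between token spend and merge counts. The repository names, costs, and counts below are invented for illustration; a real version would pull them from gateway logs and the Git host.

```python
from collections import defaultdict

# Hypothetical per-call spend records: (repository, cost in dollars).
usage_events = [
    ("payments-api", 6.0),
    ("payments-api", 2.0),
    ("web-frontend", 5.0),
    ("infra-scripts", 3.0),
]

# Hypothetical merged-change counts for the same period.
merged_changes = {"payments-api": 4, "web-frontend": 10}


def cost_per_merged_change(events, merged):
    """Total spend per repository divided by its merged changes.

    Repositories with spend but no merged changes map to None so they
    stand out instead of raising a division error.
    """
    spend = defaultdict(float)
    for repo, cost in events:
        spend[repo] += cost
    return {
        repo: (total / merged[repo] if merged.get(repo) else None)
        for repo, total in spend.items()
    }


report = cost_per_merged_change(usage_events, merged_changes)
print(report)
```

The `None` entries are often the interesting ones: spend that never turned into a merged change.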
Token accounting is nastier than cloud accounting
FinOps gave companies a working language for cloud waste: idle instances, overprovisioned databases, untagged resources, storage tiering, reserved capacity, chargebacks. AI spend looks similar from far away, but the telemetry problem is different.
J.R. Storment, executive director of the FinOps Foundation, told TechCrunch that cloud cost tracking is a hundreds-of-millions-of-rows-a-month data problem. Token cost tracking can be a trillions-of-rows-a-month problem.
That scale claim holds up. AI usage produces event streams at a much finer grain:
- prompt tokens
- completion tokens
- cached tokens
- tool-call traces
- model selection decisions
- embeddings and retrieval calls
- agent steps
- retries
- latency and queueing
- GPU or inference provider metadata
- user, team, repository, project, and application tags
Then teams have to normalize it across vendors that don’t expose the same fields or calculate usage the same way. One provider may report cached context differently from another. A model router may accept a request for a premium model but send parts of the work to cheaper models. Enterprise contracts may include volume discounts, credits, commitments, or bundled usage that obscure the real marginal cost.
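A minimal sketch of what that normalization layer might look like. The two payload shapes and their field names are hypothetical stand-ins, not any real provider's API; the assumption being illustrated is that one vendor counts cached tokens inside its input total while another reports them separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Common internal shape for one model call, regardless of vendor."""
    vendor: str
    model: str
    team: str
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def from_vendor_a(raw: dict, team: str) -> UsageRecord:
    # Assumption: vendor A includes cached tokens in its prompt total.
    cached = raw.get("cached_prompt_tokens", 0)
    return UsageRecord(
        vendor="vendor_a",
        model=raw["model"],
        team=team,
        uncached_input_tokens=raw["prompt_tokens"] - cached,
        cached_input_tokens=cached,
        output_tokens=raw["completion_tokens"],
    )


def from_vendor_b(raw: dict, team: str) -> UsageRecord:
    # Assumption: vendor B reports cached reads separately from fresh input.
    return UsageRecord(
        vendor="vendor_b",
        model=raw["model_id"],
        team=team,
        uncached_input_tokens=raw["input"],
        cached_input_tokens=raw.get("cache_read", 0),
        output_tokens=raw["output"],
    )


a = from_vendor_a(
    {"model": "model-x", "prompt_tokens": 1000,
     "cached_prompt_tokens": 400, "completion_tokens": 200},
    team="search",
)
b = from_vendor_b(
    {"model_id": "model-y", "input": 600, "cache_read": 400, "output": 200},
    team="search",
)
print(a.uncached_input_tokens == b.uncached_input_tokens)  # True
```

Summing the raw input fields would make the first call look 400 tokens larger than the second, even though both did the same amount of fresh and cached work.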
Priceline is already seeing discrepancies between vendor-reported usage and its own internal data, according to Chris Reed, the company’s senior director of IT finance. Reed compared the situation to telecom expense management and cloud billing, two areas where errors and optimization opportunities became entire industries.
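A reconciliation check of this kind can start very small. The sketch below is not Priceline's method; the team names, totals, and the 2% tolerance are assumptions chosen for illustration.

```python
def reconcile(vendor_tokens: dict, internal_tokens: dict, tolerance: float = 0.02) -> dict:
    """Return entries whose billed and observed totals differ by more than
    `tolerance`, measured relative to the larger of the two figures."""
    discrepancies = {}
    for key in sorted(set(vendor_tokens) | set(internal_tokens)):
        billed = vendor_tokens.get(key, 0)
        observed = internal_tokens.get(key, 0)
        baseline = max(billed, observed)
        if baseline and abs(billed - observed) / baseline > tolerance:
            discrepancies[key] = {"billed": billed, "observed": observed}
    return discrepancies


# Hypothetical monthly token totals per team.
vendor_invoice = {"team-a": 1_000_000, "team-b": 500_000, "team-c": 80_000}
internal_logs = {"team-a": 995_000, "team-b": 400_000}

flagged = reconcile(vendor_invoice, internal_logs)
print(flagged)
```

Here `team-a` falls within tolerance, `team-b` is flagged for a 20% gap, and `team-c` is flagged because the vendor billed usage that internal telemetry never saw.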
He’s right. New metering systems create billing disputes. AI adds another wrinkle: the thing being metered is abstract. A token is a billing unit, a computational proxy, and a product metric, but it isn’t equivalent across tokenizers, model families, modalities, or tasks.
A thousand tokens in one model do not necessarily buy the same capability, latency, accuracy, or output quality as a thousand tokens in another.
The tool market is already crowded
Once a cost category hurts enough, vendors show up.
Pure-play startups such as Pay-i are trying to help companies track, measure, and optimize generative AI investments. Paid is pushing results-based billing for AI agents, allowing developers to charge users based on actual value rather than a flat subscription.
Engineering management platforms such as Jellyfish, Waydev, and Faros AI are adding AI agent monitoring so companies can connect spend to developer productivity signals. That’s a sensible wedge. Engineering teams are among the heaviest users of AI coding tools, and their work already leaves traces in Git, CI, ticketing systems, incident tools, and code review.
The larger observability and finance platforms want this budget too. Ramp has moved into AI spend management. Datadog and New Relic are adding token-level observability, AI cost monitoring, GPU visibility, and cloud cost features. AWS is expected to introduce new financial management capabilities for enterprise AI spending at FinOps X.
The risk for buyers is dashboard sprawl. Every provider can show a spend chart. Fewer can connect token events to an operational decision: routing a class of requests to a smaller model, cutting off runaway agents, enforcing per-team budgets, or flagging prompts that expose sensitive context.
The strongest tools will sit close to execution, inside the application layer, agent framework, gateway, inference proxy, or developer platform. That’s where routing, caching, context trimming, rate limits, and policy enforcement can happen before the bill grows.
Model routing helps, with sharp edges
One obvious response to token bloat is model routing. Send simple tasks to cheap models. Reserve frontier models for hard reasoning, large refactors, ambiguous debugging, or high-risk customer-facing output.
Factory, an enterprise AI agent startup, recently launched a router that automatically chooses the right model for each task. Vitaly Gordon, CEO of Faros AI, expects frontier labs and model providers to adopt OpenRouter-style optimization more broadly, steering queries toward cheaper models where possible. He said enterprise Claude bills already show spend split across models such as Opus, Sonnet, and Haiku even when users call the premium model.
That can save money. It can also make debugging harder.
If an agent produces a bad patch, an unsafe answer, or a hallucinated citation, teams need to know which model handled which step, with what prompt, context, and tool output. Routing decisions have to be logged. Evaluation data needs to follow the route. Security reviews get more complicated if sensitive code, logs, or customer data may pass through multiple models or inference providers.
There’s also a quality trap. Automated routing tends to optimize for cost and benchmark confidence unless teams define richer policies. A cheaper model may handle 90% of requests well and quietly degrade the 10% that matter most. In coding workflows, that degradation can surface later as subtle design flaws, flaky tests, or maintainability debt.
Routing is useful. Blind routing is a liability.
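What logged, policy-driven routing might look like in its simplest form: a rule-based router that records every decision with its reason. The task types, the escalation set, and the model names `small-model` and `frontier-model` are placeholders, and this is not a description of how Factory's or any other vendor's router works.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("model_router")


@dataclass(frozen=True)
class RouteDecision:
    request_id: str
    task_type: str
    model: str
    reason: str


# Hypothetical policy: task types that always go to the larger model.
ESCALATE = {"large_refactor", "ambiguous_debugging", "customer_facing"}


def route(request_id: str, task_type: str, high_risk: bool = False) -> RouteDecision:
    """Pick a model by explicit rule and log the decision as one JSON line."""
    if high_risk or task_type in ESCALATE:
        decision = RouteDecision(request_id, task_type, "frontier-model",
                                 "escalation rule matched")
    else:
        decision = RouteDecision(request_id, task_type, "small-model",
                                 "default for routine work")
    # The log line lets a later trace show which model handled which step.
    logger.info(json.dumps(asdict(decision)))
    return decision


logging.basicConfig(level=logging.INFO)
print(route("req-1", "rename_variable").model)
print(route("req-2", "large_refactor").model)
```

The design choice is that the reason travels with the decision, so a bad patch can be traced to the rule that chose the model rather than to an opaque cost optimizer.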
Why the Tokenomics Foundation matters
The Tokenomics Foundation is trying to define the shared vocabulary this market currently lacks. Its planned work includes standards, specifications, and metrics for AI token usage and billing, plus concepts such as cost-per-intelligence, tokens-per-watt, token factory effectiveness, and consumption efficiency.
Some of those terms sound early and a bit awkward. That’s normal for standards work. FinOps had to turn messy cloud billing into common categories before enterprises could compare tooling, enforce governance, and negotiate contracts intelligently.
AI needs a similar treatment, but the abstractions need scrutiny. Bad standards can freeze bad metrics into procurement checklists. “Cost-per-intelligence,” for example, only means much if the underlying evaluation is tied to task quality, reliability, latency, and business outcome. Otherwise it becomes another vendor-friendly metric that looks precise and says little.
Common definitions would still help. Senior engineers and platform teams need consistent answers to basic questions:
- What counts as input, output, cached, retrieved, and tool-generated tokens?
- How should multi-model agent traces be attributed?
- How do you report usage when a provider routes behind the scenes?
- How should teams compare spend across text, code, image, audio, and multimodal workflows?
- What metadata should be required for audit, privacy, and chargeback?
- How do you measure energy efficiency for inference without hand-wavy estimates?
Without standards, every enterprise ends up building a brittle internal translation layer for vendor bills, gateway logs, and application traces. That’s expensive, and it favors the largest buyers.
What technical leaders should do now
Treat AI usage like production infrastructure, not a perk.
That means setting budgets and limits, but also instrumenting the paths where tokens are created: IDEs, coding agents, internal tools, gateways, application backends, retrieval systems, and model routers. Spend controls need to sit near the work, not only in a monthly finance report.
Useful controls are practical and boring:
- Require team, project, repository, and environment tags.
- Track cost per merged change, ticket, workflow, or customer-facing task where possible.
- Log model routes, prompts, tool calls, retries, and failure modes.
- Set hard limits for runaway agents and background jobs.
- Use smaller models by default for routine work, with clear escalation rules.
- Review high-token users and workflows for quality, not just volume.
- Compare vendor bills against internal telemetry.
- Tie AI spend reviews to engineering outcomes, not adoption targets.
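The hard-limit item in the list above is the easiest to sketch. This is one possible shape, assuming the agent loop can report tokens per step; the class names, the limits, and the 3,000-token step size are all hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run goes past its token or step limit."""


class TokenBudget:
    def __init__(self, max_tokens: int, max_steps: int):
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one agent step and stop the run if a limit is crossed."""
        self.steps += 1
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")


budget = TokenBudget(max_tokens=10_000, max_steps=20)
stopped_reason = None
try:
    while True:                 # stand-in for an agent that never decides it is done
        budget.charge(3_000)    # hypothetical tokens consumed by each step
except BudgetExceeded as exc:
    stopped_reason = str(exc)

print(budget.steps, budget.tokens_used, stopped_reason)
```

The check runs after each step, so a run can overshoot the limit by at most one step's worth of tokens. A stricter version would estimate the next call's size and refuse it up front.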
The companies that get this right won’t be the ones with the prettiest token dashboards. They’ll be the ones that can say which AI workflows are worth paying for, which ones are waste, and where the next dollar should go.