LLM June 8, 2026

Enterprise AI teams confront the real cost of token-heavy workflows

AI spending has hit the messy part: nobody knows what a token is worth

Enterprise AI adoption has moved from “try every tool” to “explain this invoice.”

That change is showing up in budgets. Uber reportedly burned through its full 2026 AI coding budget by April. Microsoft pulled Claude Code licenses from developers only months after rolling them out. A Priceline employee told TechCrunch that a normal Cursor renewal came back 4x to 5x higher than before.

Model pricing has generally fallen on a per-token basis. The bills still went up.

That’s the math behind the current AI budget panic. Cheaper tokens don’t help much when developers, agents, coding assistants, copilots, chatbots, internal search tools, customer support workflows, and automated analysis jobs are consuming far more of them. The industry spent the past year pushing employees toward frontier models and agentic systems. Finance teams are now asking basic questions that many engineering orgs can’t answer cleanly:

  • Which teams are spending the most?
  • Which workflows justify the cost?
  • Which vendors are overbilling or underreporting?
  • Which models are being used when nobody is watching?

The Linux Foundation plans to launch the Tokenomics Foundation, a standards effort meant to bring some order to AI usage accounting, billing definitions, and efficiency metrics. It’s a FinOps-style push for model consumption, but the unit of measure is harder to reason about than CPU hours, storage, or network egress.

Token spend is becoming an operational discipline.

The token problem is bigger than pricing

A token is a billing unit, a modeling primitive, and an awkward business metric. It can represent a word fragment, punctuation mark, code chunk, or part of a structured prompt. Different models tokenize text differently. Different vendors expose usage differently. Some charge separately for input, output, cached input, tool calls, reasoning tokens, image processing, retrieval, fine-tuning, or context window usage.

Simple comparisons get ugly fast.

One team might see lower per-token pricing from a provider, then spend more overall because the model needs longer prompts, produces verbose outputs, or requires retries. Another team may pay more per token for a stronger model but finish tasks in fewer calls. Agentic workflows make this harder because one user request can trigger dozens of model calls, tool invocations, file reads, code edits, test runs, and follow-up prompts.

The bill now reflects a chain of decisions, not one prompt.
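To make that concrete, here is a minimal sketch that compares two models by expected cost per completed task instead of price per token. Every number in it (prices, token counts, calls per attempt, success rates) is invented for illustration and does not come from any vendor's price list; the model names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model numbers; none come from a real price list."""
    name: str
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens
    input_tokens_per_call: int
    output_tokens_per_call: int
    calls_per_attempt: int        # model calls made for one attempt at a task
    success_rate: float           # chance that one attempt completes the task


def cost_per_call(m: ModelProfile) -> float:
    """Cost of a single model call in USD."""
    return (m.input_tokens_per_call * m.input_price_per_mtok
            + m.output_tokens_per_call * m.output_price_per_mtok) / 1_000_000


def expected_cost_per_task(m: ModelProfile) -> float:
    """Expected USD per completed task, assuming independent retries."""
    expected_attempts = 1.0 / m.success_rate
    return cost_per_call(m) * m.calls_per_attempt * expected_attempts


# Cheaper per token, but longer prompts, more calls, and more retries.
cheap = ModelProfile("small-model", 1.0, 4.0, 20_000, 2_000, 12, 0.25)
# Three to four times the per-token price, but fewer calls and fewer retries.
strong = ModelProfile("frontier-model", 3.0, 15.0, 10_000, 1_000, 6, 0.8)

for m in (cheap, strong):
    print(f"{m.name}: ${expected_cost_per_task(m):.4f} per completed task")
```

With these made-up inputs, the small model lands at roughly $1.34 per completed task and the frontier model at roughly $0.34. The point is the shape of the calculation, not the specific figures: the per-token price is only one of several multipliers.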

OpenAI enterprise chief Alexander Embiricos told TechCrunch that customer conversations have changed sharply. Six months ago, he said, buyers mostly asked whether the models were good enough. Now they ask about visibility, auditability, token controls, and model efficiency.

That matches what many engineering leaders are seeing internally. Early 2025 was the all-you-can-eat phase. Flat subscriptions made AI coding tools feel almost free at the point of use. Developers learned to keep assistants open all day, ask for rewrites, generate tests, summarize diffs, inspect logs, and hand over larger chunks of work. Then contracts renewed. Usage caps appeared. Pricing moved closer to actual consumption.

The spreadsheet caught up.

Agents made the bill harder to predict

The newest frontier models have made agentic software more useful. Claude Opus 4.5, GPT-5.1, and Gemini 3 Pro reportedly improved enough on long-horizon tasks that companies became more comfortable wiring them into developer workflows.

That’s where consumption can run away.

A coding assistant that answers one question is easy to budget. An agent that scans a repo, builds a plan, edits files, runs tests, reads failures, rewrites code, opens a pull request, and responds to review comments has a very different cost profile. The user sees one task. The system sees many calls.

Those calls often include large prompts: repository context, dependency files, logs, stack traces, API docs, style guides, policy constraints, and previous conversation state. Long context windows make that convenient, but convenience costs money. Large context can also hide sloppy engineering. Passing half the repo into a model because retrieval and summarization are hard may work, but it’s an expensive default.
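One alternative to passing everything is to rank candidate snippets and stop at a token budget. The sketch below assumes relevance scores already exist (for example from a retrieval step) and uses a crude four-characters-per-token estimate; both are assumptions, and real tokenizers differ by model. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumes about 4 characters per token,
    which is only a rule of thumb; use the model's tokenizer in practice."""
    return max(1, len(text) // 4)


def select_context(snippets, token_budget):
    """Greedy context pruning.

    snippets: list of (relevance_score, text) pairs.
    Takes the most relevant snippets first and skips any that would
    push the running total over token_budget.
    Returns (chosen_texts, tokens_used).
    """
    chosen, used = [], 0
    for _score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            chosen.append(text)
            used += cost
    return chosen, used


candidates = [
    (0.9, "a" * 400),    # failing test output, highly relevant
    (0.5, "b" * 4000),   # a large dependency file, marginally relevant
    (0.7, "c" * 800),    # the function under edit
]
context, tokens = select_context(candidates, token_budget=350)
print(len(context), "snippets,", tokens, "estimated tokens")
```

A greedy pass like this is not optimal packing, but it is cheap to run and makes the budget an explicit parameter instead of an accident of whatever fit in the context window.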

There’s a cost-quality trade-off too. Routing everything through a cheaper small model may reduce cost but produce worse plans, more retries, and more human cleanup. Sending everything to the strongest model may reduce failed attempts but torch the budget. The practical answer is usually a mix of model routing, caching, prompt compression, context pruning, and workflow-specific evaluation.

Most companies don’t have that machinery yet.

Productivity numbers are promising, but muddy

The spending would be easier to defend if the productivity gains were obvious and cleanly measurable. They aren’t.

In April, Faros AI released a two-year study of 20,000 developers that found developer output was rising, but so were bugs and rewrites. Jellyfish found that the engineers using the most tokens were about twice as productive as lighter AI users while consuming 10x the tokens.

That’s a measurement problem.

If an engineer spends $40,000 on tokens in a month and ships work that directly generates revenue, saves an enterprise contract, or removes weeks of platform toil, the bill may be justified. If the same spend produces brittle code, noisy pull requests, and review burden for senior engineers, the productivity claim falls apart.

Nicholas Arcolano, head of research at Jellyfish, put it plainly to TechCrunch: whether extreme spend pays off depends on the business value of shipped code, which most companies still can’t measure.

That’s the core failure. Engineering orgs can count commits, pull requests, story points, deploys, incidents, review cycles, and now tokens. None of those directly equals value. AI widens the gap because it can increase activity while also increasing waste.

For technical leaders, banning expensive tools outright is usually the wrong move. Treating token burn as proof of innovation is just as bad. Teams need evaluation loops that connect AI usage to outcomes: cycle time, escaped defects, production incidents, support volume, customer-facing delivery, and revenue-linked work where possible.

Otherwise, token accounting becomes theater.

Token FinOps is harder than cloud FinOps

Cloud cost management was already painful. AI cost management may be worse.

J.R. Storment, executive director of the FinOps Foundation under the Linux Foundation, told TechCrunch that tracking cloud costs is a hundreds-of-millions-of-rows-a-month data problem. Token costs, he said, can become a trillions-of-rows-a-month data problem.

That sounds dramatic until you look at the event stream. Each prompt, completion, tool call, embedding request, retrieval step, cache hit, model fallback, agent action, and retry can produce cost data. Multiply that by thousands of employees and automated workflows. Then add multiple vendors with different schemas and pricing terms.

A useful AI cost system needs to capture at least:

  • Model name and version
  • Input, output, cached, and reasoning token counts where available
  • User, team, project, environment, and cost center
  • Application or agent name
  • Tool calls and retrieval operations
  • Latency, error rate, retries, and fallback behavior
  • Prompt and response metadata, with careful redaction
  • Business or engineering outcome signals

That last point matters. Cost data without outcome data leads to crude throttling.
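As a rough sketch, the fields above could map onto a single usage-event record like the one below. The field names, the outcome labels, and the two helper functions are hypothetical; no vendor or standard defines this schema, and prompt or response metadata is left out here because it needs its own redaction handling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEvent:
    """One model call. Field names are illustrative, not a standard."""
    model: str
    model_version: str
    team: str
    project: str
    cost_center: str
    app_or_agent: str
    environment: str = "prod"
    user: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    reasoning_tokens: Optional[int] = None  # not every vendor reports this
    tool_calls: int = 0
    retries: int = 0
    used_fallback: bool = False
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    outcome: Optional[str] = None  # e.g. "pr_merged"; None means unknown


def spend_by(events, key: str) -> dict:
    """Total cost_usd grouped by any attribute, such as 'team' or 'model'."""
    totals = defaultdict(float)
    for e in events:
        totals[getattr(e, key)] += e.cost_usd
    return dict(totals)


def spend_without_outcome(events) -> float:
    """Spend that cannot yet be tied to any outcome signal."""
    return sum(e.cost_usd for e in events if e.outcome is None)


base = dict(model="frontier-model", model_version="1", project="checkout",
            cost_center="cc-100", app_or_agent="review-bot")
events = [
    UsageEvent(team="payments", cost_usd=0.5, outcome="pr_merged", **base),
    UsageEvent(team="payments", cost_usd=0.25, **base),
    UsageEvent(team="search", cost_usd=1.0, **base),
]
print(spend_by(events, "team"))
print(spend_without_outcome(events))
```

Keeping the outcome on the same record as the cost is the design choice that matters: it lets a report separate spend that led somewhere from spend that is simply unattributed, instead of ranking teams by raw totals.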

Security and privacy add another layer. Logging prompts for auditability can expose source code, secrets, customer data, legal documents, credentials, or regulated data. A serious implementation needs redaction, retention policies, access controls, and sampling. It also needs a clear answer to whether prompts and completions are stored, hashed, summarized, or dropped.

A sloppy token observability rollout can become a data exposure incident.
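A minimal sketch of the "hashed or dropped" option follows: by default it logs only a hash and a length, and stores redacted text only when explicitly asked. The regular expressions are illustrative and far from complete; a real deployment would need a vetted secret scanner and a review of whether hashing low-entropy prompts is safe enough. The function names are hypothetical.

```python
import hashlib
import re

# Illustrative patterns only. They will miss many secrets and are not a
# substitute for a dedicated secret-scanning tool.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                                  # AWS access key ID shape
    re.compile(r"(?i)\b(api[_-]?key|password|token)\s*[:=]\s*\S+"),   # key = value pairs
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                          # email addresses
]


def redact(text: str) -> str:
    """Replace anything matching a known pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def log_record(prompt: str, store_text: bool = False) -> dict:
    """Build an audit record. The raw prompt is never stored."""
    record = {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }
    if store_text:
        record["prompt_redacted"] = redact(prompt)
    return record


sample = "password = hunter2 and mail bob@example.com"
print(redact(sample))
print(sorted(log_record(sample)))
```

Defaulting to "hash only" means a misconfigured logger fails toward storing less, which is the safer direction for source code and customer data.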

A new vendor category is forming fast

The market is responding in the usual way: startups, incumbents, and platform vendors are all adding AI cost controls.

Pure-play companies such as Pay-i are focused on measuring and optimizing generative AI cost and performance. Paid is pushing usage- and value-based billing for agent products instead of simple subscriptions. Engineering intelligence vendors like Jellyfish, Waydev, and Faros AI are adding AI agent monitoring so leaders can connect developer tool usage to output and quality.

Existing observability and finance platforms want the same budget. Ramp has moved into AI spend management. Datadog and New Relic now offer features around AI cost tracking, token-level observability, GPU monitoring, and cloud cost management. AWS is expected to introduce enterprise AI financial management features at FinOps X.

The land grab makes sense. The buyer already has observability tools, finance systems, cloud cost tools, and engineering metrics platforms. The strongest products will likely fit into existing procurement, identity, tagging, and reporting workflows.

The limitation is obvious: vendors can show usage more easily than value. Token dashboards can identify runaway spend, but they won’t automatically tell you whether an agent improved software quality or wasted review time. The better tools will correlate AI usage with engineering and product outcomes. The weaker ones will produce colorful charts that finance teams use to cut access.

Model routing is becoming table stakes

One practical response is model routing.

Factory, an enterprise AI agent startup, recently launched a router that automatically chooses a model for each task. The idea is straightforward: don’t send every request to the most expensive frontier model if a smaller model can summarize a file, classify an issue, or draft boilerplate. Save the strongest model for planning, architecture reasoning, complex debugging, or ambiguous tasks.

This pattern is already familiar from OpenRouter-style systems and internal LLM gateways. A routing layer can enforce budgets, apply policies, choose providers, manage fallbacks, cache responses, and collect telemetry. It also gives platform teams a central point for security review, prompt filtering, and cost attribution.
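A routing layer of this kind can be sketched in a few lines. The example below is not Factory's router or any vendor's API; the task labels, tier names, and budget rule are all assumptions chosen to show the idea of routing by task type with a budget check and a cheaper fallback.

```python
from dataclasses import dataclass

# Hypothetical mapping from task type to model tier.
ROUTES = {
    "summarize_file": "small",
    "classify_issue": "small",
    "draft_boilerplate": "small",
    "plan": "frontier",
    "debug": "frontier",
}
DEFAULT_TIER = "frontier"  # unknown or ambiguous tasks go to the strongest model


@dataclass
class TeamBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd


def route(task_type: str, estimated_cost_usd: dict, budget: TeamBudget) -> str:
    """Pick a tier for the task, downgrading or refusing when over budget.

    estimated_cost_usd maps tier name to the estimated cost of this request.
    """
    tier = ROUTES.get(task_type, DEFAULT_TIER)
    if estimated_cost_usd[tier] <= budget.remaining():
        return tier
    # Preferred tier is unaffordable: try the small tier before refusing.
    if tier != "small" and estimated_cost_usd["small"] <= budget.remaining():
        return "small"
    raise RuntimeError(f"budget exhausted for task type {task_type!r}")


estimates = {"small": 0.01, "frontier": 0.5}
print(route("classify_issue", estimates, TeamBudget(limit_usd=10.0)))
print(route("plan", estimates, TeamBudget(limit_usd=1.0, spent_usd=0.9)))
```

Silently downgrading a planning task to a small model is itself a policy decision; some teams may prefer to fail loudly so that the budget owner sees the request rather than getting a worse answer without explanation.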

Frontier labs are likely to do more of this themselves. Vitaly Gordon, CEO of Faros AI, told TechCrunch that some Anthropic bills already show spend distributed across models even when users call a higher-end model, because the provider routes parts of the workload to cheaper models such as Sonnet or Haiku.

That can be good engineering. It can also reduce transparency if customers don’t understand which model handled which part of a task. For regulated workloads, vendor-managed routing raises questions about reproducibility, audit trails, and model-specific risk controls. A bank, healthcare company, or defense contractor may care exactly which model processed a prompt.

Routing saves money. It also introduces another layer that needs governance.

Standards could help, if they stay practical

The Tokenomics Foundation plans to define common language and metrics for AI token usage and billing. Its early goals include open standards, specifications, and measures such as cost-per-intelligence and tokens-per-watt, along with metrics for token factory effectiveness and consumption efficiency.

Some of that sounds squishy. “Cost-per-intelligence” will be hard to standardize without turning into benchmark theater. Intelligence varies by task, domain, latency constraint, tool access, context quality, and evaluation method. A model that performs well in one workflow may be wasteful in another.

The useful work is more basic: consistent usage fields, clearer billing definitions, shared schemas, better tagging, and practical ways to compare model consumption across vendors. Enterprises don’t need a grand theory of intelligence to manage runaway bills. They need to know which systems are spending money, why, and whether the output is worth it.
