LLM · April 10, 2025

Why Reasoning Models Are Making AI Benchmarking More Expensive

Reasoning models are making AI benchmarks expensive, and that’s a problem

AI labs keep releasing models that do better on multi-step math, coding, planning, and tool use. Fine. Testing them now costs a lot more than testing the older straight-to-answer models.

That matters. Benchmarking is still one of the few ways to check whether a model actually improved or just got better at sounding convincing. As reasoning models produce longer responses, evaluation bills rise with them. If you're paying per token through an API, the numbers get ugly fast.

The rough math is straightforward. A simple eval prompt might have once produced 10 to 50 output tokens. A reasoning-heavy run can hit 500 to 1,000 tokens, sometimes higher, because the model dumps intermediate steps, explanations, and justification. Multiply that across thousands of prompts, multiple seeds, retries, variants, and vendor comparisons, and "run the eval suite" starts looking like a finance problem.
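That multiplication can be sketched directly. A minimal back-of-envelope cost model, using illustrative per-million-token prices (assumptions, not any vendor's actual rates):

```python
# Back-of-envelope eval cost. Prices are illustrative assumptions,
# not quoted vendor rates.
PRICE_PER_M_INPUT_TOKENS = 3.00    # USD per million input tokens (assumed)
PRICE_PER_M_OUTPUT_TOKENS = 15.00  # USD per million output tokens (assumed)

def eval_cost(prompts: int, seeds: int, in_tokens: int, out_tokens: int) -> float:
    """Total USD for one full suite run: calls x tokens x price."""
    calls = prompts * seeds
    input_cost = calls * in_tokens * PRICE_PER_M_INPUT_TOKENS / 1_000_000
    output_cost = calls * out_tokens * PRICE_PER_M_OUTPUT_TOKENS / 1_000_000
    return input_cost + output_cost

# Short-answer era: 5,000 prompts, 3 seeds, ~50 output tokens per answer.
short = eval_cost(5_000, 3, 500, 50)
# Reasoning era: same suite, richer prompts, ~1,000 output tokens per answer.
long = eval_cost(5_000, 3, 1_500, 1_000)
print(round(short, 2), round(long, 2), round(long / short, 1))  # 33.75 292.5 8.7
```

Same suite, nearly a 9x jump, before counting retries, prompt variants, or vendor comparisons.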

Why these evals cost more

The first reason is obvious. These models generate more tokens.

The second reason is that the tasks themselves are harder. Older benchmark setups often relied on compact outputs and cheap scoring. A classification label. A short extraction. A constrained multiple-choice answer. You could score that at scale without much fuss.

Reasoning benchmarks look different. They often include:

  • multi-step math or symbolic problems
  • code generation and debugging
  • planning tasks with intermediate state
  • longer context windows
  • richer prompts that look more like actual workflows

That changes the cost on both sides. Prompts are larger. Outputs are larger. On paid APIs, both count. An eval pipeline that used to be mostly bookkeeping can turn into a real line item.

There's also a multiplier people tend to understate. Good evaluation usually means repeated runs to smooth variance, compare prompt versions, check regressions across model snapshots, and validate against fresh holdout sets. Reasoning models make every one of those loops pricier.

A bottleneck the field didn't need

Expensive evaluation has a predictable effect. It pushes power toward the groups that can afford it.

Big labs can absorb rising benchmark costs. Universities, startups, open-source teams, and independent researchers often can't. That weakens one of the few external checks the industry still has.

If only a small set of companies can afford thorough evals, they also get to dominate the claims. They publish the charts, set the test conditions, and frame the story about who's ahead. Smaller groups can still build strong systems, but proving it gets harder.

That's bad for reproducibility. It's also bad for engineering culture. AI already has enough leaderboard theater. Making verification expensive gives marketing teams even more space to outrun the evidence.

Longer reasoning traces don't automatically help

There's another problem hiding inside the cost argument. More output doesn't necessarily make an evaluation better.

A model can produce a long reasoning trace and still get the answer wrong. It can generate persuasive filler. It can overfit benchmark style. It can produce tidy-looking steps that say little about how it actually arrived at the answer. Anyone who works with current LLMs has seen how loose the link is between verbosity and reliability.

So benchmarks are getting more expensive partly because models emit more text, even when evaluators don't need most of that text.

For many tasks, what matters is:

  • final answer correctness
  • code execution success
  • pass rate on tests
  • consistency across runs
  • tool-use behavior under constraints

The full reasoning transcript can help with debugging. For routine scoring, it's often overkill. Paying for every intermediate sentence during large eval runs is a bad default.

That's why some labs have started separating reasoning tokens from visible answers. Even if a model uses internal scratch space, the benchmark doesn't always need that scratch space exposed.

Treat evals like production systems

A lot of teams still handle model evaluation like a research chore. That was easier to justify when outputs were short and costs were low. It isn't now.

If you're shipping LLM features in production, eval infrastructure needs the same discipline as any other cost-sensitive system. Sampling strategy, caching, instrumentation, budget controls. The usual stuff.

A few practical moves matter.

Use tiered evaluation

Run cheap smoke tests often. Run the expensive reasoning suite when the model change is big enough to justify it.

For example:

  • every commit: small regression set, deterministic prompts, local or cheap model judge
  • daily or release candidate: larger benchmark sample
  • major model or prompt update: full reasoning eval with repeated trials

That won't satisfy every benchmark purist. It will save money and still catch most regressions.
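The tiering above can be expressed as a small dispatch rule. A sketch, assuming hypothetical tier names and change flags; thresholds and tier contents are choices your team would make, not a standard:

```python
# Sketch of a tiered eval gate. Tier names and the Change fields are
# hypothetical; wire pick_tier's result into your own harness.
from dataclasses import dataclass

@dataclass
class Change:
    is_release_candidate: bool
    touches_model_or_prompt: bool

def pick_tier(change: Change) -> str:
    if change.touches_model_or_prompt:
        return "full_reasoning_suite"    # repeated trials, expensive judge
    if change.is_release_candidate:
        return "daily_benchmark_sample"  # larger sample, cheaper scoring
    return "commit_smoke_set"            # small, deterministic, local judge

print(pick_tier(Change(False, False)))  # commit_smoke_set
```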

Separate grading from generation

If a task can be scored with exact match, unit tests, execution, or a compact rubric, do that. Don't pay one model to produce a thousand-token answer and then pay another model to judge it unless there's no cleaner option.

Code benchmarks make this obvious. If generated code passes tests in a sandbox, that's strong evidence. If it fails, a polished explanation doesn't add much.
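Execution-based grading is cheap to sketch. A minimal version that runs candidate code plus its tests in a subprocess, with the caveat that a real harness would sandbox this properly rather than executing untrusted code on the host:

```python
# Minimal execution-based grader: run generated code against its unit tests
# in a subprocess instead of paying a judge model to read it.
# WARNING: a real harness must sandbox untrusted code; this sketch does not.
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True iff candidate plus tests runs and exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(passes_tests(good, tests), passes_tests(bad, tests))  # True False
```

Pass/fail from the sandbox is the signal; the model's prose about its own code never enters the scoring path.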

Sample intelligently

Exhaustive runs sound rigorous, but once costs rise, smarter sampling usually wins. Stratify by difficulty. Keep a stable core set for trend tracking. Rotate in fresh examples to limit benchmark overfitting. Spend the expensive runs where uncertainty is highest.
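One way to sketch that sampling policy: keep the stable core set, then spend the leftover budget across difficulty strata with more weight on hard examples, where uncertainty is highest. The weights and level names here are illustrative assumptions:

```python
# Stratified sampling sketch: stable core for trend tracking, then spend the
# remaining budget across difficulty strata. Weights are assumptions.
import random

def stratified_sample(items, core_ids, budget, seed=0):
    """items: list of (id, level) pairs, level in {'easy','medium','hard'}."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    chosen = [i for i in items if i[0] in core_ids]
    remaining = budget - len(chosen)
    pool = [i for i in items if i[0] not in core_ids]
    by_level = {}
    for item in pool:
        by_level.setdefault(item[1], []).append(item)
    # Weight expensive runs toward hard examples.
    weights = {"easy": 1, "medium": 2, "hard": 3}
    total = sum(weights[lvl] for lvl in by_level)
    for lvl, bucket in by_level.items():
        take = min(len(bucket), round(remaining * weights[lvl] / total))
        chosen += rng.sample(bucket, take)
    return chosen
```

The fixed seed matters: trend lines are only meaningful if today's sample is drawn the same way as last month's.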

Cache aggressively

Prompt-template changes, deterministic tool outputs, and reused system instructions create easy caching opportunities. Teams still skip them and pay to recompute nearly identical benchmark calls. That's waste.
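The caching itself is not exotic: key each benchmark call on a hash of everything that affects the output, and only hit the paid API on a miss. A minimal in-memory sketch (a real pipeline would persist the cache to disk):

```python
# Cache sketch: key benchmark calls on a hash of everything that affects
# the output, so reruns with unchanged inputs skip the paid API call.
import hashlib
import json

def cache_key(model, system_prompt, prompt, temperature, tool_outputs):
    payload = json.dumps(
        {"m": model, "s": system_prompt, "p": prompt,
         "t": temperature, "tools": tool_outputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

_cache = {}  # in-memory only; persist to disk in a real pipeline

def cached_call(key, call):
    """call is a zero-arg function that makes the real (paid) API request."""
    if key not in _cache:
        _cache[key] = call()
    return _cache[key]
```

Note that temperature and tool outputs are part of the key: anything that can change the response has to be, or the cache silently serves stale results.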

Open source has an opening

This cost pressure creates a real opening for open-source tooling.

The ecosystem already has parts of the answer: Hugging Face evaluation stacks, EleutherAI-style community measurement, local inference pipelines, and benchmark runners that work without proprietary APIs. None of that fully solves reasoning evals, but it lowers the barrier.

There's also room for shared benchmarking infrastructure. If labs and institutions pool datasets, runners, and compute budgets, independent verification gets cheaper for everyone. Public-interest benchmarking isn't glamorous next to frontier model training. It may matter more over time.

A healthy ecosystem needs institutions that test claims, not just make them.

Local and hybrid evals are going to look better this year

For engineering teams, the practical answer is often hybrid.

Use local models or self-hosted inference for broad regression sweeps. Save cloud reasoning models for narrower, high-value validation. If you only need deep evaluation on 5 percent of your test matrix, stop paying frontier-model prices on the other 95 percent.

That can be as simple as:

    if release_candidate and high_risk_change:
        run_reasoning_eval(expensive_api_model)
    else:
        run_regression_eval(local_model_or_rules)

The specifics vary, but the pattern is clear. Spend where uncertainty is expensive. Save where the signal is already good enough.

This also helps with privacy and compliance. If your eval set includes sensitive internal tickets, proprietary code, or regulated data, local execution avoids a separate set of problems.

What technical leads should plan for

If you manage AI products, evaluation costs are becoming part of normal planning. A recurring operating cost, not some odd research expense.

That affects:

  • model vendor selection
  • prompt and response design
  • CI/CD workflows for LLM features
  • testing cadence
  • reproducibility standards

It also argues for shorter answers in production when possible. Verbose models are expensive twice: once when serving users, and again when validating updates.

It's easy to assume the big labs will handle this. They won't solve the whole problem for everyone else. Benchmark access shapes who gets to participate in model development, who can challenge published claims, and who can trust the numbers.

Reasoning models may earn the hype in some domains. But if testing them gets too expensive for anyone outside the largest labs, the field gets less scientific and more theatrical. That's a bad trade.
