OpenAI says AI hallucinations persist because we keep scoring models the wrong way
OpenAI’s latest research makes a blunt point: large language models keep making things up because the industry still rewards them for guessing.
That sounds obvious, but it cuts against the way most models are built, tuned, and benchmarked. The standard scoreboard still centers on raw accuracy. Get the answer right, you score a point. Admit uncertainty, you often get nothing. Under that setup, a model that guesses confidently and gets lucky can look better than one that knows when to stop.
For anyone shipping AI into coding tools, internal search, customer support, or analytics workflows, that’s a product problem, not an academic one.
Why the argument holds up
OpenAI ties hallucinations to incentives at two levels.
First, pretraining teaches a model to predict the next token, not to check whether a claim is true. That works well for fluency and pattern completion. It does not produce a truth machine. Ask for a niche fact, like a dissertation title or a date of birth that barely shows up in training data, and the model may give you a polished answer with no grounding behind it.
Second, post-training often pushes models to be helpful, coherent, and responsive. Fine. But in practice, “helpful” often slides into “always answer.” If your reward model prefers polished assertions over hesitation, you’re training the system to treat uncertainty as failure.
Then evals finish the job. Most benchmarks still treat abstaining as a miss. So the model learns the obvious lesson: answer anyway.
OpenAI’s fix is straightforward (a toy scoring rule after the list makes it concrete):
- penalize confident wrong answers harder
- give credit for calibrated uncertainty
- make abstention-aware scoring a primary metric, not a side metric
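To make that concrete, here is a minimal sketch of an abstention-aware scoring rule. The +1 / 0 / −2 values are illustrative assumptions, not OpenAI’s published rubric:

```python
def score_answer(answer: str | None, truth: str, wrong_penalty: float = 2.0) -> float:
    """Abstention-aware scoring: correct answers earn credit, abstentions are
    neutral, and confident wrong answers cost more than staying silent."""
    if answer is None:       # the model abstained ("I don't know")
        return 0.0
    if answer == truth:      # a correct claim earns full credit
        return 1.0
    return -wrong_penalty    # a confident error is penalized harder than silence

# With a penalty of 2.0, guessing only beats abstaining when the model is
# right more than two-thirds of the time; below that, "I don't know" wins.
```

Under the old scoreboard, where a wrong answer and an abstention both score zero, guessing always dominates. That is exactly the incentive OpenAI is attacking.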
That matters because current leaderboards hide an ugly trade-off. A model can improve its benchmark score by suppressing uncertainty and guessing more often, even if that makes it less reliable in real use.
Accuracy alone is a weak metric for decision systems
This isn’t new in statistics or machine learning. The LLM world has just been slow to absorb it.
If a system has to act under uncertainty, raw correctness only tells part of the story. You also need to know whether its confidence matches reality. A model that says “95% sure” and is right 60% of the time is badly calibrated. A model that says “I’m not sure” and routes to retrieval may look worse on a benchmark and work better in production.
That’s why OpenAI points to tools like the following, two of which are sketched in code after the list:
- Brier score, which measures the error in predicted probabilities
- log score or negative log-likelihood, which punishes overconfident mistakes hard
- ECE or Expected Calibration Error, which checks whether confidence lines up with observed accuracy
- risk-coverage metrics, which show how error changes as the model answers fewer or more questions
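As a minimal NumPy sketch of the first and third metrics, using the text’s “95% sure, right 60% of the time” example (the bin count and toy data are assumptions):

```python
import numpy as np

def brier_score(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared error between stated probability and the 0/1 outcome."""
    return float(np.mean((confidence - correct) ** 2))

def expected_calibration_error(confidence, correct, n_bins: int = 10) -> float:
    """Bucket predictions by confidence, then compare each bucket's average
    confidence to its observed accuracy, weighted by bucket size."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# A model that says "95% sure" on 100 questions but is right on only 60:
conf = np.full(100, 0.95)
hits = np.array([1] * 60 + [0] * 40)
print(brier_score(conf, hits))                 # ~0.36: heavy penalty for overconfidence
print(expected_calibration_error(conf, hits))  # 0.35: a huge confidence-accuracy gap
```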
These aren’t exotic ideas. They’ve been standard in forecasting, medical decision systems, and selective prediction research for years. AI product teams are paying attention now because hallucinations have moved from funny demo failures to expensive operational failures.
A coding assistant that invents a library method wastes time. A data analysis agent that confidently misstates a metric can poison a dashboard or a decision memo. A legal or healthcare assistant that guesses is worse than useless.
“I don’t know” has to be allowed
That sounds small. It isn’t.
Most product teams still optimize for answer rate because silence feels bad. Nobody wants an assistant that keeps refusing. But there’s a real difference between a dead-end refusal and a useful abstention.
A good uncertainty-aware system does one of three things when confidence drops (see the sketch after this list):
- it asks a clarifying question
- it retrieves evidence
- it hands off to a tool or human
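A minimal sketch of that control flow, with a placeholder threshold and input signals that would need tuning per product:

```python
from dataclasses import dataclass

@dataclass
class Route:
    action: str   # "answer", "clarify", "retrieve", or "handoff"
    reason: str

def route(confidence: float, ambiguous: bool, has_retrieval: bool) -> Route:
    """Illustrative gating logic for an uncertainty-aware assistant.
    The 0.9 threshold and the boolean inputs are assumptions, not a spec."""
    if confidence >= 0.9:
        return Route("answer", "high confidence, answer directly")
    if ambiguous:
        return Route("clarify", "ask a clarifying question before answering")
    if has_retrieval:
        return Route("retrieve", "fetch evidence, then answer with citations")
    return Route("handoff", "escalate to a tool or a human")
```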
That missing control flow matters. Teams often talk about “reducing hallucinations” as if it’s purely a model problem. A lot of the time it’s a system design problem.
If your app forces the model to answer every prompt in one shot, you’ve already made a bad product decision.
Evals and leaderboards will have to change
Expect more pressure on benchmark culture.
A lot of headline model comparisons are built around single-number rankings. They’re easy to tweet, easy to market, and often misleading. If uncertainty-aware evaluation becomes standard, vendors will have to report something messier and more honest, like:
- 97% accuracy at 85% coverage
- 3% ECE after calibration
- lower area under the risk-coverage curve
That’s less tidy than “our model scores 89.7 on benchmark X,” but it’s closer to how serious systems are evaluated elsewhere.
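Numbers like “97% accuracy at 85% coverage” fall out of a simple selective-prediction computation. A sketch, assuming per-question confidence scores are available:

```python
import numpy as np

def accuracy_at_coverage(confidence, correct, coverage: float) -> float:
    """Answer only the `coverage` fraction of questions where the model is most
    confident, abstain on the rest, and report accuracy on the answered subset.
    Sweeping coverage from 0 to 1 traces out the risk-coverage curve."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    k = max(1, int(round(coverage * len(confidence))))
    answered = np.argsort(-confidence)[:k]   # indices of the top-k most confident
    return float(correct[answered].mean())

# Given per-question `conf` scores and 0/1 `hits`, sweep the trade-off:
# for c in (1.0, 0.85, 0.5):
#     print(f"{accuracy_at_coverage(conf, hits, c):.1%} accuracy at {c:.0%} coverage")
```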
This will also make marketing harder. Good. A model that answers every question with confidence is not obviously better than one that declines the hardest 20% and gets the rest right.
There’s an obvious risk on the other side. Some vendors will overcorrect and ship models that abstain too often, especially in enterprise settings where legal and compliance teams already prefer false negatives to false positives. That can wreck usability. Nobody wants an assistant that punts on routine tasks.
So the job is calibration by domain and risk, not maximum abstention.
The signals are already there
One useful part of OpenAI’s argument is that teams don’t need some magical new architecture to start acting on this.
Most modern LLM stacks already expose uncertainty signals you can work with:
- token logprobs
- entropy over candidate tokens
- margin between the top few completions
- disagreement across multiple sampled outputs
- retrieval confidence or citation overlap
- tool verification results, like test execution or API checks
None of those signals is perfect on its own. Token probability is a rough proxy, not proof of truth. Self-consistency can still converge on the same wrong answer. Retrieval scores can look solid even when the source corpus is stale or noisy.
Still, combined well, they’re good enough to drive gating logic, as the sketch below shows.
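A minimal sketch of two of those signals, assuming the API exposes top-k (token, logprob) alternatives per position, as several providers do; the thresholds are placeholders:

```python
import math
from collections import Counter

def mean_token_entropy(token_alternatives: list[list[tuple[str, float]]]) -> float:
    """Average entropy over per-token candidate distributions, given top-k
    (token, logprob) alternatives per position. High entropy means the model
    kept hesitating between candidates while generating."""
    entropies = []
    for candidates in token_alternatives:
        probs = [math.exp(lp) for _, lp in candidates]
        total = sum(probs)
        probs = [p / total for p in probs]   # renormalize the top-k slice
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

def self_consistency(samples: list[str]) -> float:
    """Fraction of independently sampled answers that agree with the majority.
    A cheap disagreement signal, with the caveat above: samples can still
    converge on the same wrong answer."""
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

def should_gate(entropy: float, agreement: float) -> bool:
    """Combine the signals into one gate; both thresholds are assumptions."""
    return entropy > 1.0 or agreement < 0.6
```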
For a coding assistant, that might mean retrieving docs or running tests before answering when the model looks uncertain about an API call. For an analytics copilot, it might mean asking a follow-up or inspecting warehouse metadata before answering a query that depends on schema assumptions. For customer support, it might mean requiring citations from the internal KB before giving policy answers.
That’s where this lands for builders: stop forcing a generator to act like it knows things it doesn’t.
Training will have to follow
The eval change matters because evals become training targets.
If the benchmark rewards calibrated uncertainty, RLHF and RLAIF pipelines will start rewarding it too. Preference labels can explicitly favor the following (a toy rubric is sketched after the list):
- honest uncertainty
- asking for missing context
- using retrieval before answering
- refusing unsupported factual claims
- verifying outputs with tools when available
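As a toy illustration of how such a rubric could be encoded for human labelers or an AI judge (the fields and weights are assumptions, not a production RLHF spec):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    correct: bool            # verified against ground truth or tool checks
    hedged: bool             # stated uncertainty where it was warranted
    asked_for_context: bool  # requested missing information
    used_retrieval: bool     # grounded the answer in fetched evidence
    unsupported_claim: bool  # asserted a fact with no grounding

def preference_score(c: Candidate) -> int:
    """Toy rubric for ranking candidate responses during preference labeling."""
    score = 0
    if c.correct:           score += 3
    if c.used_retrieval:    score += 2
    if c.asked_for_context: score += 1
    if c.hedged:            score += 1   # honest uncertainty is rewarded, not punished
    if c.unsupported_claim: score -= 4   # confident fabrication costs the most
    return score

# Label a preference pair by whichever candidate scores higher.
```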
That will probably produce assistants that feel a little less smooth at first. More caveats. More follow-up questions. More pauses for evidence.
In high-risk domains, that’s a feature. In low-risk chat, it may just be annoying. Product teams will need mode switching, not one global behavior. A coding copilot in autocomplete mode should behave differently from an AI assistant helping with financial reporting.
One policy won’t fit all tasks, and one metric won’t either.
The math is familiar. The hard part is cultural.
Benchmarks like simplicity. Product teams like completion rates. Execs like a single score. Users often say they want certainty even when the certainty is fake. All of that pushes models toward over-answering.
So OpenAI is probably right about the diagnosis. Whether the industry follows through is a different question.
Publishing a paper on calibrated uncertainty is easy enough. Accepting a weaker-looking public leaderboard score once you stop rewarding lucky guesses is harder. Redesigning UX around abstention, retrieval, and escalation paths is harder again.
But that’s the work if AI is going to sit inside real business processes.
For developers shipping LLM features now, the near-term playbook is clear:
- add an explicit abstain path
- score accuracy at different coverage levels
- track calibration, not just correctness
- punish confident errors in evals
- route uncertain cases to retrieval, tools, or humans
If your current benchmark gives the same penalty to “I don’t know” and “here’s a fabricated answer,” your eval is broken. And if your eval is broken, your product incentives probably are too.