OpenAI says AI hallucinations persist because we keep scoring models the wrong way
OpenAI’s latest research makes a blunt point: large language models keep making things up because the industry still rewards them for guessing.
That sounds obvious, but it cuts against the way most models are built, tuned, and benchmarked. The standard scoreboard still centers on raw accuracy. Get the answer right, you score a point. Admit uncertainty, you often get nothing. Under that setup, a model that guesses confidently and gets lucky can look better than one that knows when to stop.
For anyone shipping AI into coding tools, internal search, customer support, or analytics workflows, that’s a product problem, not an academic one.
Why the argument holds up
OpenAI ties hallucinations to incentives at two levels.
First, pretraining teaches a model to predict the next token, not to check whether a claim is true. That works well for fluency and pattern completion. It does not produce a truth machine. Ask for a niche fact, like a dissertation title or a date of birth that barely shows up in training data, and the model may give you a polished answer with no grounding behind it.
Second, post-training often pushes models to be helpful, coherent, and responsive. Fine. But in practice, “helpful” often slides into “always answer.” If your reward model prefers polished assertions over hesitation, you’re training the system to treat uncertainty as failure.
Then evals finish the job. Most benchmarks still treat abstaining as a miss. So the model learns the obvious lesson: answer anyway.
OpenAI’s fix is straightforward (a toy scoring rule after the list makes it concrete):
- penalize confident wrong answers harder
- give credit for calibrated uncertainty
- make abstention-aware scoring a primary metric, not a side metric
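To make that concrete, here is a minimal sketch of an abstention-aware scoring rule. The +1 / 0 / −2 values are illustrative assumptions, not OpenAI’s published rubric:

```python
def score_answer(answer: str | None, truth: str, wrong_penalty: float = 2.0) -> float:
    """Abstention-aware scoring: correct answers earn credit, abstentions are
    neutral, and confident wrong answers cost more than staying silent."""
    if answer is None:       # the model abstained ("I don't know")
        return 0.0
    if answer == truth:      # a correct claim earns full credit
        return 1.0
    return -wrong_penalty    # a confident error is penalized harder than silence

# With a penalty of 2.0, guessing only beats abstaining when the model is
# right more than two-thirds of the time; below that, "I don't know" wins.
```

Under the old scoreboard, where a wrong answer and an abstention both score zero, guessing always dominates. That is exactly the incentive OpenAI is attacking.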
That matters because current leaderboards hide an ugly trade-off. A model can improve its benchmark score by suppressing uncertainty and guessing more often, even if that makes it less reliable in real use.
Accuracy alone is a weak metric for decision systems
This isn’t new in statistics or machine learning. The LLM world has just been slow to absorb it.
If a system has to act under uncertainty, raw correctness only tells part of the story. You also need to know whether its confidence matches reality. A model that says “95% sure” and is right 60% of the time is badly calibrated. A model that says “I’m not sure” and routes to retrieval may look worse on a benchmark and work better in production.
That’s why OpenAI points to tools like the following, two of which are sketched in code after the list:
- Brier score, which measures the error in predicted probabilities
- log score or negative log-likelihood, which punishes overconfident mistakes hard
- ECE or Expected Calibration Error, which checks whether confidence lines up with observed accuracy
- risk-coverage metrics, which show how error changes as the model answers fewer or more questions
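As a minimal NumPy sketch of the first and third metrics, using the text’s “95% sure, right 60% of the time” example (the bin count and toy data are assumptions):

```python
import numpy as np

def brier_score(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared error between stated probability and the 0/1 outcome."""
    return float(np.mean((confidence - correct) ** 2))

def expected_calibration_error(confidence, correct, n_bins: int = 10) -> float:
    """Bucket predictions by confidence, then compare each bucket's average
    confidence to its observed accuracy, weighted by bucket size."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# A model that says "95% sure" on 100 questions but is right on only 60:
conf = np.full(100, 0.95)
hits = np.array([1] * 60 + [0] * 40)
print(brier_score(conf, hits))                 # ~0.36: heavy penalty for overconfidence
print(expected_calibration_error(conf, hits))  # 0.35: a huge confidence-accuracy gap
```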
These aren’t exotic ideas. They’ve been standard in forecasting, medical decision systems, and selective prediction research for years. AI product teams are paying attention now because hallucinations have moved from funny demo failures to expensive operational failures.
A coding assistant that invents a library method wastes time. A data analysis agent that confidently misstates a metric can poison a dashboard or a decision memo. A legal or healthcare assistant that guesses is worse than useless.
“I don’t know” has to be allowed
That sounds small. It isn’t.
Most product teams still optimize for answer rate because silence feels bad. Nobody wants an assistant that keeps refusing. But there’s a real difference between a dead-end refusal and a useful abstention.
A good uncertainty-aware system does one of three things when confidence drops (see the sketch after this list):
- it asks a clarifying question
- it retrieves evidence
- it hands off to a tool or human
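A minimal sketch of that control flow, with a placeholder threshold and input signals that would need tuning per product:

```python
from dataclasses import dataclass

@dataclass
class Route:
    action: str   # "answer", "clarify", "retrieve", or "handoff"
    reason: str

def route(confidence: float, ambiguous: bool, has_retrieval: bool) -> Route:
    """Illustrative gating logic for an uncertainty-aware assistant.
    The 0.9 threshold and the boolean inputs are assumptions, not a spec."""
    if confidence >= 0.9:
        return Route("answer", "high confidence, answer directly")
    if ambiguous:
        return Route("clarify", "ask a clarifying question before answering")
    if has_retrieval:
        return Route("retrieve", "fetch evidence, then answer with citations")
    return Route("handoff", "escalate to a tool or a human")
```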
That missing control flow matters. Teams often talk about “reducing hallucinations” as if it’s purely a model problem. A lot of the time it’s a system design problem.
If your app forces the model to answer every prompt in one shot, you’ve already made a bad product decision.
Evals and leaderboards will have to change
Expect more pressure on benchmark culture.
A lot of headline model comparisons are built around single-number rankings. They’re easy to tweet, easy to market, and often misleading. If uncertainty-aware evaluation becomes standard, vendors will have to report something messier and more honest, like:
- 97% accuracy at 85% coverage
- 3% ECE after calibration
- lower area under the risk-coverage curve
That’s less tidy than “our model scores 89.7 on benchmark X,” but it’s closer to how serious systems are evaluated elsewhere.
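Numbers like “97% accuracy at 85% coverage” fall out of a simple selective-prediction computation. A sketch, assuming per-question confidence scores are available:

```python
import numpy as np

def accuracy_at_coverage(confidence, correct, coverage: float) -> float:
    """Answer only the `coverage` fraction of questions where the model is most
    confident, abstain on the rest, and report accuracy on the answered subset.
    Sweeping coverage from 0 to 1 traces out the risk-coverage curve."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    k = max(1, int(round(coverage * len(confidence))))
    answered = np.argsort(-confidence)[:k]   # indices of the top-k most confident
    return float(correct[answered].mean())

# Given per-question `conf` scores and 0/1 `hits`, sweep the trade-off:
# for c in (1.0, 0.85, 0.5):
#     print(f"{accuracy_at_coverage(conf, hits, c):.1%} accuracy at {c:.0%} coverage")
```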
This will also make marketing harder. Good. A model that answers every question with confidence is not obviously better than one that declines the hardest 20% and gets the rest right.
There’s an obvious risk on the other side. Some vendors will overcorrect and ship models that abstain too often, especially in enterprise settings where legal and compliance teams already prefer false negatives to false positives. That can wreck usability. Nobody wants an assistant that punts on routine tasks.
So the job is calibration by domain and risk, not maximum abstention.
The signals are already there
One useful part of OpenAI’s argument is that teams don’t need some magical new architecture to start acting on this.
Most modern LLM stacks already expose uncertainty signals you can work with:
- token logprobs
- entropy over candidate tokens
- margin between the top few completions
- disagreement across multiple sampled outputs
- retrieval confidence or citation overlap
- tool verification results, like test execution or API checks
None of those signals is perfect on its own. Token probability is a rough proxy, not proof of truth. Self-consistency can still converge on the same wrong answer. Retrieval scores can look solid even when the source corpus is stale or noisy.
Still, combined well, they’re good enough to drive gating logic, as the sketch below shows.
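A minimal sketch of two of those signals, assuming the API exposes top-k (token, logprob) alternatives per position, as several providers do; the thresholds are placeholders:

```python
import math
from collections import Counter

def mean_token_entropy(token_alternatives: list[list[tuple[str, float]]]) -> float:
    """Average entropy over per-token candidate distributions, given top-k
    (token, logprob) alternatives per position. High entropy means the model
    kept hesitating between candidates while generating."""
    entropies = []
    for candidates in token_alternatives:
        probs = [math.exp(lp) for _, lp in candidates]
        total = sum(probs)
        probs = [p / total for p in probs]   # renormalize the top-k slice
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)

def self_consistency(samples: list[str]) -> float:
    """Fraction of independently sampled answers that agree with the majority.
    A cheap disagreement signal, with the caveat above: samples can still
    converge on the same wrong answer."""
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

def should_gate(entropy: float, agreement: float) -> bool:
    """Combine the signals into one gate; both thresholds are assumptions."""
    return entropy > 1.0 or agreement < 0.6
```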
For a coding assistant, that might mean retrieving docs or running tests before answering when the model looks uncertain about an API call. For an analytics copilot, it might mean asking a follow-up or inspecting warehouse metadata before answering a query that depends on schema assumptions. For customer support, it might mean requiring citations from the internal KB before giving policy answers.
That’s where this lands for builders: stop forcing a generator to act like it knows things it doesn’t.
Training will have to follow
The eval change matters because evals become training targets.
If the benchmark rewards calibrated uncertainty, RLHF and RLAIF pipelines will start rewarding it too. Preference labels can explicitly favor the following (a toy rubric is sketched after the list):
- honest uncertainty
- asking for missing context
- using retrieval before answering
- refusing unsupported factual claims
- verifying outputs with tools when available
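As a toy illustration of how such a rubric could be encoded for human labelers or an AI judge (the fields and weights are assumptions, not a production RLHF spec):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    correct: bool            # verified against ground truth or tool checks
    hedged: bool             # stated uncertainty where it was warranted
    asked_for_context: bool  # requested missing information
    used_retrieval: bool     # grounded the answer in fetched evidence
    unsupported_claim: bool  # asserted a fact with no grounding

def preference_score(c: Candidate) -> int:
    """Toy rubric for ranking candidate responses during preference labeling."""
    score = 0
    if c.correct:           score += 3
    if c.used_retrieval:    score += 2
    if c.asked_for_context: score += 1
    if c.hedged:            score += 1   # honest uncertainty is rewarded, not punished
    if c.unsupported_claim: score -= 4   # confident fabrication costs the most
    return score

# Label a preference pair by whichever candidate scores higher.
```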
That will probably produce assistants that feel a little less smooth at first. More caveats. More follow-up questions. More pauses for evidence.
In high-risk domains, that’s a feature. In low-risk chat, it may just be annoying. Product teams will need mode switching, not one global behavior. A coding copilot in autocomplete mode should behave differently from an AI assistant helping with financial reporting.
One policy won’t fit all tasks, and one metric won’t either.
The math is familiar. The hard part is cultural.
Benchmarks like simplicity. Product teams like completion rates. Execs like a single score. Users often say they want certainty even when the certainty is fake. All of that pushes models toward over-answering.
So OpenAI is probably right about the diagnosis. Whether the industry follows through is a different question.
Publishing a paper on calibrated uncertainty is easy enough. Accepting a weaker-looking public leaderboard score once you stop rewarding lucky guesses is harder. Redesigning UX around abstention, retrieval, and escalation paths is harder again.
But that’s the work if AI is going to sit inside real business processes.
For developers shipping LLM features now, the near-term playbook is clear:
- add an explicit abstain path
- score accuracy at different coverage levels
- track calibration, not just correctness
- punish confident errors in evals
- route uncertain cases to retrieval, tools, or humans
If your current benchmark gives the same penalty to “I don’t know” and “here’s a fabricated answer,” your eval is broken. And if your eval is broken, your product incentives probably are too.