AI is starting to solve serious math, and the stack behind it matters more than the headline
AI-assisted math results used to sound like stunts. That’s getting harder to say.
Since Christmas, 15 Erdős-style open problems have reportedly been moved into the solved column, and 11 of those involved AI in some meaningful way. Terence Tao has been tracking the activity publicly. Over the weekend, software engineer and former quant researcher Neel Somani posted a new example: he gave an Erdős-flavored number theory problem to OpenAI’s latest model, let it run for about 15 minutes, and got back a proof that survived formalization with Harmonic’s Aristotle tool.
The important part is the verification. Models have been able to produce plausible-looking math for a while. A proof that can survive a verifier is a different class of result.
That’s the shift worth watching. The model is one piece of a proof pipeline that includes retrieval, symbolic tooling, and formal verification. That stack is now good enough to make progress on nontrivial problems.
What happened
Somani’s example wasn’t a toy problem or an olympiad rerun. The model used tools from number theory such as Legendre’s formula, Bertrand’s postulate, and the Star of David theorem, then assembled them into a proof for a variant tied to Erdős’ work on products of central binomial coefficients. The structure reportedly differed from Noam Elkies’ 2013 MathOverflow solution, which makes it more interesting than retrieval plus paraphrase.
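For a sense of what the symbolic-checking stage looks like in practice, here is a small SymPy sanity check of Legendre's formula, one of the classical tools named above. This is a sketch illustrating the formula itself, not a reconstruction of anything in Somani's pipeline.

```python
# Legendre's formula: the exponent of a prime p in n! equals
# sum over k >= 1 of floor(n / p^k).
from sympy import factorial, multiplicity, primerange

def legendre(n: int, p: int) -> int:
    """Exponent of the prime p in n!, computed via Legendre's formula."""
    total, power = 0, p
    while power <= n:
        total += n // power
        power *= p
    return total

# Cross-check against direct factorization of n! on small cases.
for n in (10, 50, 100):
    for p in primerange(2, 20):
        assert legendre(n, p) == multiplicity(p, factorial(n))
print("Legendre's formula checks out on small cases")
```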
Tao’s running tally gives the broader picture. He’s tracking eight Erdős problems where AI made meaningful autonomous progress, plus six where models found prior work and extended it. A Gemini-powered system called AlphaEvolve seems to have started an earlier wave in November. Then GPT-5.2 arrived, and the pace of solved entries appears to have picked up.
Tao’s summary was blunt:
“Many of these easier Erdős problems are now more likely to be solved by purely AI-based methods than by human or hybrid means.”
That fits the pattern so far. These systems aren’t tearing through deep unsolved mathematics. They are getting good at the long tail of obscure but tractable problems, where the hard part is finding the right prior result, applying it cleanly, and avoiding a small fatal mistake along the way.
That’s still significant.
Why this works now
People keep looking for a single trick behind modern reasoning systems. This looks much more like systems engineering.
The workflow is roughly:
- Search relevant literature and forum posts.
- Build a candidate proof plan.
- Test algebraic or combinatorial steps with symbolic tools.
- Translate the argument into something a proof assistant can check.
- Repair failed steps until the verifier accepts the result.
Each stage compensates for the weaknesses of the others.
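As a rough sketch of how those stages compose, here is the control loop in Python, with each component injected as a callable. Every name below is a placeholder chosen for illustration; none of it is the actual API of the tools mentioned here.

```python
from typing import Callable, Optional

# Hypothetical stage signatures -- assumptions for illustration only.
Generator = Callable[[str, str], str]        # (problem, feedback) -> draft proof
Formalizer = Callable[[str], str]            # natural language -> Lean source
Checker = Callable[[str], tuple[bool, str]]  # Lean source -> (ok, error message)

def prove(problem: str,
          generate: Generator,
          formalize: Formalizer,
          check: Checker,
          max_repairs: int = 5) -> Optional[str]:
    """Generate-check-repair loop: only verifier-accepted output survives."""
    feedback = ""
    for _ in range(max_repairs):
        draft = generate(problem, feedback)  # LLM proposes or repairs a proof
        lean_source = formalize(draft)       # translate into formal syntax
        ok, error = check(lean_source)       # the verifier has the final say
        if ok:
            return lean_source               # a machine-checked proof object
        feedback = error                     # feed the failure back into generation
    return None                              # nothing verified: the output fails
```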
LLMs are good at pattern-matching across huge mathematical corpora. They’re still unreliable when asked to carry a long proof cleanly in free text. So the text generation gets wrapped in guardrails. Retrieval finds useful lemmas. SymPy, SageMath, or another computer algebra system checks symbolic steps. Lean enforces exact definitions and rejects hand-waving. Aristotle helps bridge the awkward gap between natural-language proof writing and formal syntax.
That changes the failure mode.
Without verification, you end up arguing over whether the reasoning looks sound. With formalization, the standard is harder. Either the lemma type-checks and compiles in Lean, or it doesn’t. There can still be problems upstream, especially around theorem framing or imported prior work, but the final proof object is much less slippery than a polished chain of thought.
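For a concrete feel of that standard, here is a deliberately trivial Lean 4 snippet. It stands in for the far larger proof objects involved in this work; the point is only that Lean either accepts the term or it doesn't.

```lean
-- Lean accepts this theorem only because every step type-checks;
-- a hand-wavy version simply would not compile. (Plain Lean 4, no Mathlib.)
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```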
Anyone building production AI systems should recognize the pattern. Free-form generation is useful. Verified generation is useful in a more operational way.
GPT-5.2 looks better at the boring hard parts
OpenAI hasn’t published internals that explain this jump, so most of the analysis is behavioral. A few things stand out.
Longer reasoning traces seem more stable. Letting a model “think” for 15 minutes used to mean paying for a wandering answer that often fell apart near the end. Newer systems are holding multi-step plans together better, which suggests improvements in long-context attention, scratchpad management, or both.
The mathematical priors also look sharper. In these reported examples, the models aren’t just citing classical results. They’re selecting the right ones and using them in ways that survive checking.
Retrieval seems less sloppy too. A lot of failed research automation comes down to a simple problem: the model finds something adjacent, treats it as the exact result it needs, and builds on sand. Better systems are getting more precise about whether an old MathOverflow thread already solves the problem, only handles a special case, or offers a tool that can be adapted.
That may sound mundane. It isn’t. Precision in retrieval is the difference between research assistance and expensive nonsense.
The contamination problem is still there
There’s a reason serious mathematicians are separating “autonomous progress” from “found prior research and extended it.”
Training data contamination remains a live issue. If a model has already seen partial solutions, discussion threads, or textbook lemmas that nearly solve the problem, then it may be doing very competent synthesis rather than genuine discovery. That still has value. It’s just a different kind of achievement.
The same distinction matters outside math. A coding agent that stitches together a correct fix from old issues, docs, and past commits is useful. An agent that comes up with a genuinely new algorithmic approach is something else, and much rarer.
Right now, a lot of the visible wins seem to come from stronger search, better composition, and tighter verification. That’s enough to change practice. It doesn’t mean models have turned into independent mathematicians.
Why engineers should care
The immediate takeaway for engineering teams isn’t that AI will solve the Riemann hypothesis. It’s that verified reasoning workflows are starting to look practical.
The same architecture maps onto less glamorous and often more expensive work:
- code security analysis backed by reproducible checks
- compliance reasoning with citation trails
- resource allocation and scheduling logic in regulated systems
- scientific and technical literature synthesis where provenance matters
- API migration or refactoring plans that need machine-checked invariants
A mature version of this stack looks a lot like a serious CI pipeline for reasoning. Retrieval is externalized and auditable. Intermediate steps get tested. Final claims go through a verifier. If verification fails, the output fails.
That’s a better mental model than “chatbot, but smarter.”
It also helps explain why proof assistants like Lean, Coq, and Isabelle, along with SMT solvers such as Z3 and domain-specific checkers, are getting closer to mainstream engineering stacks. They belong in the toolchain when the cost of being wrong is high.
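To make the machine-checked-invariant idea from the list above concrete, here is a minimal example using Z3's Python bindings (pip install z3-solver). The property is deliberately simple; it shows the shape of a verified claim, not a production workflow.

```python
# We assert the NEGATION of the invariant we want. If Z3 reports unsat,
# no counterexample exists, so the invariant holds for all integers
# satisfying the precondition.
from z3 import Ints, Solver, unsat

x, y = Ints("x y")
s = Solver()
s.add(x >= 0, y >= 0)  # precondition: non-negative inputs
s.add(x + y < x)       # negation of the invariant x + y >= x
assert s.check() == unsat
print("invariant verified: x + y >= x whenever x, y >= 0")
```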
What to watch
The risk is overreading early technical progress as operational proof. In scientific or health-adjacent settings, reliability, validation, data quality, and expert review matter more than a clean product story. The useful question is where the system reduces friction without weakening accountability.