
Harvard-led Science study finds OpenAI o1 beat physicians on ER diagnosis

Harvard’s ER diagnosis study shows where medical AI is getting dangerous to dismiss

A Harvard-led study published this week in Science found something that would have sounded far-fetched not long ago: OpenAI’s o1 model produced more accurate emergency-room diagnoses than two human physicians on a real hospital case set.

That result is real. It also needs context.

The study examined 76 patients who came through the emergency department at Beth Israel Deaconess Medical Center. Researchers compared diagnoses from two internal medicine attending physicians with diagnoses generated by OpenAI’s o1 and GPT-4o models. Two other attending physicians, blinded to the source, scored the answers. At the earliest decision point, ER triage, o1 gave the exact diagnosis, or something very close to it, in 67% of cases. The two physicians scored 55% and 50%.

Those numbers are striking. They’re also narrower than the headlines suggest.

The model wasn’t standing at a bedside doing emergency medicine. It got text pulled from the electronic medical record at each stage of care. No imaging. No physical exam. No visual cues. No sense of how sick someone looks when they try to answer a simple question. And the human comparison matters, because these were internal medicine attendings, not emergency medicine attendings.

Still, this paper deserves attention from people building AI systems, clinical software, or hospital tooling. It doesn’t prove that AI can practice medicine. It does show that for structured, text-heavy diagnostic reasoning under uncertainty, frontier models have reached a level where brushing them off looks naive.

What stands out technically

One of the most important details is also one of the least flashy. The researchers say they did not pre-process the data before feeding it to the models. The AI got the same information available in the EMR at the time of diagnosis.

That matters. A lot of medical AI demos quietly make life easier for the model. They clean up notes, summarize findings, strip out noise, or turn ugly charts into tidy prompts. Fine for product work, less fine for benchmarking. This setup is closer to the real thing: give the model the chart as clinicians actually encounter it.

And o1 did best when the information was thinnest and the pressure highest, at first-touch triage. That tracks with what plenty of engineers have already seen in other domains. Reasoning models tend to do well when they have to synthesize incomplete evidence, hold several hypotheses open, and avoid settling too quickly on the first plausible answer.

That’s a specific capability. It isn’t magic. It’s probabilistic inference over a huge training distribution, with stronger reasoning scaffolding than earlier chat models. In practice, though, for text-centered clinical reasoning, it can start to resemble expert differential diagnosis.

OpenAI’s GPT-4o was part of the comparison too, but o1 was the standout. That’s useful for builders. The gap between a general multimodal assistant and a reasoning-tuned model still matters. If your product depends on high-stakes inference from messy records, model choice isn’t cosmetic.

Why the result matters, and where it falls short

Some of the criticism has been fair.

Emergency physician Kristen Panthagani noted that the study compares AI output with internal medicine attendings, not practicing ER attendings. That’s a real limitation. Specialty matters. Triage is its own skill, and emergency care often depends less on landing the final diagnosis than on spotting the dangerous possibilities fast enough to act.

That goes straight to the weakness in a lot of medical AI benchmarks. They reward diagnostic neatness. Clinical work often rewards safe prioritization under time pressure.

A model can score well on the eventual diagnosis and still miss the bedside task that matters most. In the ER, the first question is often whether this patient could deteriorate in the next 15 minutes.

So the study is useful, but bounded. It shows something specific: strong text-based diagnostic reasoning, not the full bedside job.

The authors are fairly restrained about that. They call for prospective trials in real care settings. That’s the right next step, because retrospective chart-based performance leaves the hardest questions unanswered:

  • Does the model change physician decisions in practice?
  • Does it improve outcomes, or just make documentation sound sharper?
  • Does it drive unnecessary testing because clinicians over-trust wide AI differentials?
  • How often does it produce a confident, dangerous miss?
  • Who is accountable when the tool is wrong and the clinician follows it?

None of that is settled.

This looks like a workflow story

If you build AI products, the obvious misread is to treat this as a model-versus-human contest. The more useful framing is workflow.

Hospitals already have a text firehose: intake notes, medication lists, prior encounters, nurse observations, problem lists, consults, lab summaries. Humans are not especially good at integrating all of that consistently under pressure. LLMs are. They can take sprawling text and produce ranked hypotheses fast. That’s where the near-term value is.

A plausible deployment path looks like this:

  1. pull the relevant chart context at triage time
  2. generate a ranked differential with confidence language and cited evidence from the record
  3. flag immediately dangerous possibilities even when they’re low probability
  4. surface the missing data that would reduce uncertainty fastest
  5. log every recommendation for audit and later review

That’s much less glamorous than “AI diagnoses patients.” It’s also far more likely to survive contact with clinical reality.
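
To make the shape concrete, here is a minimal Python sketch of those five steps. Everything in it is assumed for illustration: fetch_chart_context and call_llm are stand-ins for the EMR integration and whatever reasoning model a deployment uses, and the JSON contract between prompt and parser is invented for the example.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DifferentialItem:
    diagnosis: str
    confidence: str        # hedged language: "likely", "possible", "unlikely but dangerous"
    evidence: list[str]    # pointers back into the record (step 2's cited evidence)
    dangerous: bool        # must-not-miss flag, even at low probability (step 3)

@dataclass
class TriageAssessment:
    patient_id: str
    differential: list[DifferentialItem]
    missing_data: list[str]   # what would reduce uncertainty fastest (step 4)
    model_version: str
    created_at: str

PROMPT = """You are assisting ER triage. Using ONLY the chart text below, return JSON
with keys "differential" (ranked list of objects with diagnosis, confidence,
evidence, dangerous) and "missing_data". Flag dangerous possibilities even when
they are low probability.

CHART:
{chart}
"""

def fetch_chart_context(patient_id: str) -> str:
    """Stand-in for the EMR pull at triage time (step 1)."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Stand-in for whatever reasoning model the deployment actually uses."""
    raise NotImplementedError

def assess_at_triage(patient_id: str, model_version: str) -> TriageAssessment:
    chart = fetch_chart_context(patient_id)                      # step 1
    parsed = json.loads(call_llm(PROMPT.format(chart=chart)))    # step 2
    differential = [DifferentialItem(**d) for d in parsed["differential"]]
    differential.sort(key=lambda d: not d.dangerous)             # step 3: dangerous first
    assessment = TriageAssessment(
        patient_id=patient_id,
        differential=differential,
        missing_data=parsed["missing_data"],                     # step 4
        model_version=model_version,
        created_at=datetime.now(timezone.utc).isoformat(),
    )
    with open("triage_audit.jsonl", "a") as f:                   # step 5: append-only log
        f.write(json.dumps(asdict(assessment)) + "\n")
    return assessment
```

The point of the shape is that the model proposes and logs; nothing in it acts on the patient.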

It fits the current limits too. The study notes that existing foundation models are still weaker on non-text inputs. That matters because a lot of emergency medicine lives outside text. ECG patterns, radiology images, pupil response, skin color, gait, work of breathing. Multimodal systems may narrow that gap, but the strongest evidence today is still centered on language-heavy tasks.

The hard engineering starts after the benchmark

Scoring well on retrospective cases is one thing. Shipping a safe clinical system is another.

The first problem is distribution shift. A model that works well in one institution’s EMR can degrade badly somewhere else because note styles, abbreviations, order sets, patient populations, and documentation habits vary all over the place. Anyone who has shipped healthcare software has run into this. “Same specialty” tells you very little about data consistency.
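
One partial mitigation is to treat every new site as a new distribution: re-run the model on a locally curated, labeled case set before go-live and gate deployment on the result. A minimal sketch, assuming a hospital can assemble such a set; run_model, score_match, and the 0.55 floor are all placeholders, not prescriptions.

```python
# Hypothetical pre-deployment gate: refuse go-live if local accuracy drops
# below a site-chosen floor, because published numbers came from a different
# EMR, different note styles, and a different patient population.
def validate_on_local_cases(local_cases, run_model, score_match, floor=0.55):
    """local_cases: [(chart_text, reference_diagnosis), ...] from THIS site's EMR."""
    hits = sum(score_match(run_model(chart), ref) for chart, ref in local_cases)
    accuracy = hits / len(local_cases)
    if accuracy < floor:
        raise RuntimeError(
            f"local accuracy {accuracy:.0%} is below the {floor:.0%} floor; "
            "note styles, abbreviations, and populations differ from the study site"
        )
    return accuracy
```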

Then there’s auditability. Rodman, one of the study’s lead authors, told The Guardian there’s no formal accountability framework for AI diagnosis right now. That’s a polite way to put it. In enterprise software, some opacity is often tolerable if the output is useful. In clinical settings, opaque reasoning turns into a governance problem quickly. Hospitals need traceability, escalation rules, version control, incident review, and a way to freeze or roll back models when behavior shifts.
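
The traceability piece, at least, is not exotic to build. Here is a hedged sketch with invented names throughout (AuditRecord, APPROVED_MODELS): every recommendation is tied to the exact model version and a hash of the exact prompt, and versions not on a governance-controlled allowlist are refused before they ever reach a clinician, which is also where a freeze or rollback would bite.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    case_id: str
    model_id: str            # exact model + version actually called
    prompt_sha256: str       # hash of the full prompt, so inputs are traceable
    output: str
    clinician_action: str    # "accepted", "modified", "rejected" — filled in at review
    timestamp: str

APPROVED_MODELS = {"triage-assist-2026-03"}   # governance-controlled allowlist

def record_recommendation(case_id: str, model_id: str, prompt: str, output: str) -> AuditRecord:
    if model_id not in APPROVED_MODELS:
        # Rollback/freeze enforcement: unapproved versions never reach clinicians.
        raise RuntimeError(f"model {model_id} is not on the approved list")
    rec = AuditRecord(
        case_id=case_id,
        model_id=model_id,
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
        output=output,
        clinician_action="pending",
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    with open("recommendations_audit.jsonl", "a") as f:   # append-only in practice
        f.write(json.dumps(asdict(rec)) + "\n")
    return rec
```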

Latency and integration matter too. A triage assistant that takes 45 seconds, needs manual copy-paste, and doesn’t fit cleanly into the EHR is dead on arrival. The practical question is whether it can fit into a chaotic workflow without slowing staff down or increasing legal exposure.

And security and privacy remain ugly. If this runs on live EMR data, you need a serious answer for PHI handling, access controls, retention, logging, vendor boundaries, and incident response. Healthcare buyers have gotten less gullible about LLMs. Not nearly enough, but less.

What model builders should take from it

A few things stand out.

First, reasoning-focused models keep pulling ahead in domains where the input is messy text and the job is synthesizing possibilities. General chat benchmarks tell you less than they used to.

Second, healthcare AI is moving past summarization and into judgment support. Summarizing charts was the easy entry point. Differential diagnosis, triage prioritization, and next-step suggestions are harder, but that’s where the money goes if the tools actually hold up.

Third, evaluation is now the bottleneck. The industry has plenty of model output. What it lacks is credible, specialty-aware, clinically meaningful evaluation. This study is interesting partly because it used blinded physician review on real cases. More teams will copy that setup. Many weaker products will avoid it.
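
The blinding step itself is cheap to replicate. A sketch under assumed data shapes: each case’s answers are shuffled and relabeled with opaque tokens, and the token-to-source key stays with the study team until scoring is finished.

```python
import random
import uuid

def blind_answers(case_id: str, answers: dict[str, str]):
    """answers maps source -> answer text, e.g. {"o1": "...", "physician_A": "..."}.
    Returns (items safe to show reviewers, private key to unblind after scoring)."""
    items, key = [], {}
    entries = list(answers.items())
    random.shuffle(entries)                    # order must not leak the source
    for source, text in entries:
        token = uuid.uuid4().hex[:8]           # opaque per-answer label
        items.append({"case_id": case_id, "answer_id": token, "text": text})
        key[token] = source                    # held back until scoring ends
    return items, key
```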

That’s probably healthy. A lot of “doctor copilot” startups have survived on demos that fall apart under proper testing.

The takeaway

The cleanest takeaway is the least dramatic: on text-based diagnostic reasoning from real hospital records, a frontier LLM outperformed experienced physicians in a controlled study.

That should change how technical teams think about medical AI. These systems are no longer limited to polishing notes and speeding up documentation. In some narrow but important settings, they’re becoming better inference tools than the humans currently doing the first pass.

The caution still matters. Better diagnosis scores do not equal safe emergency care. Prospective trials, specialty-matched comparisons, multimodal testing, and accountability rules all still need work.

But for teams building clinical software, this paper moves the bar. Treating LLMs as a convenience feature now looks dated.
