What did METR find about developers and AI coding tools?

METR found that many developers were reluctant to participate in a follow-up study if they had to work without AI coding tools, showing how deeply these tools have entered daily development.

Did METR prove that AI makes developers more productive?

No. METR’s earlier study found that developers believed AI made them faster, but measured task times suggested the opposite in that setting.

Why is measuring AI coding productivity difficult?

AI can make coding look faster at first, but the real costs may show up later in debugging, code review, security issues, incidents, and long-term maintenance.

Generative AI, May 30, 2026

METR study finds developers increasingly unwilling to code without AI

Developers won’t give up AI coding tools, even when the productivity math gets messy

Many developers now refuse to work without AI coding tools.

That was the awkward finding from METR, the AI research lab that tried to rerun a developer productivity study in early 2026. The original study compared how long open source developers took to complete tasks with and without AI help. When METR came back for a follow-up, the design hit a practical wall: developers didn’t want to join if they had to turn AI off, even for a limited set of tasks.

That’s a strong signal about adoption. It also makes productivity much harder to measure.

Developers clearly like Copilot-style assistants, agentic IDEs, and coding agents. The harder question is whether teams are getting durable engineering output from them, or just generating code faster and pushing the cost into review, debugging, incident response, and maintenance.

The evidence is mixed. Some of it looks bad.

Measurement is getting harder

METR’s earlier research was uncomfortable for the pro-AI productivity story. Developers in the study believed AI made them faster. The timing data suggested the opposite. They generated code quickly, then spent extra time correcting mistakes, steering the model, waiting for responses, and untangling output that looked plausible but didn’t quite work.

Anyone who has used an AI assistant on production code will recognize the pattern. The first answer arrives fast. The second and third turns are where the cost shows up.

A model can scaffold a test, write a parser, produce a migration, or sketch a React component in seconds. It can also invent APIs, miss edge cases, ignore domain constraints, or thread a subtle security bug through code that looks ordinary. The cost often doesn’t show up during the first PR review.

METR wanted to know whether newer models and more experienced users changed the result. Fair question. Developers are better at prompting than they were a year ago. Models are better at tool use, repository search, test generation, and multi-file edits. IDE integrations have improved. Agents can run tests, inspect failures, and patch their own output.

But if developers won’t work without AI, the clean control group disappears. Researchers are left with surveys, telemetry, and indirect signals. Those can be useful, but they’re weaker.

METR’s May 2026 survey asked technical employees to self-report their AI productivity gains. Respondents said AI made them roughly twice as valuable to their organizations. Some may be right. Still, self-assessment has always been a poor instrument for measuring software productivity. Developers routinely underestimate review load, coordination costs, and downstream maintenance. AI makes that blind spot worse because the speed is visible and the damage is delayed.

Token counts are a bad proxy

The token-count productivity fad was one of the more predictable mistakes of the current AI cycle.

The idea, often called tokenmaxxing, treats token usage as a signal of productive AI work. More prompts, more completions, more model activity, more productivity. It’s easy to measure, which is exactly why it’s dangerous.

According to the Financial Times, Amazon shut down an internal token-tracking leaderboard called Kirorank after employees gamed it by using AI agents heavily and running up costs. That should surprise nobody who has seen a metric become a target. Developers optimized for the leaderboard rather than the work.

Tokens measure consumption. They don’t measure correctness, maintainability, latency impact, security posture, or whether the generated code solved the right problem. A developer can burn a pile of tokens asking an agent to patch a failing test over and over when a human could have fixed it in five minutes by reading the error message.

There’s also a hard cost issue. Agentic coding loops are expensive because they multiply model calls:

read files
summarize context
propose a patch
run tests
inspect failures
revise the patch
repeat

That loop can be valuable on bounded tasks with good tests. It can also turn into a slot machine for engineering budgets. The model keeps trying, the logs look busy, and the result may still need a senior engineer to step in.
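
A rough way to see why the loop gets expensive is to model each attempt as several model calls that each resend a growing context. The sketch below is illustrative only: the function name, the step counts, the token sizes, and the prices are all hypothetical assumptions, not figures from METR, Amazon, or any vendor.

```python
from typing import NamedTuple


class LoopCost(NamedTuple):
    input_tokens: int
    output_tokens: int
    usd: float


def agent_loop_cost(
    attempts: int,
    calls_per_attempt: int = 4,          # assumed model calls per pass through the loop
    base_context_tokens: int = 8_000,    # assumed starting context (files, instructions)
    growth_per_call_tokens: int = 1_500, # assumed context growth (diffs, test logs) per call
    output_tokens_per_call: int = 400,   # assumed completion size per call
    usd_per_1k_input: float = 0.003,     # hypothetical price, not a real vendor rate
    usd_per_1k_output: float = 0.015,    # hypothetical price, not a real vendor rate
) -> LoopCost:
    """Estimate tokens and spend when every call resends the accumulated context."""
    total_in = 0
    total_out = 0
    context = base_context_tokens
    for _ in range(attempts * calls_per_attempt):
        total_in += context                # the whole context is billed again on each call
        total_out += output_tokens_per_call
        context += growth_per_call_tokens  # history keeps growing between calls
    usd = total_in / 1000 * usd_per_1k_input + total_out / 1000 * usd_per_1k_output
    return LoopCost(total_in, total_out, usd)


if __name__ == "__main__":
    for attempts in (1, 5, 20):
        result = agent_loop_cost(attempts)
        print(f"{attempts:>2} attempts: {result.input_tokens:>9,} input tokens, ${result.usd:,.2f}")
```

Under these assumptions, cost grows faster than the number of attempts because later calls carry more history. That is the reason a per-task cap on attempts or spend is a sensible control.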

For leaders, the lesson is blunt: AI adoption metrics are not productivity metrics. Active users, accepted completions, and tokens spent tell you whether people are using the tools. They don’t tell you whether software delivery is improving.

Generated code can add maintenance debt

The strongest criticism of AI coding tools isn’t that they always produce bad code. They don’t. Better models can generate competent boilerplate, tests, glue code, scripts, SQL, and UI fragments. In the right context, they save real time.

The problem is variance.

AI-generated code often arrives with the shape of correctness. That makes it easy to accept. Long-lived software is less forgiving. A slightly wrong abstraction can spread. A missing validation check can turn into an incident. A generated helper with unclear ownership can sit untouched until nobody remembers why it exists.

Programmer and author James Shore put the maintenance problem neatly in a viral post: “You write code twice as quick now? Better hope you’ve halved your maintenance costs. Otherwise, you’re screwed. You’re trading a temporary speed increase for permanent indenture.”

That line lands because it matches how many teams actually work. Software teams spend most of their time changing existing systems, not creating fresh files in empty repos. If AI helps produce twice as much code but increases the effort needed to understand, test, and modify that code, the team loses over time.
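
Shore's arithmetic can be checked with a toy model. The sketch below assumes a fixed weekly capacity, a constant weekly maintenance cost per unit of code, and that all remaining time goes to writing new code. Those assumptions, the function name, and the numbers are illustrative and do not come from Shore's post or any study.

```python
from typing import Optional


def weeks_until_maintenance_share(
    units_per_hour: float,              # how fast new code gets written
    maint_hours_per_unit_week: float,   # weekly upkeep each unit of code demands
    capacity_hours: float = 40.0,       # assumed fixed team capacity per week
    threshold: float = 0.5,             # share of capacity eaten by maintenance
    max_weeks: int = 10_000,
) -> Optional[int]:
    """Return the first week in which maintenance takes at least `threshold` of capacity."""
    code_units = 0.0
    for week in range(1, max_weeks + 1):
        maintenance = min(capacity_hours, code_units * maint_hours_per_unit_week)
        if maintenance / capacity_hours >= threshold:
            return week
        # Whatever time is left after maintenance goes into writing new code.
        code_units += (capacity_hours - maintenance) * units_per_hour
    return None


if __name__ == "__main__":
    baseline = weeks_until_maintenance_share(1.0, 0.01)
    twice_as_fast = weeks_until_maintenance_share(2.0, 0.01)
    fast_and_half_cost = weeks_until_maintenance_share(2.0, 0.005)
    print("baseline:", baseline)
    print("2x speed, same maintenance cost:", twice_as_fast)
    print("2x speed, half maintenance cost:", fast_and_half_cost)
```

In this model the maintenance share depends only on the product of writing speed and per-unit maintenance cost. Doubling speed without halving upkeep brings the crunch forward, while doubling speed and halving upkeep leaves the timeline where it was.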

Some recent claims put numbers on the problem. Aiswarya Sankar, founder and CEO of reliability engineering agent startup Entelligence AI, said companies are spending 44% of their tokens fixing bugs generated by AI. CodeRabbit, which sells code review tooling, said its analysis of open source pull requests found that AI-generated code contained 1.7x more problems than human-written code.

Those figures deserve skepticism. Both companies sell into the pain they’re describing. Still, the direction matches what many engineers are seeing: AI output needs serious review, especially when it touches state, concurrency, permissions, data migrations, distributed systems, or security-sensitive paths.

Independent work points the same way. Researchers from Singapore Management University published a report in April warning that AI-generated code can introduce long-term maintenance costs into real software projects. That matters because maintainability is where demos collapse. A benchmark task can end when tests pass. A production service has to survive refactors, on-call rotations, dependency upgrades, audits, and the next engineer trying to understand it under pressure.

Agents help, within limits

One obvious answer is to use AI to clean up after AI. Coding agents can write code, run tests, inspect failures, generate reviews, patch bugs, and repeat the loop. Cognition’s Devin is the most visible example of that direction.

Cognition founder and CEO Scott Wu has argued that agents should take on the draining follow-up work around code generation and fixes. That’s a plausible product direction. It’s also bounded by current capability.

Wu has said Devin’s skill level sits somewhere between a junior and mid-level programmer, depending on the task. That’s useful. It’s not a reason to hand an agent broad authority over production systems.

A junior-to-mid-level developer can be productive with the right boundaries. The same rule applies to agents. Give them well-scoped issues, strong tests, clear acceptance criteria, and low-risk parts of the codebase. Don’t ask them to redesign auth flows, invent data consistency models, or make architecture calls across a tangled monolith and three services.

The risk is that organizations treat agentic coding as labor replacement before they have the engineering controls to support it. Agents need guardrails:

deterministic test suites that cover important behavior
static analysis and dependency scanning
secure coding checks
code owners for sensitive paths
review policies that distinguish generated scaffolding from core logic
observability that can catch regressions after merge
cost controls on agent loops and model usage

Without that, teams can scale output faster than they scale judgment. That’s a bad trade.

The best uses are still narrow

AI coding tools work best when the success criteria are clear and the blast radius is small.

They’re good at generating repetitive code from patterns already present in the repo. They can translate a function from one language to another, draft unit tests, explain unfamiliar code, write throwaway scripts, produce SQL variants, or help explore library APIs. They’re also useful as a rubber duck that can inspect logs, suggest hypotheses, and point to likely files.

They struggle when correctness depends on context that isn’t in the prompt or indexed repository. Business rules buried in Slack threads. Compliance requirements. Old production incidents. Performance constraints that only appear under load. Security assumptions stuck in an architecture review document from two years ago.

Senior engineers know this, but tool adoption can blur the distinction. A generated function that passes local tests may still be wrong for the system. A clean-looking refactor may increase p99 latency by adding a database call in a hot path. A migration script may work in staging and lock a table in production.

“AI wrote it” can’t become a review shortcut. In high-risk areas, generated code deserves more suspicious review because the developer who submits it may not fully understand every line. The reviewer has to reconstruct intent from output, which can be slower than reviewing a colleague’s code when that colleague can explain the trade-offs.

Better measures for engineering leaders

Teams don’t need to ban AI coding tools. That’s unrealistic and probably counterproductive. They do need better instrumentation than vibes and token counts.

Useful measures look closer to software delivery and reliability:

cycle time from issue start to production
PR review time and number of review iterations
defect rate by source, including AI-assisted changes where trackable
escaped bugs and incident frequency
rollback rate
test flakiness and coverage on AI-heavy areas
maintenance churn in generated modules
cloud and model spend per merged change
security findings tied to generated code paths

None of these are perfect. Attribution is messy because developers mix manual work with AI assistance. But imperfect engineering metrics beat self-reported claims that people “feel twice as productive.”
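
As a minimal sketch of how a few of these measures could be computed side by side, the example below summarizes change records split by whether they were tagged as AI-assisted. The `Change` schema, its field names, and the sample data are hypothetical; real attribution would depend on whatever a team's tracker and CI actually record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Change:
    started: datetime        # when work on the issue began
    deployed: datetime       # when the change reached production
    review_iterations: int   # rounds of review before merge
    rolled_back: bool
    ai_assisted: bool        # self-reported or tool-tagged, so attribution is approximate
    model_spend_usd: float


def summarize(changes: list[Change]) -> Optional[dict]:
    """Delivery and reliability measures for one group of changes."""
    if not changes:
        return None
    n = len(changes)
    cycle_hours = [(c.deployed - c.started).total_seconds() / 3600 for c in changes]
    return {
        "count": n,
        "median_cycle_hours": median(cycle_hours),
        "mean_review_iterations": sum(c.review_iterations for c in changes) / n,
        "rollback_rate": sum(c.rolled_back for c in changes) / n,
        "model_spend_per_change_usd": sum(c.model_spend_usd for c in changes) / n,
    }


def summarize_by_source(changes: list[Change]) -> dict:
    """Compare AI-assisted and manual changes on the same measures."""
    return {
        "ai_assisted": summarize([c for c in changes if c.ai_assisted]),
        "manual": summarize([c for c in changes if not c.ai_assisted]),
    }


if __name__ == "__main__":
    t0 = datetime(2026, 1, 5, 9, 0)
    sample = [  # made-up records for illustration only
        Change(t0, t0 + timedelta(hours=10), 1, False, False, 0.0),
        Change(t0, t0 + timedelta(hours=20), 3, True, True, 4.0),
        Change(t0, t0 + timedelta(hours=30), 2, False, True, 2.0),
    ]
    for source, stats in summarize_by_source(sample).items():
        print(source, stats)
```

The point of splitting by source is comparison over time rather than a verdict from a single snapshot. Small samples and fuzzy tagging will make any one number noisy.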

Teams should also classify AI use by risk. Generating a CLI helper is not the same as modifying payment logic. A policy that treats all code equally misses the point. The review burden should depend on the system area, not on whether the code came from a model or a person.
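
One way to express that kind of policy is to map changed file paths to a risk tier and derive review requirements from the highest tier touched. The sketch below is an assumption-laden illustration: the path patterns, tier names, and approval counts are hypothetical and would need to match a real repository layout.

```python
from fnmatch import fnmatch

# Hypothetical path patterns; a real repo would define its own.
RISK_RULES = [
    ("high", ["services/payments/*", "services/auth/*", "migrations/*"]),
    ("low", ["tools/cli/*", "scripts/*", "docs/*"]),
]
DEFAULT_TIER = "medium"
TIER_ORDER = ["low", "medium", "high"]
REQUIRED_APPROVALS = {"low": 1, "medium": 1, "high": 2}  # assumed policy values


def risk_tier(path: str) -> str:
    """Return the first matching tier for a path, or the default tier."""
    for tier, patterns in RISK_RULES:
        if any(fnmatch(path, pattern) for pattern in patterns):
            return tier
    return DEFAULT_TIER


def review_requirements(changed_paths: list[str]) -> dict:
    """Review burden follows the riskiest area touched, not who or what wrote the code."""
    tier = max(
        (risk_tier(p) for p in changed_paths),
        key=TIER_ORDER.index,
        default="low",
    )
    return {
        "tier": tier,
        "approvals": REQUIRED_APPROVALS[tier],
        "code_owner_required": tier == "high",
    }


if __name__ == "__main__":
    print(review_requirements(["tools/cli/main.py"]))
    print(review_requirements(["tools/cli/main.py", "services/payments/api/charge.py"]))
```

The function deliberately takes no "written by AI" flag. A CLI helper and a payment change get different scrutiny because of where they land, which is the distinction the policy is meant to capture.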

The healthiest pattern is boring and effective: let developers use AI, but require the same ownership standards as any other code. If you merge it, you own it. If it breaks production, “the model suggested it” won’t help in the incident review.

AI is becoming part of the developer baseline

The most important part of the METR update may be behavioral rather than statistical. Developers have adapted quickly. Many now see coding without AI as an artificial handicap, like working without search or autocomplete.

That shift matters. Once a tool becomes part of the baseline, measuring its impact gets harder. The counterfactual disappears. Nobody wants to run a sprint without GitHub search just to prove it saves time.

AI coding tools are heading in that direction. They’re already part of daily work for many engineers, even if the productivity case is still messier than the adoption curve suggests.

The useful question for teams is no longer whether developers will use AI. They will. The question is whether the organization has the review practices, tests, cost controls, and engineering judgment to absorb the output without quietly increasing maintenance debt.
