Generative AI June 23, 2026

Why AI coding agents are starting to depend on loops

AI coding agents are starting to run in loops, and the bill won’t be subtle

At Meta’s @Scale conference on Friday, Claude Code creator Boris Cherny was asked whether “loops” are the next AI hype cycle or something real.

His answer was blunt: “Yes, they’re for real.”

Cherny described a shift that should get the attention of teams already using coding agents in production workflows. Two years ago, developers wrote most source code directly. Then agents started writing code from human prompts. Now, he said, agents are beginning to prompt other agents that write code, review structure, find duplicated abstractions, and submit pull requests without waiting for a human to start each task.

That sounds like a small workflow change. It’s larger than that.

Software has always had loops. The difference here is the stop condition. It may no longer be a clean Boolean expression like while tests_fail. It may be another model deciding whether progress has been made, whether the codebase can be improved, or whether the task should continue. That’s where the pattern becomes useful, expensive, and occasionally risky.

What Cherny described

Cherny gave a concrete example from his own work. One agent continuously scans for ways to improve code architecture. Another looks for duplicated abstractions that could be unified. They don’t just print suggestions into a chat window. They submit pull requests like other contributors.

Because the codebase keeps changing, those agents don’t really finish. There’s always another refactor candidate, another naming inconsistency, another abstraction that might be pulled up or deleted.

For senior engineers, that framing matters. A looped coding agent is less a chat assistant than a background maintenance process with commit rights, tool access, and a judgment engine attached.

A typical setup might include:

  • A planning agent that identifies candidate tasks
  • A coding agent that edits files
  • A reviewer agent that checks diffs, tests, style, and scope
  • A summarizer that tracks state across iterations
  • A stop condition, which may be deterministic, model-driven, budget-driven, or human-gated

That last point is where most of the engineering risk sits. Classic loops stop because a condition is met. Agentic loops often stop because another agent says the work is good enough, because a token budget runs out, or because a human finally reviews the queue.

That’s a softer contract.
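The roles listed above can be sketched as one loop that records why it stopped. This is an illustration under assumptions, not how Claude Code or any specific tool is built: `run_agent_loop`, `StopReason`, and the stub agents are hypothetical, and each agent is a plain function where a real system would call a model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class StopReason(Enum):
    NO_MORE_TASKS = "planner found nothing to do"
    BUDGET_EXHAUSTED = "iteration budget ran out"
    NEEDS_HUMAN = "reviewer rejected; queued for a human"


@dataclass
class LoopState:
    summary: List[str] = field(default_factory=list)  # summarizer's running notes
    merged: List[str] = field(default_factory=list)   # tasks the reviewer accepted


def run_agent_loop(plan: Callable[[LoopState], Optional[str]],
                   code: Callable[[str], str],
                   review: Callable[[str], bool],
                   max_iterations: int) -> Tuple[LoopState, StopReason]:
    state = LoopState()
    for _ in range(max_iterations):
        task = plan(state)                  # planning agent picks a candidate
        if task is None:
            return state, StopReason.NO_MORE_TASKS
        diff = code(task)                   # coding agent produces a change
        if not review(diff):                # reviewer agent checks the change
            state.summary.append(f"rejected: {task}")
            return state, StopReason.NEEDS_HUMAN
        state.merged.append(task)
        state.summary.append(f"accepted: {task}")  # summarizer tracks state
    return state, StopReason.BUDGET_EXHAUSTED


# Stub agents for illustration only.
backlog = ["dedupe helpers", "rename util", "split big module"]


def plan(state: LoopState) -> Optional[str]:
    remaining = [t for t in backlog if t not in state.merged]
    return remaining[0] if remaining else None


def code(task: str) -> str:
    return f"diff for {task}"


def review(diff: str) -> bool:
    return "split" not in diff  # pretend the reviewer rejects large refactors


state, reason = run_agent_loop(plan, code, review, max_iterations=10)
print(state.merged, reason.name)
```

Returning the stop reason alongside the state makes it visible whether a run ended because the work was finished, because money ran out, or because someone needs to look at the queue.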

Ralph Loop and test-time compute

One popular pattern is the so-called Ralph Loop, named after the Simpsons character Ralph Wiggum. The idea is almost comically simple: ask the model what it has done, ask whether it has accomplished the goal, then feed that assessment back into the next step.

It’s crude, but it addresses a real problem. Long-running model workflows drift. They forget constraints, chase irrelevant subtasks, or declare victory too early. A loop that forces periodic self-assessment can keep the agent pointed roughly toward the target.
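Here is a rough sketch of that self-assessment loop as described in this article. It is not a canonical implementation: `self_assessing_loop`, the prompt wording, and the DONE / NOT DONE convention are all assumptions, and `model` is any function that takes a prompt and returns text.

```python
from typing import Callable, List


def self_assessing_loop(goal: str,
                        model: Callable[[str], str],
                        max_rounds: int = 5) -> List[str]:
    """Return the list of self-assessments, one per round."""
    transcript: List[str] = []
    last_assessment = "nothing done yet"
    for _ in range(max_rounds):
        prompt = (
            f"Goal: {goal}\n"
            f"Your last self-assessment: {last_assessment}\n"
            "Do the next step, then state what you have done and "
            "whether the goal is met. Start with DONE or NOT DONE."
        )
        last_assessment = model(prompt)   # the assessment feeds the next round
        transcript.append(last_assessment)
        if last_assessment.startswith("DONE"):
            break
    return transcript


# Canned replies stand in for a real model.
replies = iter(["NOT DONE: wrote parser", "NOT DONE: tests failing", "DONE: tests pass"])
transcript = self_assessing_loop("add CSV parser", lambda prompt: next(replies))
print(transcript)
```

Note that the model grades its own work here, so `max_rounds` is the only hard limit in the sketch.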

There’s a broader technical theme here: test-time compute.

OpenAI researcher Noam Brown recently argued that modern models can solve many more problems if you spend more compute at inference time. Instead of relying on a single forward pass, you generate multiple attempts, critique them, retry, search, verify, and keep spending until the answer improves.

Coding is unusually well-suited to this pattern because it has external signals. Tests pass or fail. Type checkers complain. Linters catch obvious mistakes. Benchmarks move. Static analysis tools produce warnings. Compared with many knowledge-work tasks, code gives agents something firmer than vibes.

That makes agentic loops attractive for engineering work. A model can propose a change, run pytest, inspect the failure, patch the code, run tests again, and repeat. Add a second model as reviewer and a third as planner, and the setup starts to resemble a junior developer with infinite patience and poor cost awareness.
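The propose, test, patch cycle might look like the sketch below. `fix_until_green` and `propose_patch` are hypothetical names. `run_pytest` shells out to pytest through `python -m pytest` and treats exit code 0 as success, and the test runner is passed in as a parameter so the loop can be exercised with a fake.

```python
import subprocess
import sys
from typing import Callable, Tuple

RunResult = Tuple[bool, str]  # (passed, combined output)


def run_pytest(repo_dir: str) -> RunResult:
    """Run pytest in `repo_dir`; exit code 0 means every test passed."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def fix_until_green(propose_patch: Callable[[str], None],
                    run_tests: Callable[[], RunResult],
                    max_attempts: int = 5) -> Tuple[bool, int]:
    """Alternate test runs and patches; return (passed, patches_applied)."""
    for attempt in range(max_attempts):
        passed, output = run_tests()
        if passed:
            return True, attempt
        propose_patch(output)  # hypothetical: a model reads the failure and edits files
    passed, _ = run_tests()    # check the result of the final patch
    return passed, max_attempts


# Fake runner and patcher so the loop can be tried without a repository.
fake_repo = {"bugs": 2}


def fake_run() -> RunResult:
    return fake_repo["bugs"] == 0, f"{fake_repo['bugs']} failing"


def fake_patch(_output: str) -> None:
    fake_repo["bugs"] -= 1


result = fix_until_green(fake_patch, fake_run)
print(result)
```

Against a real checkout, the runner would be something like `lambda: run_pytest("path/to/repo")`, with the path being a placeholder.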

The upside is real. So is the downside.

Why codebases are natural targets

A lot of software maintenance is incremental and never-ending. That makes it perfect bait for AI loops.

Large codebases accumulate duplicate helpers, stale comments, inconsistent patterns, dead branches, flaky tests, brittle abstractions, and dependency drift. Humans usually notice these problems while doing other work, then decide whether fixing them is worth the interruption. Often, it isn’t.

A background agent doesn’t have that friction. It can scan the repository all day, generate tiny PRs, and attach rationale. In theory, that means fewer cleanup weeks and less architectural entropy.

Some tasks are obvious candidates:

  • Deduplicating utility functions
  • Updating deprecated API usage
  • Adding missing tests around touched code
  • Improving type coverage
  • Finding unused exports or dead files
  • Suggesting smaller modules where files have grown too large
  • Checking dependency upgrades against test suites
  • Running security scans and proposing remediations

The hard part is scope control. “Improve the architecture” sounds productive until an agent rewrites half the service boundary because it spotted a pattern. Senior engineers already know refactoring is risky because behavior hides in weird corners, and because timing and ownership matter as much as the code. An AI agent doesn’t automatically understand those social and operational constraints.

A human maintainer may know that a messy module shouldn’t be touched before a product launch. A model sees duplication.

That gap matters.

Continuous agents need explicit guardrails

Teams experimenting with looped agents should treat them like automation with production-adjacent privileges. Automation, not interns. Not magic pair programmers.

The guardrails need to be boring and explicit:

  • Run agents in isolated branches or sandboxes
  • Require human review before merge
  • Put hard caps on token spend and wall-clock runtime
  • Restrict filesystem, network, and secret access
  • Use deterministic checks wherever possible
  • Log prompts, tool calls, diffs, test results, and approvals
  • Track agent-authored changes separately in engineering metrics
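A hard cap on token spend and wall-clock runtime can be as small as the sketch below. `Budget` and `BudgetExceeded` are hypothetical names, and the token counts are whatever the calling code reports after each model call.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent run goes past its token or runtime cap."""


@dataclass
class Budget:
    max_tokens: int
    max_seconds: float
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        """Record token usage and raise once either cap is exceeded."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token cap hit: {self.tokens_used}/{self.max_tokens}")
        elapsed = time.monotonic() - self.started_at
        if elapsed > self.max_seconds:
            raise BudgetExceeded(
                f"runtime cap hit: {elapsed:.1f}s/{self.max_seconds}s")


budget = Budget(max_tokens=1000, max_seconds=60)
budget.charge(400)
budget.charge(400)
try:
    budget.charge(400)  # third call pushes usage to 1200, past the cap
    stopped = False
except BudgetExceeded as exc:
    stopped = True
    print(exc)
```

Raising an exception rather than returning a flag means a loop that forgets to check the budget still stops.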

The security angle is easy to underplay. A coding agent with repository access, package manager access, and CI credentials can do damage even without malicious intent. Prompt injection through issues, comments, docs, or dependency metadata is still a live problem. If an agent reads untrusted text and can execute tools, that text becomes part of the control surface.

There’s also supply chain risk. An agent tasked with fixing a failing build might upgrade a package, add a transitive dependency, or copy a snippet from somewhere it shouldn’t. Good CI catches some of this. It won’t catch everything.

The best looped systems will combine model judgment with hard gates: tests, policy checks, dependency allowlists, static analysis, secret scanning, and human ownership. If another agent is the only thing deciding whether the work continues, expect strange outcomes.
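One way to combine the two, sketched under assumptions: deterministic gates run first and the model's opinion is consulted only if all of them pass. `may_proceed`, `dependency_gate`, and the package names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def dependency_gate(requested: Iterable[str], allowlist: Set[str]) -> List[str]:
    """Return the packages that are NOT on the allowlist (empty means pass)."""
    return sorted(set(requested) - allowlist)


def may_proceed(hard_gates: Dict[str, Callable[[], bool]],
                model_approves: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Hard gates are evaluated first and cannot be overridden by the model."""
    failed = [name for name, check in hard_gates.items() if not check()]
    if failed:
        return False, failed
    if not model_approves():
        return False, ["model_review"]
    return True, []


allowlist = {"requests", "pydantic"}
new_deps = ["requests", "leftpadx"]  # "leftpadx" is a made-up package name

gates = {
    "tests": lambda: True,  # stand-in for a real test run
    "dependency_allowlist": lambda: not dependency_gate(new_deps, allowlist),
}

ok, reasons = may_proceed(gates, model_approves=lambda: True)
print(ok, reasons)
```

The model approves in this example, and the change is still blocked because a dependency is off the allowlist.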

The cost model gets ugly fast

Loops consume tokens aggressively. Multi-agent loops burn through them even faster.

A single coding request might include repository context, task instructions, tool outputs, diffs, test failures, and multiple rounds of reasoning. Add reviewer agents and summarizers, then run the process continuously, and the cost curve stops looking like chat usage. It starts looking like compute infrastructure.
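A back-of-the-envelope comparison shows how quickly the curve bends. Every number below is a placeholder chosen for round arithmetic, not a real price or a measured workload, and `daily_cost_usd` is a hypothetical helper.

```python
def daily_cost_usd(tokens_per_call: int,
                   calls_per_iteration: int,
                   iterations_per_day: int,
                   usd_per_million_tokens: float) -> float:
    """Estimate daily spend as total tokens times a flat per-token price."""
    tokens = tokens_per_call * calls_per_iteration * iterations_per_day
    return tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: one developer chatting vs. a three-agent loop running all day.
chat = daily_cost_usd(tokens_per_call=5_000, calls_per_iteration=1,
                      iterations_per_day=40, usd_per_million_tokens=10.0)
loop = daily_cost_usd(tokens_per_call=50_000, calls_per_iteration=3,
                      iterations_per_day=500, usd_per_million_tokens=10.0)
print(chat, loop)
```

The flat price ignores differences between input and output tokens and any caching, so treat it as a shape, not a forecast.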

That’s convenient for model providers. Anthropic, OpenAI, Google, and others sell usage. Persistent agents that keep reading, writing, checking, and retrying are a good business if customers accept the value story.

For engineering leaders, the question is less romantic: does the loop produce changes worth the spend and review burden?

A PR isn’t free because an AI wrote it. Someone still has to review it. CI has to run. Merge conflicts happen. Bugs from overconfident refactors can still reach production if review gets lazy. If agents generate ten mediocre PRs a day, they’ve shifted work from writing code to rejecting noise.

The useful metric won’t be “lines of code generated.” That number was always garbage. Better signals include:

  • Accepted PR rate
  • Defect rate in agent-authored changes
  • Human review time per merged change
  • CI minutes consumed
  • Rollbacks or incidents tied to agent commits
  • Cost per accepted change
  • Time saved on specific maintenance categories
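Several of these signals fall out of a simple record per agent-authored PR. The sketch below is one possible shape: `AgentPR`, `summarize`, and the sample figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentPR:
    merged: bool
    cost_usd: float          # model spend attributed to this PR
    review_minutes: float    # human time spent reviewing it
    caused_incident: bool = False


def summarize(prs: List[AgentPR]) -> dict:
    merged = [p for p in prs if p.merged]
    total_cost = sum(p.cost_usd for p in prs)      # rejected PRs still cost money
    total_review = sum(p.review_minutes for p in prs)
    n = len(merged)
    return {
        "accepted_pr_rate": n / len(prs) if prs else 0.0,
        "cost_per_accepted_change": total_cost / n if n else None,
        "review_minutes_per_merged_change": total_review / n if n else None,
        "incident_rate": sum(p.caused_incident for p in merged) / n if n else None,
    }


# Made-up sample data.
sample = [
    AgentPR(merged=True, cost_usd=2.0, review_minutes=10),
    AgentPR(merged=False, cost_usd=3.0, review_minutes=5),
    AgentPR(merged=True, cost_usd=1.0, review_minutes=15, caused_incident=True),
    AgentPR(merged=False, cost_usd=2.0, review_minutes=10),
]
report = summarize(sample)
print(report)
```

Dividing total spend and total review time by merged PRs only is deliberate: the cost of rejected PRs lands on the ones that shipped.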

Looped agents may pay for themselves on repetitive modernization work. They may be a money pit on ambiguous architecture tasks.

Non-determinism cuts both ways

Traditional automation is brittle because it follows explicit rules. AI agents are useful because they can work with incomplete instructions and messy context. Loops amplify both properties.

A deterministic script can rename an API across a monorepo if the pattern is clear. An agent can notice that the API shape itself is wrong, propose a cleaner abstraction, update tests, and explain the change. That’s a meaningful jump in capability.

But non-deterministic loops can wander. They may optimize for passing tests while subtly reducing readability. They may “unify” abstractions that were intentionally separate. They may churn code in ways that make git blame less useful. They may open PRs that look plausible but encode a misunderstanding of product behavior.

The hardest cases are the ones where tests pass.

That’s why the best near-term use cases are bounded. Give agents narrow goals with strong validation. Let them fix type errors, migrate known patterns, add missing test coverage around defined modules, or propose dependency updates. Be careful with open-ended architectural improvement unless the review culture is strong and the repository has excellent tests.

What technical teams should do now

Developers don’t need to reorganize their entire workflow around looped agents yet. But they should prepare for them, because the pattern is likely to show up in coding tools quickly.

A few practical steps make sense:

  1. Improve test quality. Agent loops are only as good as their feedback signals. Weak tests give false confidence.

  2. Make contribution boundaries explicit. Ownership files, module docs, coding standards, and architectural decision records help agents stay inside the lines.

  3. Instrument AI work. Track model-generated diffs, review outcomes, and incident rates. Don’t rely on anecdotes.

  4. Start with low-risk loops. Documentation freshness, lint cleanup, test generation, dependency PRs, and dead-code detection are better starting points than core service refactors.

  5. Set budgets. Token caps, CI caps, and daily PR limits should be default settings, not afterthoughts.

Cherny’s point is worth taking seriously because Claude Code is already one of the more credible examples of agentic development tooling. Continuous agent loops are a logical next step for that category.

They’re also a tax on sloppy engineering systems. Repositories without tests, clear ownership, or review discipline won’t magically get safer when autonomous agents start filing patches around the clock. They’ll just fail faster and with better-written commit messages.

The near-term winner probably won’t be the team that lets agents run wild. It’ll be the team that treats them as persistent, metered automation with narrow authority and excellent observability. That version is less exciting. It’s also the one that might survive contact with a real codebase.
