Goldman Sachs Is Piloting Devin, and That Says a Lot About Where AI Coding Tools Are Headed
Goldman Sachs is testing Cognition’s AI coding agent Devin inside the bank, and the way it’s talking about the rollout is unusually direct. CIO Marco Argenti told CNBC the firm plans to deploy hundreds of Devin instances alongside its 12,000 human developers, with room to scale further.
Two parts of that stand out.
Goldman is treating Devin as assigned software labor. Supervised software labor, but still labor.
And this is happening in a place where bad code has real consequences. A bank can live with a flaky internal demo. It can’t live with broken audit trails, mishandled secrets, or hidden risk creeping into trading systems.
So the useful question here is narrower than the usual AI-can-it-code debate. Goldman is testing which kinds of software work can be fenced in tightly enough for an agent to be usable inside a regulated enterprise.
Why Goldman is a useful test case
Lots of companies use AI coding tools now. That alone isn’t interesting. Goldman is, because the constraints are harsh and real.
The firm already had internal developer copilots in 2024. It has the platform engineering budget, the security posture, and the compliance burden to run this kind of test properly. If Devin can hold up there, it has a plausible path into insurance, pharma, defense, and other sectors where governance is part of the job.
Argenti’s description of Devin as a “new employee” is obviously a bit theatrical, but the framing matters. Big companies are starting to position these systems as task-driven agents that can take a ticket, inspect a codebase, write changes, run checks, and hand back a result for review.
That’s a different tool category from chat-based code suggestion. It comes with a different risk profile too.
Devin’s appeal is context
The source material points to Devin v2.1, released in May 2025, with context windows up to 128k tokens. That matters in large enterprise repos, where the hard part is often finding the right place to make a change, understanding adjacent abstractions, and avoiding collateral damage in some neighboring service.
A larger context window doesn’t give the model human-style understanding of software. It does reduce fragmentation. Earlier coding models working with 16k or 32k tokens regularly lost coherence in multi-module projects. They produced code that looked fine on its own and missed how the repository actually fit together.
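To make the budget point concrete, here is a minimal sketch of the kind of check an agent harness might run before loading a repo slice. The ~4-characters-per-token approximation and the reserve size are rough assumptions for illustration; real agents use the model's own tokenizer.

```python
# Rough sketch: check whether a set of repo files fits a model's context
# budget. Tokens are approximated as ~4 characters each, a common rule of
# thumb; a real harness would use the model's actual tokenizer.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(files: dict[str, str], budget: int = 128_000,
                 reserve: int = 16_000) -> bool:
    """True if the files plus a reserve for the prompt and the model's
    own output fit inside the context budget."""
    used = sum(approx_tokens(src) for src in files.values())
    return used + reserve <= budget

# Two medium services (~100k tokens combined) fit a 128k window but
# overflow a 32k one, which is where cross-module coherence breaks down:
repo = {"svc_a.py": "x" * 200_000, "svc_b.py": "y" * 200_000}
print(fits_context(repo, budget=128_000))  # True
print(fits_context(repo, budget=32_000))   # False
```

The interesting failure mode isn't the hard overflow; it's what smaller-window models did when forced to work on fragments of a repo that only makes sense whole.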
In enterprise development, that’s the line between a decent assistant and a cleanup machine.
Goldman reportedly wants Devin handling routine coding tasks such as boilerplate and integration scripts. That’s a sensible starting point. Agentic coding tools usually fare better when the task has:
- clear interfaces
- established repo conventions
- fast validation through tests or static checks
- low ambiguity about business intent
They struggle when the work depends on undocumented assumptions, legacy edge cases, or domain logic that mostly lives in people’s heads.
Banks have plenty of that.
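The criteria above amount to a routing policy. This is an illustrative sketch, not Goldman's actual logic: score a ticket against those properties and send it to an agent only when every one favors automation.

```python
# Illustrative task-routing heuristic (invented for this article, not a
# real bank policy): agents get tasks only when every criterion favors
# automation; anything ambiguous or tribal-knowledge-dependent stays human.
from dataclasses import dataclass

@dataclass
class Task:
    has_clear_interface: bool
    follows_repo_conventions: bool
    has_fast_validation: bool        # tests or static checks give quick feedback
    ambiguous_intent: bool           # business intent unclear or undocumented
    relies_on_tribal_knowledge: bool # domain logic that lives in people's heads

def route(task: Task) -> str:
    if task.ambiguous_intent or task.relies_on_tribal_knowledge:
        return "human"
    if all([task.has_clear_interface,
            task.follows_repo_conventions,
            task.has_fast_validation]):
        return "agent"
    return "human"

boilerplate = Task(True, True, True, False, False)
legacy_edge_case = Task(True, True, False, True, True)
print(route(boilerplate))       # agent
print(route(legacy_edge_case))  # human
```

Note the asymmetry: a single disqualifier sends the task to a human, which matches how conservative these pilots tend to be.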
The part that matters most: controls
The flashy model details aren’t the main thing here. The guardrails are.
According to the source material, Devin’s workflow inside Goldman includes static analysis, type inference, confidence scoring, execution in ephemeral containers, and RBAC-limited access to repositories and secrets. That stack is exactly what you’d expect in a bank. If you’re introducing an autonomous or semi-autonomous coding system into that environment, the control layer has to be stricter than the model.
A plausible pipeline looks like this:
- Devin ingests a bounded slice of the codebase.
- Pre-processing runs tools like `mypy` or `ESLint` to annotate the terrain.
- The agent generates code for a specific task.
- The result is validated against tests, syntax checks, and security scans.
- Execution happens in a sandboxed container with temporary credentials.
- A human reviews the output before anything lands in a protected branch.
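The last two steps of that pipeline can be sketched as a single gate function. The check names and structure below are placeholders, assumed for illustration; a real pipeline would invoke the org's own linters, test runners, and scanners inside the ephemeral container.

```python
# Minimal sketch of the validation-and-review gate described above.
# Check names are placeholders; the point is the AND: every automated
# check must pass AND a human must approve before a diff can land.
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)

def run_gate(diff: str, checks: dict[str, bool], human_approved: bool) -> GateResult:
    """A diff reaches a protected branch only if all checks pass and a
    human reviewer has signed off."""
    failures = [name for name, ok in checks.items() if not ok]
    if not human_approved:
        failures.append("human_review")
    return GateResult(passed=not failures, failures=failures)

result = run_gate(
    diff="...generated change...",
    checks={"syntax": True, "type_check": True,
            "tests": True, "security_scan": True},
    human_approved=False,
)
print(result.passed, result.failures)  # False ['human_review']
```

The design choice worth copying: human approval is modeled as just another failing check, so a green CI run can never bypass review by construction.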
That’s the part other companies should study. In a regulated setting, the wrappers matter as much as the model.
The confidence score is worth noting too, but it shouldn’t be oversold. A score built from token probability, syntax validity, and test pass rates can help with triage. It can’t tell you whether the code matches the business logic. Models are often most confident when they’re producing something structurally familiar, not when they’re right in context.
And a green test suite can still be misleading if the tests are thin or stale. Plenty of broken code ships through green CI.
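To see why such a score is triage-only, here is a sketch of one built from the three signals mentioned. The weights are invented for illustration; the structural problem is the same regardless of weights.

```python
# Sketch of a triage-only confidence score: a weighted blend of average
# token probability, syntax validity, and test pass rate. Weights are
# invented. Note what it cannot see: business intent.

def triage_score(mean_token_prob: float,  # avg per-token probability, 0..1
                 syntax_valid: bool,
                 tests_passed: int,
                 tests_total: int) -> float:
    test_rate = tests_passed / tests_total if tests_total else 0.0
    score = (0.3 * mean_token_prob
             + 0.2 * (1.0 if syntax_valid else 0.0)
             + 0.5 * test_rate)
    return round(score, 3)

# High score, but the test suite has only 2 tests, so "2/2 passed" is
# weak evidence; the score can't distinguish thin coverage from thorough.
print(triage_score(0.92, True, tests_passed=2, tests_total=2))  # 0.976
```

A 0.976 here is a prioritization hint for reviewers, nothing more: every input measures structural familiarity, none measures whether the change does what the ticket meant.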
This is software automation, not autonomous trading
The “real-time trading automation” angle needs some discipline. Goldman is piloting Devin for software development workflows tied to trading and internal systems. It is not turning an AI agent loose on market execution.
That distinction matters.
Code generation inside a supervised SDLC is governable. You can log prompts, diff outputs, restrict data access, run scans, require review, and preserve an audit trail. Direct trading decisions raise a much harder set of model risk, explainability, market conduct, and operational resilience problems.
Goldman seems to understand that. The setup described so far keeps humans in the approval loop and uses Devin to speed up engineering work around the systems, not to replace the control structure around them.
Expect other firms to draw the same boundary.
Scaling to hundreds of agents gets messy fast
Running one coding agent for a team is manageable. Running hundreds across a large engineering org becomes an infrastructure and governance problem in a hurry.
You need scheduling, isolation, access policies, cost controls, and logs detailed enough to answer ugly questions later. Who asked for this change? What code context did the agent see? Which credentials were injected? Which tests passed? Who approved the diff? Can the output be reproduced?
That’s why the source material’s references to Kubernetes-style orchestration, immutable logging, and zero-trust execution matter. Those are table stakes for scale.
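Those "ugly questions later" are exactly what immutable logging exists to answer. As a sketch of the idea (field names are illustrative, not any vendor's schema), each audit entry can hash the previous one so that rewriting history breaks the chain:

```python
# Sketch of an append-only, tamper-evident audit log: each entry hashes
# the previous entry, so editing any past record invalidates everything
# after it. Field names are illustrative.
import hashlib
import json

def append_entry(log: list[dict], entry: dict) -> None:
    prev_hash = log[-1]["hash"] if log else "genesis"
    body = {**entry, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)

def verify(log: list[dict]) -> bool:
    prev = "genesis"
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev_hash"] != prev or recomputed != e["hash"]:
            return False
        prev = e["hash"]
    return True

log: list[dict] = []
append_entry(log, {"actor": "agent-042", "requested_by": "jdoe",
                   "repo_slice": "payments/api", "tests_passed": True,
                   "approved_by": "asmith"})
print(verify(log))            # True
log[0]["tests_passed"] = False
print(verify(log))            # False — tampering breaks the chain
```

Production systems use write-once storage or signed logs rather than a Python list, but the auditability property is the same: the record of who asked, what the agent saw, and who approved must be unforgeable after the fact.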
There’s also a plain economics problem. Agents that consume large context windows, make repeated tool calls, and run validation steps aren’t cheap. Spread that across thousands of tasks and the bill gets silly fast unless the workflow is selective. The likely enterprise pattern is tighter routing: send repetitive, testable, high-volume work to agents, and leave architecture, edge cases, and domain-sensitive review to humans.
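A back-of-envelope calculation shows why. The prices below are invented round numbers, not any vendor's actual rates, and the "re-send context per tool call" pattern is a simplifying assumption about the agent loop:

```python
# Back-of-envelope cost sketch with invented prices ($3 per million input
# tokens, $15 per million output tokens; real pricing varies by model).
# Assumes each tool call re-sends the context, a common agent-loop shape.

def task_cost(context_tokens: int, output_tokens: int, tool_calls: int,
              in_per_m: float = 3.0, out_per_m: float = 15.0) -> float:
    total_in = context_tokens * (1 + tool_calls)
    return (total_in / 1e6) * in_per_m + (output_tokens / 1e6) * out_per_m

# One task over a 100k-token repo slice with 5 tool calls:
per_task = task_cost(100_000, 8_000, tool_calls=5)
print(f"${per_task:.2f} per task, ${per_task * 10_000:,.2f} across 10k tasks")
```

Under these toy numbers a single task costs a couple of dollars, which is fine, and ten thousand unselectively routed tasks cost tens of thousands, which is not. That asymmetry is what pushes enterprises toward the routing pattern described above.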
What developers should take from this
The lazy takeaway is that banks are replacing engineers with AI employees. That’s not what this looks like.
What’s actually happening is a layered workflow where AI handles chunks of implementation and humans stay accountable for intent, review, and integration. That changes where senior engineers spend their time.
If this model holds, strong developers become even more valuable in a few specific areas:
- breaking work into tasks an agent can execute cleanly
- spotting generated code that passes checks but misses the point
- designing repos, APIs, and service boundaries that are easier for humans and agents to work with
- building the guardrails, audit systems, and review loops around the agents
That last part still doesn’t get enough attention. There’s going to be real demand for platform teams that can operationalize coding agents safely. Internal tooling, policy enforcement, provenance tracking, sandboxing, evaluation pipelines. A lot of AI adoption talk is fluff. This part is a real engineering discipline.
The bigger signal
Goldman’s Devin pilot matters less as a product endorsement than as a signal about where enterprise AI coding is going. The market is moving away from generic suggestion boxes and toward managed agents with bounded autonomy, tool access, and audit trails.
That’s probably the next phase.
The winners are unlikely to be the tools with the flashiest demos. They’ll be the ones that survive procurement, security review, and internal platform integration.
For engineering leaders, the question is practical: does your organization have the plumbing to use systems like this without creating a governance mess? Goldman is betting that it does. Most companies still don’t.