Goldman Sachs Is Piloting Devin, and That Says a Lot About Where AI Coding Tools Are Headed
Goldman Sachs is testing Cognition’s AI coding agent Devin inside the bank, and the way it’s talking about the rollout is unusually direct. CIO Marco Argenti told CNBC the firm plans to deploy hundreds of Devin instances alongside its 12,000 human developers, with room to scale further.
Two parts of that stand out.
Goldman is treating Devin as assigned software labor. Supervised software labor, but still labor.
And this is happening in a place where bad code has real consequences. A bank can live with a flaky internal demo. It can’t live with broken audit trails, mishandled secrets, or hidden risk creeping into trading systems.
So the useful question here is narrower than the usual AI-can-it-code debate. Goldman is testing which kinds of software work can be fenced in tightly enough for an agent to be usable inside a regulated enterprise.
Why Goldman is a useful test case
Lots of companies use AI coding tools now. That alone isn’t interesting. Goldman is, because the constraints are harsh and real.
The firm already had internal developer copilots in 2024. It has the platform engineering budget, the security posture, and the compliance burden to run this kind of test properly. If Devin can hold up there, it has a plausible path into insurance, pharma, defense, and other sectors where governance is part of the job.
Argenti’s description of Devin as a “new employee” is obviously a bit theatrical, but the framing matters. Big companies are starting to position these systems as task-driven agents that can take a ticket, inspect a codebase, write changes, run checks, and hand back a result for review.
That’s a different tool category from chat-based code suggestion. It comes with a different risk profile too.
Devin’s appeal is context
The source material points to Devin v2.1, released in May 2025, with context windows up to 128k tokens. That matters in large enterprise repos, where the hard part is often finding the right place to make a change, understanding adjacent abstractions, and avoiding collateral damage in some neighboring service.
A larger context window doesn’t give the model human-style understanding of software. It does reduce fragmentation. Earlier coding models working with 16k or 32k tokens regularly lost coherence in multi-module projects. They produced code that looked fine on its own and missed how the repository actually fit together.
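To make the budget point concrete, here is a minimal sketch of the kind of check an agent harness might run before loading a repo slice. The ~4-characters-per-token approximation and the reserve size are rough assumptions for illustration; real agents use the model's own tokenizer.

```python
# Rough sketch: check whether a set of repo files fits a model's context
# budget. Tokens are approximated as ~4 characters each, a common rule of
# thumb; a real harness would use the model's actual tokenizer.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(files: dict[str, str], budget: int = 128_000,
                 reserve: int = 16_000) -> bool:
    """True if the files plus a reserve for the prompt and the model's
    own output fit inside the context budget."""
    used = sum(approx_tokens(src) for src in files.values())
    return used + reserve <= budget

# Two medium services (~100k tokens combined) fit a 128k window but
# overflow a 32k one, which is where cross-module coherence breaks down:
repo = {"svc_a.py": "x" * 200_000, "svc_b.py": "y" * 200_000}
print(fits_context(repo, budget=128_000))  # True
print(fits_context(repo, budget=32_000))   # False
```

The interesting failure mode isn't the hard overflow; it's what smaller-window models did when forced to work on fragments of a repo that only makes sense whole.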
In enterprise development, that’s the line between a decent assistant and a cleanup machine.
Goldman reportedly wants Devin handling routine coding tasks such as boilerplate and integration scripts. That’s a sensible starting point. Agentic coding tools usually fare better when the task has:
- clear interfaces
- established repo conventions
- fast validation through tests or static checks
- low ambiguity about business intent
They struggle when the work depends on undocumented assumptions, legacy edge cases, or domain logic that mostly lives in people’s heads.
Banks have plenty of that.
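The criteria above amount to a routing policy. This is an illustrative sketch, not Goldman's actual logic: score a ticket against those properties and send it to an agent only when every one favors automation.

```python
# Illustrative task-routing heuristic (invented for this article, not a
# real bank policy): agents get tasks only when every criterion favors
# automation; anything ambiguous or tribal-knowledge-dependent stays human.
from dataclasses import dataclass

@dataclass
class Task:
    has_clear_interface: bool
    follows_repo_conventions: bool
    has_fast_validation: bool        # tests or static checks give quick feedback
    ambiguous_intent: bool           # business intent unclear or undocumented
    relies_on_tribal_knowledge: bool # domain logic that lives in people's heads

def route(task: Task) -> str:
    if task.ambiguous_intent or task.relies_on_tribal_knowledge:
        return "human"
    if all([task.has_clear_interface,
            task.follows_repo_conventions,
            task.has_fast_validation]):
        return "agent"
    return "human"

boilerplate = Task(True, True, True, False, False)
legacy_edge_case = Task(True, True, False, True, True)
print(route(boilerplate))       # agent
print(route(legacy_edge_case))  # human
```

Note the asymmetry: a single disqualifier sends the task to a human, which matches how conservative these pilots tend to be.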
The part that matters most: controls
The flashy model details aren’t the main thing here. The guardrails are.
According to the source material, Devin’s workflow inside Goldman includes static analysis, type inference, confidence scoring, execution in ephemeral containers, and RBAC-limited access to repositories and secrets. That stack is exactly what you’d expect in a bank. If you’re introducing an autonomous or semi-autonomous coding system into that environment, the control layer has to be stricter than the model.
A plausible pipeline looks like this:
- Devin ingests a bounded slice of the codebase.
- Pre-processing runs tools like `mypy` or `ESLint` to annotate the terrain.
- The agent generates code for a specific task.
- The result is validated against tests, syntax checks, and security scans.
- Execution happens in a sandboxed container with temporary credentials.
- A human reviews the output before anything lands in a protected branch.
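The last two steps of that pipeline can be sketched as a single gate function. The check names and structure below are placeholders, assumed for illustration; a real pipeline would invoke the org's own linters, test runners, and scanners inside the ephemeral container.

```python
# Minimal sketch of the validation-and-review gate described above.
# Check names are placeholders; the point is the AND: every automated
# check must pass AND a human must approve before a diff can land.
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)

def run_gate(diff: str, checks: dict[str, bool], human_approved: bool) -> GateResult:
    """A diff reaches a protected branch only if all checks pass and a
    human reviewer has signed off."""
    failures = [name for name, ok in checks.items() if not ok]
    if not human_approved:
        failures.append("human_review")
    return GateResult(passed=not failures, failures=failures)

result = run_gate(
    diff="...generated change...",
    checks={"syntax": True, "type_check": True,
            "tests": True, "security_scan": True},
    human_approved=False,
)
print(result.passed, result.failures)  # False ['human_review']
```

The design choice worth copying: human approval is modeled as just another failing check, so a green CI run can never bypass review by construction.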
That’s the part other companies should study. In a regulated setting, the wrappers matter as much as the model.
The confidence score is worth noting too, but it shouldn’t be oversold. A score built from token probability, syntax validity, and test pass rates can help with triage. It can’t tell you whether the code matches the business logic. Models are often most confident when they’re producing something structurally familiar, not when they’re right in context.
And a green test suite can still be misleading if the tests are thin or stale. Plenty of broken code ships through green CI.
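To see why such a score is triage-only, here is a sketch of one built from the three signals mentioned. The weights are invented for illustration; the structural problem is the same regardless of weights.

```python
# Sketch of a triage-only confidence score: a weighted blend of average
# token probability, syntax validity, and test pass rate. Weights are
# invented. Note what it cannot see: business intent.

def triage_score(mean_token_prob: float,  # avg per-token probability, 0..1
                 syntax_valid: bool,
                 tests_passed: int,
                 tests_total: int) -> float:
    test_rate = tests_passed / tests_total if tests_total else 0.0
    score = (0.3 * mean_token_prob
             + 0.2 * (1.0 if syntax_valid else 0.0)
             + 0.5 * test_rate)
    return round(score, 3)

# High score, but the test suite has only 2 tests, so "2/2 passed" is
# weak evidence; the score can't distinguish thin coverage from thorough.
print(triage_score(0.92, True, tests_passed=2, tests_total=2))  # 0.976
```

A 0.976 here is a prioritization hint for reviewers, nothing more: every input measures structural familiarity, none measures whether the change does what the ticket meant.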
This is software automation, not autonomous trading
The “real-time trading automation” angle needs some discipline. Goldman is piloting Devin for software development workflows tied to trading and internal systems. It is not turning an AI agent loose on market execution.
That distinction matters.
Code generation inside a supervised SDLC is governable. You can log prompts, diff outputs, restrict data access, run scans, require review, and preserve an audit trail. Direct trading decisions raise a much harder set of model risk, explainability, market conduct, and operational resilience problems.
Goldman seems to understand that. The setup described so far keeps humans in the approval loop and uses Devin to speed up engineering work around the systems, not to replace the control structure around them.
Expect other firms to draw the same boundary.
Scaling to hundreds of agents gets messy fast
Running one coding agent for a team is manageable. Running hundreds across a large engineering org becomes an infrastructure and governance problem in a hurry.
You need scheduling, isolation, access policies, cost controls, and logs detailed enough to answer ugly questions later. Who asked for this change? What code context did the agent see? Which credentials were injected? Which tests passed? Who approved the diff? Can the output be reproduced?
That’s why the source material’s references to Kubernetes-style orchestration, immutable logging, and zero-trust execution matter. Those are table stakes for scale.
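Those "ugly questions later" are exactly what immutable logging exists to answer. As a sketch of the idea (field names are illustrative, not any vendor's schema), each audit entry can hash the previous one so that rewriting history breaks the chain:

```python
# Sketch of an append-only, tamper-evident audit log: each entry hashes
# the previous entry, so editing any past record invalidates everything
# after it. Field names are illustrative.
import hashlib
import json

def append_entry(log: list[dict], entry: dict) -> None:
    prev_hash = log[-1]["hash"] if log else "genesis"
    body = {**entry, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)

def verify(log: list[dict]) -> bool:
    prev = "genesis"
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev_hash"] != prev or recomputed != e["hash"]:
            return False
        prev = e["hash"]
    return True

log: list[dict] = []
append_entry(log, {"actor": "agent-042", "requested_by": "jdoe",
                   "repo_slice": "payments/api", "tests_passed": True,
                   "approved_by": "asmith"})
print(verify(log))            # True
log[0]["tests_passed"] = False
print(verify(log))            # False — tampering breaks the chain
```

Production systems use write-once storage or signed logs rather than a Python list, but the auditability property is the same: the record of who asked, what the agent saw, and who approved must be unforgeable after the fact.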
There’s also a plain economics problem. Agents that consume large context windows, make repeated tool calls, and run validation steps aren’t cheap. Spread that across thousands of tasks and the bill gets silly fast unless the workflow is selective. The likely enterprise pattern is tighter routing: send repetitive, testable, high-volume work to agents, and leave architecture, edge cases, and domain-sensitive review to humans.
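A back-of-envelope calculation shows why. The prices below are invented round numbers, not any vendor's actual rates, and the "re-send context per tool call" pattern is a simplifying assumption about the agent loop:

```python
# Back-of-envelope cost sketch with invented prices ($3 per million input
# tokens, $15 per million output tokens; real pricing varies by model).
# Assumes each tool call re-sends the context, a common agent-loop shape.

def task_cost(context_tokens: int, output_tokens: int, tool_calls: int,
              in_per_m: float = 3.0, out_per_m: float = 15.0) -> float:
    total_in = context_tokens * (1 + tool_calls)
    return (total_in / 1e6) * in_per_m + (output_tokens / 1e6) * out_per_m

# One task over a 100k-token repo slice with 5 tool calls:
per_task = task_cost(100_000, 8_000, tool_calls=5)
print(f"${per_task:.2f} per task, ${per_task * 10_000:,.2f} across 10k tasks")
```

Under these toy numbers a single task costs a couple of dollars, which is fine, and ten thousand unselectively routed tasks cost tens of thousands, which is not. That asymmetry is what pushes enterprises toward the routing pattern described above.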
What developers should take from this
The lazy takeaway is that banks are replacing engineers with AI employees. That’s not what this looks like.
What’s actually happening is a layered workflow where AI handles chunks of implementation and humans stay accountable for intent, review, and integration. That changes where senior engineers spend their time.
If this model holds, strong developers become even more valuable in a few specific areas:
- breaking work into tasks an agent can execute cleanly
- spotting generated code that passes checks but misses the point
- designing repos, APIs, and service boundaries that are easier for humans and agents to work with
- building the guardrails, audit systems, and review loops around the agents
That last part still doesn’t get enough attention. There’s going to be real demand for platform teams that can operationalize coding agents safely. Internal tooling, policy enforcement, provenance tracking, sandboxing, evaluation pipelines. A lot of AI adoption talk is fluff. This part is a real engineering discipline.
The bigger signal
Goldman’s Devin pilot matters less as a product endorsement than as a signal about where enterprise AI coding is going. The market is moving away from generic suggestion boxes and toward managed agents with bounded autonomy, tool access, and audit trails.
That’s probably the next phase.
The winners are unlikely to be the tools with the flashiest demos. They’ll be the ones that survive procurement, security review, and internal platform integration.
For engineering leaders, the question is practical: does your organization have the plumbing to use systems like this without creating a governance mess? Goldman is betting that it does. Most companies still don’t.