OpenAI’s Agents SDK grows up with sandboxed workspaces and a real runtime story
OpenAI has updated its Agents SDK with two features enterprise teams have been asking for: sandboxed workspaces and a supported runtime stack for long-running agents.
That may sound like plumbing. It is. It’s also the part that usually breaks once an agent leaves the demo stage.
Models can already plan, call tools, and work with files well enough to look good in a notebook. The harder questions come right after that. Where does the code run? What can it access? How do you stop it from roaming across the network, leaking data, or trashing a shared environment after 40 steps and a few retries?
This SDK update goes straight at those problems. It’s available for Python now, with TypeScript support planned, and exposed through the API under standard pricing. OpenAI is also previewing code mode and subagents, which points to a broader push toward modular agent systems that can survive production.
Isolation is the useful part
The strongest piece of this release is also the least flashy.
Sandboxed workspaces give each agent or job a bounded execution environment with scoped file access, restricted tools, controlled network egress, and runtime-injected secrets. That fits how enterprise security teams already think about CI jobs, ephemeral containers, and service accounts.
The controls OpenAI is exposing will look familiar:
- tool_allowlist to limit which tools an agent can call
- fs_mounts for read-only and read-write paths
- net_policies to define outbound access and deny everything else
- short-lived secrets injected at runtime instead of hanging around in memory
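Of those four, runtime-injected secrets are the piece teams most often get wrong with plain environment variables. A minimal sketch of what scoped injection could look like; the vault reference scheme, field names, and workspace shape are assumptions for illustration, not a documented SDK schema:

# Hypothetical workspace fragment; "vault://" references, ttl_seconds, and
# the field names are illustrative assumptions, not a published schema.
workspace = {
    "id": "proj-invoice-recon",
    "secrets": [
        # Resolved by the runtime when the job starts, revoked when it ends;
        # the raw value never sits in a prompt, a log line, or a .env file.
        {"name": "FINANCE_API_TOKEN", "ref": "vault://finance/api-token",
         "ttl_seconds": 900},
    ],
}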
This is where agent projects usually get painful. Tool use is the value proposition, but it’s also the biggest risk surface. Give a model broad filesystem access and unrestricted HTTP calls, and you’ve built a compliance problem with a prompt interface.
A sandbox doesn’t fix everything. It does improve the default posture, and that matters. Most teams don’t need an agent with unlimited freedom. They need one that can read a repo, write to a temp directory, call two internal APIs, and stop there.
That’s a lot easier to approve.
OpenAI is selling the control plane too
The other big change is what OpenAI calls an in-distribution runtime for advanced models. Bad name. Good idea.
OpenAI is packaging a supported orchestration layer around the model so teams can build, test, and run agents against the same control stack they’ll use in production. In practice, that means the pieces every serious agent system ends up building anyway (a minimal loop wiring them together follows the list):
- a planner for breaking goals into steps
- a tool router for dispatch, retries, and timeouts
- a state store for scratchpads and intermediate artifacts
- policy enforcement around allowed actions
- tracing and events for debugging and billing
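Here is that loop in one place. A minimal sketch under obvious assumptions: every name (planner, tool_router, state_store, policy, tracer) is illustrative, showing the shape any agent runtime converges on rather than OpenAI’s actual implementation:

# Illustrative orchestration loop; every name is hypothetical, not an SDK API.
from collections import deque

def run_agent(goal, planner, tool_router, state_store, policy, tracer):
    steps = deque(planner.plan(goal))             # planner: goal -> ordered steps
    while steps:
        step = steps.popleft()
        if not policy.allows(step):               # policy check before dispatch
            tracer.event("blocked", step=step)
            continue
        try:
            result = tool_router.dispatch(step)   # router owns retries and timeouts
        except TimeoutError:
            tracer.event("timeout", step=step)
            steps = deque(planner.replan(goal, state_store))  # re-plan from saved state
            continue
        state_store.save(step, result)            # scratchpad / intermediate artifacts
        tracer.event("step_done", step=step)      # trace feed for debugging and billing
    return state_store.final_artifacts()

The pitch of a vendor-supported runtime is that this loop, including its retry and re-planning behavior, is identical in development and production.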
That matters because agent failures rarely come from one model call. They come from the interaction between planning, state, tool output, retries, partial failure, and weird edge cases after step 27. If development and production handle those pieces differently, you get orchestration drift. A workflow that behaves in testing starts acting strangely under load or after a timeout chain.
OpenAI is trying to remove some of that by owning more of the stack.
There’s an obvious competitive angle. For the past year, plenty of teams have stitched together LangGraph-style orchestration, internal policy checks, telemetry hooks, and some container wrapper nobody wants to maintain. It works, until it doesn’t. A vendor-supported runtime tied closely to frontier models is attractive if you want fewer moving parts and less glue code.
The trade-off is straightforward. You lose flexibility, and probably accept more vendor lock-in than you’d prefer. A lot of enterprise teams will still take that deal.
Long-running agents expose the real problems
OpenAI is framing these updates around long-horizon tasks, and that’s the right frame.
One tool call is manageable. An 80-step workflow that reads files, edits artifacts, retries flaky API calls, and pauses for approval before doing something destructive is a different class of system. Small errors pile up. Shortcuts turn into reliability problems. Variance that looks harmless in a one-shot prompt starts dragging down results over dozens of steps.
That’s why checkpointing, idempotent tools, and human approval gates matter more than another polished planning demo.
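None of those three is exotic. A minimal sketch of the idempotency-plus-approval pattern, with every name assumed for illustration:

# Illustrative pattern, not an SDK API: idempotency keys plus a human gate
# make a long retry chain safe to replay.
applied: set[str] = set()   # a durable store (DB, object storage) in real systems

def run_step(tool_fn, args: dict, idempotency_key: str,
             destructive: bool = False, approved: bool = False):
    if idempotency_key in applied:
        # A retry after a crash or timeout lands here and does nothing twice.
        return {"status": "skipped", "reason": "already applied"}
    if destructive and not approved:
        # Park the step for a human gate instead of acting.
        return {"status": "pending_approval"}
    result = tool_fn(**args)
    applied.add(idempotency_key)   # mark success only after the effect lands
    return {"status": "ok", "result": result}

Run the same step twice after a timeout and the second call is a no-op; mark anything destructive as unapproved and a human has to sign off before it executes.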
A sane setup looks something like this:
# Illustrative configuration; AgentSDK.create and these field names are a
# sketch of the controls described above, not a published API surface.
agent = AgentSDK.create(
    model="frontier-X",
    workspace={
        "id": "proj-mlops-migration",
        "fs_mounts": [
            {"path": "/workspace/repo", "mode": "ro"},  # read the repo, never write it
            {"path": "/workspace/tmp", "mode": "rw"}    # scratch space only
        ],
        "net_policies": {
            "egress_allow": ["https://api.internal.company"],
            "default": "deny"                           # everything else is blocked
        }
    },
    tools={
        "allow": ["repo_browser", "sql_client", "ticket_api"],
        "timeouts": {"sql_client": 15}                  # seconds
    },
    policies={
        "writes": {"allow_paths": ["/workspace/tmp"]},
        "human_gate": [{"step": "apply_migration"}]     # approval before destructive work
    },
    generation={"temperature": 0.2}                     # keep generation conservative
)
The priorities there are sensible. Keep generation conservative. Restrict writes. Deny network access by default. Put a human gate in front of anything destructive.
That’s the kind of autonomy most companies can live with.
Python first makes sense
Shipping Python first is the obvious choice, and likely the right one.
Most enterprise automation still runs through Python. Data pipelines, internal tooling, MLOps scripts, batch jobs, repo automation, support workflows, ticketing glue, even a depressing amount of infrastructure logic. If OpenAI wants this SDK used inside real companies instead of at agent hackathons, Python is where those teams already are.
TypeScript will matter once the focus shifts toward browser workflows, frontend-heavy systems, and customer-facing apps. For the first wave of deployment, Python is the easier path.
There’s also a practical reason. Long-running agents often need access to data tooling, notebooks, ETL jobs, and internal service wrappers that already exist in Python. A Python-native SDK shortens the path from prototype to something a platform team can actually operate.
Where it helps, and where it won’t
This release fits workflows with bounded scope and a lot of coordination work. For example:
- triaging support tickets with internal policy checks
- reconciling invoices against finance systems
- scanning a codebase, preparing a patch, and opening a PR
- reviewing alerts, gathering context, and drafting an incident summary
- performing staged operations with approval gates
These are multi-step, tool-heavy tasks with enough structure to define safe boundaries.
It’s less convincing when tool access is wide open, task definitions are fuzzy, or the environment itself is chaotic. A sandboxed runtime can limit damage, but it won’t rescue a badly designed workflow. If your tool contracts are inconsistent, your APIs are flaky, and the business logic lives across five wiki pages, the agent will still behave like a confused intern with shell access.
That’s not OpenAI’s fault. The SDK also won’t save you from it.
Security teams will care more than prompt engineers
There’s a noticeable shift here.
For the past year, agent tooling has largely followed what model builders and app teams wanted: better planning, richer tool use, more memory, more autonomy. This update leans toward what security, compliance, and platform teams care about: isolation, policy enforcement, auditability, repeatability.
That usually signals a market getting more serious. Enterprise adoption rarely stalls because the model can’t do enough. It stalls because the controls around it are thin.
If OpenAI’s tracing format plugs cleanly into the observability stacks companies already run, and if the policy hooks are flexible enough to match internal approval flows, this gets easier to buy. OpenTelemetry alignment would help, though OpenAI hasn’t put that front and center.
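For instance, if the runtime’s trace events can be forwarded as spans, wiring them into an existing OpenTelemetry pipeline might look like the sketch below. The event shape (type, workspace_id, tool, step fields) is an assumption; the OpenTelemetry calls are the library’s real Python API:

# Sketch: forwarding agent runtime events into an OpenTelemetry pipeline.
# The event dict's fields are assumed, not a documented trace format.
from opentelemetry import trace

tracer = trace.get_tracer("agent-runtime")

def on_agent_event(event: dict):
    # One span per event, tagged so existing dashboards can slice by
    # workspace, tool, and step number.
    with tracer.start_as_current_span(event.get("type", "agent_event")) as span:
        span.set_attribute("agent.workspace", event.get("workspace_id", ""))
        span.set_attribute("agent.tool", event.get("tool", ""))
        span.set_attribute("agent.step", event.get("step", -1))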
It also raises the bar for the rest of the agent ecosystem. Anthropic, Google, Microsoft, and open source frameworks all have some answer for tool use and orchestration. The baseline is shifting. “Production-ready” now needs scoped execution, reliable state handling, and decent telemetry. Without those pieces, the stack looks half-finished.
The downside is vendor gravity
There’s a cost to this approach.
The more OpenAI owns the runtime, the more your agent system starts to depend on OpenAI-specific assumptions about planning, state, policy, tracing, and deployment. That may be acceptable if you’re already committed to its models and want speed. It’s less appealing if you expect to swap providers, mix orchestration backends, or keep your control plane independent.
This is a familiar platform bargain. You get tighter integration and fewer components to assemble. You also inherit someone else’s abstractions.
For a lot of teams, especially those running a 60- to 90-day pilot, that trade-off is fine. The bigger near-term risk is building an agent stack nobody can secure, debug, or operate after the demo.
How I’d approach it
If you’re evaluating this SDK, keep the first pilot narrow.
Pick one workflow with clear boundaries and visible business value. Lock down tool access aggressively. Deny egress by default. Treat the sandbox as untrusted. Make destructive actions idempotent or approval-gated. Turn on full audit logging on day one. Keep the prompts boring. In production, reliability matters a lot more than flair.
That’s what OpenAI seems to understand with this release.
The stack around agents is finally starting to look like something a serious company could run.