Box’s Aaron Levie has the right read on enterprise AI: agents need a control plane, not a coup
Aaron Levie made a useful point at TechCrunch Disrupt. Enterprise SaaS apps are not about to vanish under a swarm of autonomous agents. They’re becoming the structured layer agents sit on top of.
That matters because a lot of enterprise AI is still stuck in copilot mode, which often means a chatbot glued onto a workflow that was never built to handle model uncertainty. Levie’s framing is more grounded: keep the core system strict, auditable, and policy-driven, then let AI work where ambiguity is acceptable and speed helps.
He described it as a kind of “church and state” split between deterministic software and non-deterministic AI. That’s a solid model, especially for teams wiring agents into finance, legal, HR, or customer ops.
The architecture shift is real
The key part of Levie’s argument is straightforward. Enterprise software needs a layer that manages agents safely.
Call it an AI control plane. A lot of vendors already do.
That layer handles model routing, prompt management, tool definitions, policy checks, redaction, logs, retries, fallbacks, and cost controls. Without it, agentic software turns into a messy stack of prompt chains with too much access.
The pattern taking shape is familiar:
- A deterministic core for workflows, approvals, identity, audit logs, and data integrity
- An AI orchestration edge for retrieval, planning, summarization, tool use, and low-confidence decision support
That split has real consequences. There’s a big difference between a model suggesting the next step and a model closing the books after reading a few documents.
Enterprise teams already know how to build the first part. BPMN engines, state machines, RBAC, ABAC, immutable logs, versioned APIs, idempotent writes. Boring systems. Also the systems that keep companies out of trouble.
The second part is newer and less settled. Tool-calling runtimes, JSON schema validation, constrained decoding, retrieval pipelines, agent traces, prompt registries, and policy engines such as OPA are getting better fast. They still need a hard boundary around what the model can see, suggest, and do.
That boundary is where the real engineering work starts.
“Non-deterministic propose, deterministic commit” still looks right
The cleanest design rule here is simple: let the model propose, let software commit.
In practice, an agent can read context through APIs or RAG, build a plan, fill structured arguments for a tool call, and rank possible actions. The actual state change should still go through deterministic services with permission checks, validation, and audit.
A sensible production flow looks like this:
- Agent reads from scoped APIs or policy-filtered retrieval.
- It proposes an action plan in a structured format.
- A policy layer checks scope, tenancy, risk level, and preconditions.
- High-risk actions go through human approval or a sandbox.
- A normal application service executes the write through idempotent APIs.
Yes, that’s conservative. Good.
A lot of autonomous-agent demos still jump straight from model output to side effects. Fine for a stage demo. Bad idea for procurement, claims processing, or regulated document workflows.
If you’re shipping agents into enterprise systems, assume the model will eventually do something weird. Build so that weird behavior is visible and contained.
Levie’s pricing point matters too
Levie also made a blunt commercial point: per-seat pricing starts to break when agents outnumber people by 100x or 1,000x.
He’s right. Traditional SaaS pricing assumes the unit of value is a human user sitting in front of a UI. In agent-heavy software, the billable unit may be a workflow run, a batch of tool calls, retrieved documents, inference volume, or a completed outcome.
That shift is going to hit product design faster than a lot of incumbents expect.
If your platform suddenly has thousands of non-human identities triaging tickets, drafting contracts, checking policy, or reconciling documents, named-seat pricing stops making sense. So does admin tooling built around the assumption that every actor is a person.
Engineering teams should care because pricing pushes architecture around. Once billing depends on actions and compute, you need reliable metering for:
- token and inference usage
- retrieval and vector storage operations
- tool invocation counts
- latency and retry behavior
- per-tenant cost attribution
- concurrency limits and kill switches
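A metering layer for the list above doesn't need to be elaborate to start. Here's a minimal per-tenant sketch; the field names and the price points are assumptions for illustration, not anyone's real rate card.

```python
from collections import defaultdict

# Minimal per-tenant usage meter. Field names are illustrative; real
# systems would also track latency, retries, and model versions.
class UsageMeter:
    def __init__(self):
        self.totals = defaultdict(
            lambda: {"tokens": 0, "tool_calls": 0, "retrieval_ops": 0}
        )

    def record(self, tenant, *, tokens=0, tool_calls=0, retrieval_ops=0):
        t = self.totals[tenant]
        t["tokens"] += tokens
        t["tool_calls"] += tool_calls
        t["retrieval_ops"] += retrieval_ops

    def cost(self, tenant, token_price=0.002 / 1000, tool_price=0.0001):
        # Hypothetical unit prices; billing config would supply these.
        t = self.totals[tenant]
        return t["tokens"] * token_price + t["tool_calls"] * tool_price

meter = UsageMeter()
meter.record("acme", tokens=12_000, tool_calls=40)
meter.record("acme", tokens=3_000, retrieval_ops=15)
print(meter.totals["acme"]["tokens"])  # 15000
```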
A lot of AI products still treat cost tracking like cleanup work. That won’t hold up in enterprise deployment.
Identity is the hard part
Levie’s observation that agents will vastly outnumber people also exposes a weak spot in a lot of current systems: identity and access control for non-human actors.
Every agent should be treated as a first-class identity, not a vague extension of a user session.
That means scoped OAuth clients, short-lived credentials, explicit tenancy binding, and per-tool permissions. If an agent can search documents but not export them, or draft an invoice but not approve payment, that has to be enforced at the service boundary. A prompt is not a security control.
A reasonable baseline looks like this:
- issue short-TTL credentials through OAuth2 client credentials flow or equivalent
- scope every tool separately
- route actions through a policy decision point
- maintain immutable logs for prompt input, tool call, and downstream commit
- keep retrieval tenant-aware at the index layer, not just the app layer
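A per-tool scope check against a short-TTL credential might look like the sketch below. The credential shape and scope naming (`tenant:tool`) are assumptions for the example; real deployments would use signed tokens from an OAuth2 issuer.

```python
import time

# Illustrative short-TTL credential and per-tool authorization check,
# enforced at the service boundary rather than in the prompt.
def make_credential(agent_id: str, scopes: set, ttl_s: float = 300.0) -> dict:
    return {
        "agent_id": agent_id,
        "scopes": scopes,
        "expires_at": time.monotonic() + ttl_s,
    }

def authorize(credential: dict, tool: str, tenant: str) -> bool:
    # Expiry first, then per-tool scope: search and export are separate grants.
    if time.monotonic() >= credential["expires_at"]:
        return False
    return f"{tenant}:{tool}" in credential["scopes"]

cred = make_credential("agent-7", {"acme:documents.search"})
print(authorize(cred, "documents.search", "acme"))  # True
print(authorize(cred, "documents.export", "acme"))  # False: never granted
```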
Retrieval is an easy place to get sloppy. If your RAG pipeline pulls from shared embeddings without strong tenant isolation or row-level filtering, one bad config can turn into a quiet data leak. Enterprise buyers are right to ask about that now.
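The tenant-isolation point comes down to ordering: filter before ranking, not after. A toy sketch, with a plain list standing in for a vector index and substring match standing in for similarity search:

```python
# Illustrative tenant filter applied at the index layer: every stored chunk
# carries a tenant tag, and queries are scoped before any ranking happens.
index = [
    {"tenant": "acme", "text": "acme Q3 contract terms"},
    {"tenant": "globex", "text": "globex payroll policy"},
]

def tenant_scoped_search(query: str, tenant: str):
    # Filter FIRST, then rank. Ranking a shared index and filtering the
    # results afterward is where cross-tenant leaks tend to slip in.
    candidates = [doc for doc in index if doc["tenant"] == tenant]
    return [doc for doc in candidates if query.lower() in doc["text"].lower()]

print(tenant_scoped_search("policy", "acme"))  # [] -- globex docs never considered
```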
Startups do have an opening
Levie is also right that startups have an edge if they build agent-first from the start. They don’t have to retrofit a 2016-style UX onto software that should be event-driven, API-first, and mostly invisible until a human needs to step in.
Still, “we use agents” is not a wedge. Everybody says that now. The real opening is lower operational complexity.
If a startup can keep the model inside tight schemas, make tool calls strongly typed, sandbox risky actions, and bring in human review only where it’s warranted, the product can feel much faster than a legacy suite.
There’s a catch. Agent-first systems come with a nasty reliability tax.
Latency piles up across retrieval, reasoning, tool execution, retries, and approval gates. Failures get harder to classify. Evaluation gets expensive. Regression testing gets strange because outputs are probabilistic while the surrounding business system still needs deterministic outcomes.
Ignore that and you get software that demos well and generates support tickets later.
The work that matters here is not glamorous:
- cache embeddings and deterministic tool responses
- prefetch context on event triggers
- use queues for long-running agent jobs
- dedupe requests with idempotency keys
- add circuit breakers when confidence drops or tools time out
- run offline evals on golden datasets
- shadow-test new prompts and models before rollout
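To make one of those items concrete, here is a minimal circuit breaker around a flaky tool. The thresholds and names are illustrative; the idea is just that after repeated failures, calls stop reaching the tool at all until a cool-off passes.

```python
import time

# Minimal circuit breaker for a tool call path (thresholds are examples).
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: tool temporarily disabled")
            self.failures, self.opened_at = 0, None  # half-open: try again
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

calls = {"n": 0}
def flaky_tool():
    calls["n"] += 1
    raise TimeoutError("tool timed out")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(2):
    try:
        breaker.call(flaky_tool)
    except TimeoutError:
        pass
# Third attempt is rejected without touching the tool at all.
try:
    breaker.call(flaky_tool)
except RuntimeError as e:
    print(e)  # circuit open: tool temporarily disabled
```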
That’s the gap between an agent feature and an agent platform.
What developers and tech leads should do now
If you’re shipping AI into enterprise software this year, a few priorities are hard to dodge.
Build the control plane early
Centralize model access behind an internal SDK or gateway. Standardize prompt versioning, tool schemas, retries, safety filters, and logs. Don’t let every team call foundation models directly through one-off wrappers.
That’s how you end up with six agent systems, no observability, and a miserable security review.
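A gateway like that can start very small. The sketch below is an assumption-heavy toy: `ModelGateway`, the backend callables, and the prompt registry keys are all made-up names standing in for whatever internal SDK a team would actually build.

```python
import logging

# Sketch of an internal model gateway: one choke point for prompt
# versioning, audit logging, and (eventually) retries and safety filters.
class ModelGateway:
    def __init__(self, backends: dict, prompt_registry: dict):
        self.backends = backends        # model name -> callable
        self.prompts = prompt_registry  # prompt id -> versioned template

    def complete(self, model: str, prompt_id: str, **vars) -> str:
        template = self.prompts[prompt_id]  # only reviewed, versioned prompts
        prompt = template.format(**vars)
        logging.info("model=%s prompt_id=%s", model, prompt_id)  # one audit path
        return self.backends[model](prompt)  # retry/filter logic would wrap this

gw = ModelGateway(
    backends={"small": lambda p: f"echo:{p}"},  # stub backend for the sketch
    prompt_registry={"summarize.v2": "Summarize: {text}"},
)
print(gw.complete("small", "summarize.v2", text="quarterly report"))
```

Because every call passes through `complete`, swapping a model, bumping a prompt version, or adding a safety filter happens in one place instead of six.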
Treat tool interfaces like real APIs
Use typed inputs, JSON schema validation, and narrow permissions. If a model can call a tool, the output should be machine-checkable before anything runs.
Free-form text works for summaries. It’s a bad format for payroll actions.
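Machine-checkable tool arguments can be as simple as a schema check before execution. This hand-rolled validator stands in for real JSON Schema validation, and the payroll field names are made up for the example:

```python
# Validate tool arguments against a declared schema before anything runs.
# PAYROLL_ADJUST_SCHEMA is a hypothetical tool signature.
PAYROLL_ADJUST_SCHEMA = {
    "employee_id": str,
    "amount_cents": int,
    "reason": str,
}

def validate_args(schema: dict, args: dict) -> list:
    errors = [f"missing: {k}" for k in schema if k not in args]
    errors += [f"unexpected: {k}" for k in args if k not in schema]
    errors += [
        f"bad type: {k}" for k, t in schema.items()
        if k in args and not isinstance(args[k], t)
    ]
    return errors  # empty list means the call may proceed

ok = validate_args(PAYROLL_ADJUST_SCHEMA,
                   {"employee_id": "e-42", "amount_cents": 5000, "reason": "bonus"})
bad = validate_args(PAYROLL_ADJUST_SCHEMA,
                    {"employee_id": "e-42", "amount_cents": "5000"})
print(ok)   # []
print(bad)  # ['missing: reason', 'bad type: amount_cents']
```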
Start in suggestion mode
Auto-commit should be earned. Start with draft mode, approval queues, and canary environments for destructive actions. Move to full automation when the reliability data says you can.
Instrument everything
You want trace_ids that tie together prompt, retrieval hits, model version, tool calls, policy checks, and final writes. Otherwise debugging turns into archaeology.
And yes, cost logs matter as much as error logs.
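A trace record that ties those pieces together can be a plain structured log. The event schema below is an assumption, not a standard; the point is one `trace_id` spanning retrieval, tool calls, and the final write.

```python
import json
import uuid

# One trace record per agent run; the event shape is illustrative.
def new_trace(prompt_id: str, model: str) -> dict:
    return {"trace_id": str(uuid.uuid4()), "prompt_id": prompt_id,
            "model": model, "events": []}

def log_event(trace: dict, kind: str, **data):
    trace["events"].append({"kind": kind, **data})

trace = new_trace("triage.v3", "small")
log_event(trace, "retrieval", hits=4)
log_event(trace, "tool_call", tool="create_ticket", cost_usd=0.0007)
log_event(trace, "commit", record_id="t-991")
print(json.dumps(trace["events"][-1]))  # the write, joined to everything before it
```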
Levie’s framing fits how enterprises actually work
Enterprises do not need another speech about AI replacing software. They need systems that survive audits, pricing reviews, security reviews, and the usual mess of production.
That’s why Levie’s framing works. SaaS remains the system of record. AI becomes a layer on top that speeds things up without taking unchecked control.
For engineers, that’s a practical direction. Build deterministic cores. Give agents narrow tools. Meter everything. Keep the stochastic parts near the decision boundary and away from the commit path.
The teams that get this right probably won’t have the flashiest demos. They’ll be the ones still standing after procurement, security, and finance are done with them.
Useful next reads and implementation paths
If this topic connects to a real workflow, these links give you the service path and a proof point.
Design agentic workflows with tools, guardrails, approvals, and rollout controls.
How AI-assisted routing cut manual support triage time by 47%.