Scaling AI agents on Google Cloud from demo to production
Google Cloud’s Iliana Quinonez sketches the stack AI agents actually need
AI startups keep running into the same problem: a polished demo says very little about whether the system will hold up in production. At TechCrunch Sessions: AI on June 5 at UC Berkeley, Google Cloud’s Iliana Quinonez is set to talk about that gap. That’s the useful part.
The framing is practical. Quinonez is focusing on reasoning-capable agents, the data pipelines behind them, and where proprietary IP still lives when the model layer comes from someone else. Those are the right questions. A lot of teams are still debating model choice while building fragile systems around it.
Google Cloud’s pitch, based on the preview, is a modular stack: Cloud Run and Pub/Sub for orchestration, Vertex AI for model endpoints and embeddings, Firestore or Memorystore for state, and a vector database for retrieval. None of that is surprising by itself. The interesting part is the assumption underneath it. Agent systems should be built like event-driven applications that happen to include ML, not thin wrappers around a model API.
The hard part sits around the model
A basic LLM app can survive on a request-response loop. Prompt in, output out, maybe retrieval in the middle. Agents break that quickly.
Once a system has to keep state, choose between tools, retry failures, or pass work across sub-agents, the model stops being the main engineering problem. Orchestration takes over. That’s where systems either stay maintainable or turn into prompt spaghetti.
Quinonez’s reference architecture follows the pattern most serious teams are converging on:
- an API gateway for auth, rate limits, and telemetry
- an orchestration service for tasks and context
- a vector store for retrieval
- function services for domain actions
- a model endpoint for generation and reasoning
That sounds obvious until you see how many teams still cram all of it into one Python service with a few async functions.
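The split above can be sketched as plain service boundaries. This is a minimal illustration with hypothetical names, and a fake model call standing in for a real endpoint; it only shows where auth, routing, and domain logic live, not a real deployment:

```python
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    query: str

class Gateway:
    """Auth, rate limits, and telemetry live here, not in the agent loop."""
    def __init__(self, orchestrator, allowed_users):
        self.orchestrator = orchestrator
        self.allowed = set(allowed_users)
        self.calls = 0  # stand-in for real telemetry

    def handle(self, req):
        if req.user not in self.allowed:
            return "403"
        self.calls += 1
        return self.orchestrator.run(req)

class Orchestrator:
    """Owns routing: use a matching tool if one exists, else ask the model."""
    def __init__(self, tools, model):
        self.tools = tools    # name -> callable domain service
        self.model = model    # generation/reasoning endpoint

    def run(self, req):
        for name, tool in self.tools.items():
            if name in req.query:
                return tool(req.query)
        return self.model(req.query)

def inventory_tool(query):
    return "inventory: 42 units"      # stand-in for a function service

def fake_model(prompt):
    return f"model answer for: {prompt}"

gateway = Gateway(Orchestrator({"inventory": inventory_tool}, fake_model),
                  allowed_users=["alice"])
print(gateway.handle(Request("alice", "check inventory levels")))
# -> inventory: 42 units
```

Each class maps to a separate service in the reference architecture; collapsing them into one process is exactly the one-Python-service failure mode described above.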
Cloud Run plus Pub/Sub is a sensible startup choice for orchestration. It gives you loose coupling early without paying the Kubernetes tax on day one. Pub/Sub helps with bursts, retries, and fan-out. Cloud Run keeps ops light. The trade-off is latency. Every hop adds delay, and multi-step agents already spend time on tool calls, retrieval, and inference. If you need sub-second UX, this setup takes discipline. If you’re building enterprise workflows that can tolerate a few seconds, it’s much easier to justify.
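A quick back-of-envelope makes the latency point concrete. The numbers below are illustrative assumptions, not benchmarks, but they show why a sub-second budget gets tight once a couple of queue hops sit in the request path:

```python
# Rough latency budget for one multi-step agent request (seconds).
# All figures are invented for illustration.
hops = {
    "gateway": 0.005,
    "pubsub_hop": 0.100,       # publish + delivery, per hop
    "retrieval": 0.150,
    "tool_call": 0.200,
    "model_inference": 1.200,
}

# Two Pub/Sub hops: into the orchestrator, then out to a tool worker.
total = (hops["gateway"] + 2 * hops["pubsub_hop"] +
         hops["retrieval"] + hops["tool_call"] + hops["model_inference"])

print(f"{total:.3f}s")  # -> 1.755s, well past a sub-second budget
```

Inference dominates, but the queue hops alone already eat a fifth of a second before any model work happens.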
Reasoning-capable agents need memory, tools, and routing
This is usually where descriptions get vague. The implementation details here are more concrete than that.
A reasoning-capable agent needs three things beyond a base model:
State management
You need somewhere to keep conversation history, workflow progress, tool outputs, and often user-specific context. Firestore or Memorystore can both work, depending on whether persistence and queryability matter more than raw low-latency access.
The hard part isn’t storage alone. It’s deciding what stays in short-term context, what gets summarized, and what belongs in longer-lived memory. Teams that ignore that usually watch token costs climb while answer quality gets worse.
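One way to make that decision concrete is a context policy: keep the last few turns verbatim, summarize everything older, and cap the total. A minimal sketch, where `summarize` is a hypothetical stand-in for an LLM summarization call:

```python
# Illustrative context policy. In production the summary would come
# from a model call and the eviction target would be long-term memory
# in Firestore or similar; both are faked here.

def summarize(turns):
    return "summary of " + str(len(turns)) + " earlier turns"

def build_context(history, keep_recent=4, budget_chars=500):
    recent = history[-keep_recent:]          # stays verbatim
    older = history[:-keep_recent]           # gets compressed
    parts = ([summarize(older)] if older else []) + recent
    context = "\n".join(parts)
    # If still over budget, this is where long-term memory takes over.
    return context[:budget_chars]

history = [f"turn {i}" for i in range(10)]
print(build_context(history))
```

Even this crude version bounds token growth; without a policy like it, context grows with every turn and costs climb exactly as described above.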
Tool integration
Agents are useful when they can do concrete work: query inventory, fetch market data, check a CRM record, run an internal function. That means building and securing tool interfaces, not writing smarter prompts.
Google’s proposed setup puts those domain actions in separate microservices. That’s the right call if you care about auditability and access control. A function like “check supply chain status” should behave like any production service, with logs, IAM, versioning, and failure handling. Hiding that logic inside an opaque chain definition is asking for trouble.
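As a sketch of what "behave like any production service" means in code, here is a hypothetical supply-chain tool with logging, an access check, and explicit failure handling. The role names and the lookup itself are invented for illustration:

```python
import logging

log = logging.getLogger("tools.supply_chain")

def check_supply_chain_status(caller_role: str, sku: str) -> dict:
    # Access control at the service boundary, not in a prompt.
    if caller_role not in {"agent-orchestrator", "ops"}:
        log.warning("denied caller_role=%s sku=%s", caller_role, sku)
        raise PermissionError("caller not allowed to query supply chain")

    log.info("lookup sku=%s", sku)
    try:
        # Stand-in for a real inventory/ERP lookup.
        status = {"sku": sku, "status": "in_transit"}
    except Exception:
        log.exception("lookup failed sku=%s", sku)
        raise  # fail loudly; don't let the agent invent an answer
    return status

print(check_supply_chain_status("agent-orchestrator", "SKU-123"))
```

The point is that denial, success, and failure each leave a trace. A chain definition that swallows all three gives you none of that audit trail.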
Decision logic
The orchestration service decides when to call a tool, when to ask a sub-agent, and when to stop. That matters because agent failures often come from routing, not the model. The system picks the wrong tool, loops too long, or keeps asking the model to reason about data it should have fetched directly.
This is also one of the few places teams can still build defensible IP. The model API is rented. Task decomposition, confidence heuristics, execution policies, and a domain-specific tool graph are not.
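A toy version of that routing layer makes the point. The keyword scoring below is deliberately crude; the part this sketch fakes, the scoring and stop policies, is exactly where the defensible IP lives:

```python
# Illustrative decision logic: score tools against the task, cap the
# loop, and stop once a confidence threshold is met. Real systems
# replace the keyword heuristic with learned or rule-based policies.

def route(task, tools, max_steps=5, confidence_threshold=0.8):
    trace, confidence = [], 0.0
    for _ in range(max_steps):           # hard loop cap: no runaways
        name, score = max(
            ((n, 1.0 if n in task else 0.1) for n in tools),
            key=lambda pair: pair[1],
        )
        trace.append(name)
        confidence = max(confidence, score)
        if confidence >= confidence_threshold:
            break                        # stop condition, not vibes
    return trace, confidence

print(route("fetch market data", ["market", "crm"]))
# -> (['market'], 1.0)
```

Note what the loop cap buys: when nothing scores well, the system stops after five steps with low confidence instead of looping forever, which is the failure mode described above.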
The data pipeline matters more than it gets credit for
The source material also points to a fairly standard Google Cloud pipeline: Pub/Sub to Dataflow to Cloud Storage, with batch transforms in Dataflow or Spark on Dataproc, plus Vertex AI Feature Store and metadata tracking.
That may sound ordinary next to agent infrastructure. It isn’t. Agent quality depends heavily on whether the system can trust its own data.
Messy logs, stale retrieval corpora, and drift between online and offline features all produce failures that look like hallucinations. They’re often plain data plumbing problems. Those are harder to spot because the system still replies in fluent prose.
The inclusion of feature stores and lineage tooling stands out. In classic ML, feature stores addressed training-serving skew. In agent systems, they keep the structured signals around the model consistent across environments: customer tier, fraud flags, support history, inventory state, or market snapshots.
That also matters for compliance. If an agent makes a recommendation in finance or healthcare, “the model said so” won’t satisfy anyone. Metadata, lineage, and reproducibility stop looking like MLOps ceremony and start looking like table stakes.
The sample Apache Beam pipeline in the source material is simple, but it makes the point well. Stream logs from Pub/Sub, parse them, filter successful events, write processed data to Cloud Storage. Real systems will add schema validation, dead-letter queues, PII scrubbing, and probably warehouse sinks for analytics. The pattern is still the same: treat agent telemetry as production data, not debug exhaust.
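The same shape can be shown without Beam at all. This plain-Python mirror of the transforms (parse, dead-letter malformed records, keep successes) is only a sketch; in the real pipeline these steps would be Beam DoFns fed by Pub/Sub and writing to Cloud Storage:

```python
import json

def process(raw_events):
    """Parse log events, dead-letter anything malformed, keep successes."""
    processed, dead_letter = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)              # parse
            if "status" not in event:
                raise ValueError("missing status")  # schema validation
        except (json.JSONDecodeError, ValueError):
            dead_letter.append(raw)              # dead-letter queue stand-in
            continue
        if event["status"] == "success":         # filter successful events
            processed.append(event)
    return processed, dead_letter

ok, dlq = process(['{"status": "success", "agent": "a1"}',
                   '{"status": "error"}',
                   'not json'])
print(len(ok), len(dlq))  # -> 1 1
```

The malformed record lands in the dead-letter list instead of silently vanishing, which is the difference between telemetry as production data and telemetry as debug exhaust.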
Where the IP line sits
Quinonez’s comments on protecting core IP are probably the sharpest part of the blueprint.
Most startups building on public LLM APIs are renting the most expensive layer in the stack. They should be careful about what else they give away. If the company’s value lives in prompt chains stuffed into app code, that’s weak protection. If the value lives in a controlled orchestration layer, proprietary tools, private data products, and internal policies behind secure services, that’s harder to copy.
The recommendations are standard and sensible:
- keep sensitive prompts and chains in Secret Manager or Vault
- version-control them with GitOps practices
- put proprietary services behind VPC Service Controls
- abstract model calls behind an internal interface so providers can be swapped
That last point matters. Model portability is never free. Providers differ on context windows, tool-calling semantics, latency, quotas, and fine-tuning support. An abstraction layer still helps because it stops application code from hardwiring itself to one vendor’s API quirks. You may not get true plug-and-play portability, but you can avoid a painful rewrite.
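That abstraction layer can be as small as an internal protocol plus one adapter per vendor. The provider classes below are fakes for illustration; real adapters would wrap the actual vendor SDKs and normalize their response shapes:

```python
from typing import Protocol

class ModelClient(Protocol):
    """Internal interface the application codes against."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class VendorAAdapter:
    def generate(self, prompt: str, max_tokens: int) -> str:
        # Would call vendor A's SDK and map its response shape.
        return f"A:{prompt[:max_tokens]}"

class VendorBAdapter:
    def generate(self, prompt: str, max_tokens: int) -> str:
        # Vendor B may differ on context windows or tool-calling;
        # the adapter absorbs those differences, not the app code.
        return f"B:{prompt[:max_tokens]}"

def answer(client: ModelClient, question: str) -> str:
    return client.generate(question, max_tokens=32)

print(answer(VendorAAdapter(), "hello"))  # swapping vendors is one line
```

This doesn't make providers interchangeable, for the reasons above, but it confines the vendor-specific mapping to one file instead of scattering it through application code.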
Managed services help, and they narrow the path
Google Cloud is pushing a stack that mixes managed AI services with open-source components and custom services. For a lot of startups, that’s a reasonable compromise. You move faster on infrastructure you don’t want to own, while keeping product-defining pieces under tighter control.
There’s a price.
Managed services remove a lot of operational work, but they also pull architecture toward whatever the cloud vendor supports best. If your agent pipeline depends heavily on Vertex AI embeddings, Feature Store, Metadata, and Google-native security controls, switching later gets harder. The upside is speed and integrated governance. The downside is that your roadmap starts inheriting platform assumptions.
That’s fine, as long as nobody pretends it’s neutral.
What senior teams should watch
For engineering leads and staff-level developers, the useful signal here is the operating model.
A few things stand out:
- Event-driven design is becoming the default for agent workflows with more than one or two tool calls.
- Observability has to go past model latency and cover routing decisions, context growth, tool success rates, and retries.
- Security boundaries show up earlier because agents touch more internal systems than a typical chatbot.
- Cost accounting needs stage-level visibility so you can see whether spend comes from embeddings, retrieval, orchestration churn, or inference.
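The last bullet, stage-level cost visibility, mostly comes down to tagging every spend event with its pipeline stage before aggregating. A minimal sketch with invented prices:

```python
from collections import defaultdict

costs = defaultdict(float)

def record(stage: str, amount_usd: float):
    """Attribute each spend event to a pipeline stage."""
    costs[stage] += amount_usd

# One task's worth of hypothetical spend events:
record("embeddings", 0.0001)
record("retrieval", 0.0004)
record("orchestration", 0.0002)   # queue churn, retries
record("inference", 0.0120)
record("inference", 0.0080)       # a second model call in the same task

for stage, total in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{stage:14s} ${total:.4f}")
```

Without per-stage tags, all of this rolls up into one cloud bill line and nobody can say whether spend is inference or orchestration churn.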
For teams building support agents, internal copilots, or workflow automation, the work keeps snapping back to software engineering basics: service boundaries, telemetry, data hygiene, and clear failure modes. The agent label doesn’t erase any of that. It raises the cost of doing it badly.
For data scientists, the center of gravity is shifting too. Model quality still matters. Production agent systems also demand a lot more attention to lineage, drift, latency budgets, and evaluation against multi-step tasks. Scoring a single model output isn’t enough when the user-visible result depends on five intermediate decisions and three service calls.
Quinonez’s session looks useful because it stays anchored in that reality. AI agents usually fail because the surrounding system can’t carry the load. That’s an infrastructure problem before it’s a model problem.
Useful next reads and implementation paths
If this topic connects to a real workflow, these links give you the service path, a proof point, and related articles worth reading next.
Design agentic workflows with tools, guardrails, approvals, and rollout controls.
How AI-assisted routing cut manual support triage time by 47%.