OpenAI’s long outage shows where AI platforms still break under pressure
OpenAI’s partial outage this week hit three services developers actually use: ChatGPT, Sora, and the API. For teams on the U.S. West Coast, it landed right in the middle of the workday and dragged on much longer than OpenAI’s usual sub-two-hour incidents.
The timeline is fairly clear. OpenAI saw elevated latency and error rates late Monday night. By about 5:30 a.m. PT Tuesday, engineers had traced the problem to overloaded request handling ahead of the GPU layer. Recovery still took until midday. That matters. Finding the fault was only part of the job. Clearing the backlog without hitting another limit was the harder part.
This looks like an ingress problem
Outage chatter around AI services tends to collapse into one explanation: they ran out of GPUs. That doesn’t fit especially well here.
The available details point to saturation in the request-ingress path, the systems that accept and route traffic before it reaches inference workers. API gateways, load balancers, queues, admission logic. When that layer clogs up, requests fail at the front door even if some inference capacity still exists behind it.
That matches the reported symptoms:
- elevated latency before outright failures
- “Too many concurrent requests” style errors
- a high-priority queue backlog
- slow recovery after fixes were deployed
If the gateway and queueing path are jammed, adding workers helps less than people think. Traffic is still moving through a narrow pipe, and a backlog of high-priority jobs can keep newer requests waiting far longer than expected.
That’s a familiar distributed-systems failure mode. AI just makes it pricier.
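A toy simulation makes the point concrete. This is an illustrative sketch, not a model of OpenAI's actual system: a fixed per-tick admission cap at the ingress feeds workers behind it. Once the ingress is the bottleneck, the backlog grows at the same rate whether the worker pool is small or huge.

```python
def simulate(arrival, ingress_cap, worker_cap, ticks):
    """Discrete-time sketch: arrivals hit an ingress that can admit only
    `ingress_cap` requests per tick; admitted requests go to workers that
    can serve `worker_cap` per tick. Returns (ingress backlog, served)."""
    ingress_backlog = 0
    served = 0
    for _ in range(ticks):
        ingress_backlog += arrival
        admitted = min(ingress_backlog, ingress_cap)
        ingress_backlog -= admitted
        served += min(admitted, worker_cap)
    return ingress_backlog, served

# 120 arrivals/tick against an ingress cap of 100/tick: the backlog grows
# by 20 per tick no matter how much worker capacity sits behind the gate.
few_workers = simulate(arrival=120, ingress_cap=100, worker_cap=100, ticks=60)
many_workers = simulate(arrival=120, ingress_cap=100, worker_cap=1000, ticks=60)
# Both runs end with the same backlog of 1200 queued requests.
```

Scaling the backend tenfold changes nothing here, which is exactly the trap: the dashboard says there is spare inference capacity while the front door keeps rejecting traffic.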
A plausible chain of failure
OpenAI hasn’t published a full postmortem, so there’s only so far you can go. Still, the architecture described in the reporting is common enough to sketch the likely path.
A large-scale inference stack usually looks like this:
- API gateway terminates TLS, authenticates requests, and enforces rate limits.
- Load balancers spread traffic across edge services and availability zones.
- Queues buffer bursts when inference workers fall behind.
- Kubernetes-managed workers run model-serving endpoints tied to GPU resources.
- Autoscalers try to add pods and, if needed, nodes.
The weak points are obvious:
- bursty traffic can overwhelm queue admission before workers can drain it
- aggressive rate limiting can turn a slowdown into a broad rejection event
- conservative autoscaler thresholds react too late
- GPU node cold starts take minutes
- cluster quotas limit how much elasticity you actually have
The source material points to all of that. Horizontal Pod Autoscaler thresholds around 70% CPU or GPU utilization are fine for cost control, but they’re slow under sharp bursts. If node-level GPU caps are also in play, scaling workers doesn’t buy much because the cluster can’t add real capacity fast enough. And if new GPU nodes take two to three minutes to come online, the queue can snowball before the system catches up.
That’s probably why the incident ran so long. Locating the problem is one thing. Recovering from a queue collapse while live traffic keeps coming is harder.
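Back-of-envelope arithmetic shows how fast the hole gets deep. The numbers below are assumptions for illustration, not figures from the incident:

```python
# Assumed numbers, not from the incident: how much backlog piles up
# while new GPU nodes cold-start, and how long it takes to drain.

overflow_rps = 500   # requests/s arriving beyond current serving capacity
cold_start_s = 150   # ~2.5 minutes for a GPU node batch to come online
added_rps = 200      # net drain rate the new nodes add once they are up

backlog = overflow_rps * cold_start_s   # requests queued before help arrives
drain_time_s = backlog / added_rps      # drain time, assuming the burst subsides

print(backlog)       # 75000 requests queued during the cold start alone
print(drain_time_s)  # 375.0 seconds just to work off that backlog
```

And that drain estimate is optimistic: it assumes the burst has already subsided. If overflow traffic keeps arriving faster than the new capacity drains it, the queue never recovers without shedding load.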
Inference still scales badly
This outage is a useful reminder that inference infrastructure behaves differently from ordinary stateless web traffic.
You can scale API servers cheaply and fast. Large-model inference doesn’t work that way. GPU-backed workers are expensive, slower to start, and usually constrained by quota, scheduler policy, and physical supply. Add multimodal workloads like Sora, which can be much heavier than text generation, and shared platform pressure rises quickly.
So providers live with a constant trade-off between utilization and headroom.
Run the fleet hot and the economics look good until a burst hits. Keep wide safety margins and the cost model gets ugly. Every AI provider is dealing with that, including the biggest ones. OpenAI just had to do it in public.
There’s also a control-plane question. The outage was partial, which suggests the issue may not have been globally uniform. Regional bottlenecks, or trouble shifting traffic between regions, could easily produce this kind of patchy but stubborn degradation. If cross-region failover were seamless, a local saturation event should have been shorter and more contained.
That’s where the comparison to hyperscalers gets uncomfortable. Google and AWS have spent years hardening global traffic steering and failover under ugly real-world conditions. Model serving adds its own complications, but customers are starting to expect similar behavior from AI providers. Fair enough.
For developers, the lesson is dull and expensive
If your product depends on one external AI endpoint, you need a real failure plan.
Too many teams still wire LLM calls into production as if they were talking to a mildly flaky SaaS API. That falls apart when the upstream service degrades for half a workday.
A few patterns matter right away.
Retry logic has to be disciplined
Blind retries are a good way to help finish off a service that’s already struggling. If you retry, use capped exponential backoff with jitter and pay attention to the provider’s error semantics.
```python
import random
import time

def call_with_backoff(fn, max_retries=5):
    """Retry fn with capped exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                break  # don't sleep after the final failure
            # Cap the delay at 16s; jitter spreads out retry stampedes.
            delay = min(2 ** attempt, 16) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError("Retries exhausted")
```
That’s baseline hygiene. It won’t save you from a multi-hour outage, but it does reduce self-inflicted damage.
Fallback models are worth the pain
If one model tier fails, can you route some traffic to a smaller model, another endpoint, or even a second provider? For plenty of use cases, degraded quality is better than total failure.
That only works if you’ve classified requests by importance. A summarization feature can often fall back. A safety classifier or a production agent step may need stricter guarantees.
Stop treating all prompts as equal.
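A minimal sketch of what that classification looks like in code. The model names and the `call_model` helper here are hypothetical placeholders; substitute your own client and tiers:

```python
# Hypothetical fallback chains keyed by request type. Summarization may
# degrade to a smaller model or a second provider; a safety check may not.
FALLBACK_CHAINS = {
    "summarize": ["primary-large", "primary-small", "secondary-provider"],
    "safety_check": ["primary-large"],  # no silent degradation allowed
}

def route(request_type, prompt, call_model):
    """Try each model in the chain for this request type until one succeeds."""
    last_error = None
    for model in FALLBACK_CHAINS[request_type]:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # in real code, catch provider errors only
            last_error = exc
    raise RuntimeError(f"All fallbacks failed for {request_type}") from last_error
```

The interesting part isn't the loop, it's the table: deciding per request type which degradations are acceptable, before the outage forces the decision.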
Queue priority should match the business
A lot of teams send interactive traffic and batch jobs through the same lane. Then an incident hits and background workloads keep fighting with customer-facing requests.
Separate them. Put hard ceilings on low-priority batch work. If the upstream provider starts wobbling, protect requests tied to live user sessions.
This is basic service design. AI teams often skip it because the prototype ships and nobody comes back to clean up the plumbing.
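As a sketch of the lane separation above, assuming a single dispatcher loop with illustrative names:

```python
from collections import deque

def dispatch(interactive, batch, capacity, batch_ceiling):
    """Serve up to `capacity` requests this tick. Interactive traffic goes
    first; batch work never takes more than `batch_ceiling` slots."""
    served = []
    while interactive and len(served) < capacity:
        served.append(interactive.popleft())
    batch_used = 0
    while batch and len(served) < capacity and batch_used < batch_ceiling:
        served.append(batch.popleft())
        batch_used += 1
    return served

live = deque(["chat-1", "chat-2"])
jobs = deque(["reindex-1", "reindex-2", "reindex-3"])
# With capacity 4 and a batch ceiling of 1, live traffic stays protected:
print(dispatch(live, jobs, capacity=4, batch_ceiling=1))
# → ['chat-1', 'chat-2', 'reindex-1']
```

The ceiling is the part teams forget: without it, batch work expands to fill whatever capacity exists, and the first casualty of an upstream slowdown is a live user session.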
The Kubernetes detail matters
One of the more revealing details here is that worker pod scaling was reportedly slowed by Kubernetes autoscaler configuration limits.
That sounds ordinary. It isn’t.
Kubernetes is fine at orchestrating containers. It does nothing to solve GPU scarcity, startup lag, topology constraints, or queue behavior under burst load. Inference systems that look clean on a diagram still get trapped by very mundane settings: max node group size, scale-up cooldowns, utilization thresholds, pod disruption budgets, taints and tolerations, GPU device plugin limits.
A lot of AI platform reliability comes down to whether those defaults were tuned for bad days instead of average ones.
“Just autoscale it” remains one of the emptiest phrases in ML infrastructure.
This will strengthen the case for multi-provider setups
Not every company can justify multi-provider inference. It gets messy fast: model behavior differences, prompt portability issues, evaluation overhead, cost tracking, vendor-specific tooling. Still, outages like this make the argument easier.
If one provider handles your primary path, an alternate may be worth having for:
- low-latency fallback
- lower-tier service continuity
- non-critical batch rerouting
- region-specific resilience
Open-source models also get a lift from incidents like this, especially for internal workloads where exact parity with frontier models doesn’t matter. A self-hosted or managed open-weight fallback won’t match the best commercial models across the board, but it does give teams a pressure-release valve.
That matters because AI has moved well past the demo layer. It now sits inside customer support flows, dev tools, internal copilots, search, document processing, analytics, and product UX. Once that happens, downtime stops being embarrassing and starts costing real money.
What to audit now
You don’t need to wait for OpenAI’s postmortem to fix the obvious gaps.
Check:
- timeout settings for model calls
- retry logic and whether it amplifies failures
- fallback behavior by request type
- provider-specific circuit breakers
- queue separation between interactive and batch traffic
- user-facing degradation modes
- dashboards for p95 and p99 latency, queue depth, and upstream error rates
- your recovery plan if the provider is impaired for four hours instead of four minutes
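For the circuit-breaker item on that list, a minimal sketch of the idea. Thresholds are placeholders, and a production version also needs locking, per-endpoint state, and metrics:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated upstream failures instead of piling on."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            # While open, reject immediately rather than hammer the upstream.
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open after the cooldown: let one probe request through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The point of the breaker during an incident like this one is as much about the provider as about you: failing fast stops your retries from feeding the very backlog that's keeping the service down.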
Most teams have some of this. Fewer have all of it wired together and tested under load.
That’s what this outage exposed. The question isn’t only whether OpenAI can scale fast enough. It’s whether customers have built systems on the assumption that a major model provider will occasionally have a bad day.
They will. This week made that harder to ignore.