Tigris Data is betting AI storage should follow GPUs, not the cloud region
AI infrastructure has already come apart into pieces. Teams train on CoreWeave, fine-tune on Lambda, run inference somewhere else, and watch spot pricing the whole time. Compute moves around far more easily than it did a few years ago.
Data usually doesn't.
That’s the opening Tigris Data is chasing with a fresh $25 million Series A led by Spark Capital, with Andreessen Horowitz participating again. The startup, founded by the team behind Uber’s storage platform, says it has more than 4,000 customers and is building a distributed storage network aimed squarely at AI workloads. Its current footprint includes Virginia, Chicago, and San Jose. London, Frankfurt, and Singapore are next.
The pitch is straightforward: put storage near the GPUs you actually use, replicate data across locations, and stop paying every time your infrastructure crosses clouds or regions.
That argument works because the old model is getting expensive in two ways. One is latency. If your data lives in one cloud region and your GPUs are somewhere else, every training run and inference path starts with a bad compromise. The other is egress. Big cloud providers have spent years turning data gravity into a revenue stream. Once your datasets are parked in their object store, moving them gets slow and expensive fast.
Tigris wants storage to behave more like a distributed system and less like a central bucket.
Why the timing makes sense
This is not some generic storage company trying to slap AI branding on object storage. The timing is real.
Top-end AI compute is now a fragmented market. Plenty of teams no longer assume AWS, Azure, or Google Cloud is where training jobs should run. Specialized GPU clouds, regional providers, private clusters, and hybrid setups are common. Inference is even more spread out because latency and geography matter more. Agentic workloads add another problem: lots of small reads, frequent writes, and data scattered across regions.
That breaks a lot of assumptions behind classic cloud object storage. Those systems are good at durable, cheap, centralized storage. They’re less good when workloads keep moving and datasets need to sit close to whichever GPU cluster got the job this week.
Fal.ai, one of Tigris's customers, is the bluntest data point: according to Tigris, egress used to be the majority of Fal.ai's cloud spend. If that's even roughly true across similar AI companies, the problem is obvious. You can spend months optimizing model serving and still lose on data movement.
What Tigris appears to be building
Tigris describes its system as AI-native distributed storage. That phrase gets abused, so the useful question is what it means in practice.
Based on what the company says it can do, the product probably rests on a few core ideas.
Compute-aware data placement
The important part is scheduler-aware replication. If your jobs land in a CoreWeave zone today and a different cluster tomorrow, the storage layer should stage or replicate the right datasets there before GPUs sit idle waiting on reads.
That implies some kind of control plane that understands placement policies and dataset metadata. Think rules like:
- keep three replicas across us-east and eu-central
- pin the newest fine-tune shards near A100 capacity
- keep agent state close to user-facing inference regions
If Tigris gets this right, it cuts one of the nastiest costs in AI systems: expensive accelerators waiting around for data.
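Tigris hasn't published a policy API, so here is a minimal sketch of what such rules might look like. The `PlacementRule` type, region names, and dataset prefixes are all invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical placement policy sketch, not Tigris's actual API.
# Region names, prefixes, and fields are illustrative.

@dataclass
class PlacementRule:
    dataset_prefix: str               # which object keys the rule covers
    regions: list[str]                # where replicas must live
    min_replicas: int = 1             # durability floor
    pin_near_gpu: str | None = None   # e.g. "a100" capacity pools

POLICIES = [
    PlacementRule("datasets/common-crawl/", ["us-east", "eu-central"], min_replicas=3),
    PlacementRule("checkpoints/finetune-latest/", ["us-east"], pin_near_gpu="a100"),
    PlacementRule("agent-state/", ["us-east", "eu-west", "ap-southeast"]),
]

def regions_for(key: str) -> list[str]:
    """Return the replica regions required for an object key."""
    for rule in POLICIES:
        if key.startswith(rule.dataset_prefix):
            return rule.regions
    return ["us-east"]  # default home region
```

The interesting engineering is everything around that table: knowing where the scheduler will land a job, and moving the data before the job asks for it.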
Low-latency reads, especially at the tail
Training jobs can stream huge files. Agent pipelines and RAG systems often do the opposite. They hit storage with lots of tiny objects, metadata lookups, embeddings, and intermediate artifacts. A storage platform built for AI has to handle both without falling apart at p95 and p99 latency.
That usually points to hot NVMe tiers for active data, aggressive prefetching, and fast I/O paths like io_uring or kernel-bypass networking. In-region RDMA or RoCEv2 would also fit the performance story if the backend is tuned for it. Tigris hasn’t published a full architecture doc, so some of this is inference rather than confirmation.
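To make the prefetching idea concrete, here's a minimal sketch that stages objects onto local NVMe before a GPU job starts, assuming an S3-compatible endpoint. The endpoint URL, bucket name, and cache path are placeholders, not anything Tigris documents:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3  # works against any S3-compatible endpoint

# Placeholder endpoint and bucket; Tigris exposes an S3-compatible API,
# but the URL and credentials here are illustrative.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

CACHE_DIR = "/mnt/nvme/cache"  # assumed hot local NVMe tier
os.makedirs(CACHE_DIR, exist_ok=True)

def prefetch(key: str) -> str:
    """Stage one object onto local NVMe before the GPU job needs it."""
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(local_path):
        s3.download_file("training-data", key, local_path)
    return local_path

# Warm the cache in parallel while the job is still queued.
shard_keys = [f"datasets/shard-{i:05d}.tar" for i in range(64)]
with ThreadPoolExecutor(max_workers=16) as pool:
    local_paths = list(pool.map(prefetch, shard_keys))
```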
Still, the claim that it supports billions of small files says a lot about where the engineering work is going.
The small-files problem
This is one of the least glamorous parts of AI infrastructure, and one of the most painful.
Object stores tend to struggle when workloads generate huge numbers of tiny artifacts. AI pipelines do this constantly: shards, embeddings, checkpoints, tokenizer fragments, feature tiles, logs, temporary outputs. Capacity isn’t the issue. Metadata overhead is.
A serious design here probably includes sharded metadata services, in-memory indices, write-ahead logs, and some kind of small-object packing to cut metadata amplification. S3-compatible APIs matter too, because nobody wants to rewrite half the stack just to test a new storage backend.
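The packing idea itself is simple enough to sketch. In this toy format, invented for illustration, many tiny blobs become one large object plus an offset index, so the backend tracks one key instead of thousands:

```python
import json

def pack(blobs: dict[str, bytes], out_path: str) -> None:
    """Write blobs into a single pack file with a JSON offset index."""
    index, offset = {}, 0
    with open(out_path, "wb") as f:
        for name, data in blobs.items():
            f.write(data)
            index[name] = (offset, len(data))
            offset += len(data)
    with open(out_path + ".idx", "w") as f:
        json.dump(index, f)

def read_one(pack_path: str, name: str) -> bytes:
    """Random-access read of a single packed blob via the index."""
    with open(pack_path + ".idx") as f:
        offset, length = json.load(f)[name]
    with open(pack_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

A production design would do this inside the storage layer with write-ahead logging and compaction, but the metadata math is the same: one index entry per pack instead of one per blob.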
If Tigris can keep strong S3-compatible support while holding down tail latency, that likely matters more than any AI-native label in the sales deck.
The hard parts it still has to prove
The idea is solid. Making it work cleanly is harder.
Distributed storage turns trade-offs into product decisions. Move data closer to compute and you create more replication traffic and more consistency headaches. Optimize for fast local reads and global writes get harder. Promise low latency and noisy-neighbor problems still show up.
There are four areas where technical buyers should stay skeptical.
Consistency
A platform like this probably has to offer strong read-after-write consistency within a region and weaker guarantees across regions. That's sensible. It's also where bad assumptions and ugly bugs tend to surface.
Training pipelines that mix readers and writers in the same working set need predictable semantics. Inference systems can often live with bounded staleness. Agents are messier because they mix interactive latency with stateful writes. If Tigris stays vague on guarantees, buyers should press for details.
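One way to make those guarantees concrete in application code is a bounded-staleness read: serve from the nearby replica only when its replication lag fits the workload's budget. The client interface below (`last_applied_ts`, `get`) is hypothetical, not a real Tigris API:

```python
import time

# Sketch of a bounded-staleness read. Assumes hypothetical replica
# objects exposing last_applied_ts() and get(); real guarantees
# depend on what the vendor actually promises.

STALENESS_BUDGET_S = 5.0  # what this workload can tolerate

def read(key: str, local_replica, primary):
    lag = time.time() - local_replica.last_applied_ts()
    if lag <= STALENESS_BUDGET_S:
        return local_replica.get(key)  # fast, possibly slightly stale
    return primary.get(key)            # slower, but fresh
```

The budget is the product decision: an inference cache might tolerate minutes, while a training pipeline mixing readers and writers might tolerate nothing at all.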
Cost accounting
“Cheaper than cloud egress” is a good line. Total cost is still messy. Replication, background rebalancing, private interconnects, API limits, and cross-provider traffic all add up.
A distributed storage layer can cut the bill. It can also move costs into places that are harder to predict. Good buyers will model total data motion, not just storage at rest and list-price bandwidth.
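A rough sketch of that modeling, with made-up volumes and rates standing in for whatever your bills actually show:

```python
# Back-of-the-envelope data-motion model. All numbers are illustrative;
# plug in your own volumes and negotiated rates. The point is to count
# every byte that moves, not just storage at rest.

monthly_tb = {
    "egress_to_gpu_cloud": 200,  # reads crossing provider lines
    "replication_traffic": 120,  # copies the storage layer makes
    "rebalancing":          30,  # background moves after failures
}

usd_per_tb = {  # illustrative only
    "egress_to_gpu_cloud": 90.0,
    "replication_traffic": 20.0,
    "rebalancing":         20.0,
}

total = sum(monthly_tb[k] * usd_per_tb[k] for k in monthly_tb)
print(f"data motion: ${total:,.0f}/month")  # storage at rest comes on top
```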
Security and sovereignty
Tigris’s expansion map matters here. London, Frankfurt, and Singapore are not just attractive AI infrastructure markets. They’re places where data residency rules shape architecture.
For enterprise buyers, this only works if the company supports the usual controls: mTLS, AES-256 at rest, customer-managed keys, audit logs, WORM-style immutability, and clear residency policies. “Your data stays where you say it stays” is now table stakes.
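On customer-managed keys specifically, the standard S3-API mechanism is SSE-C, where the client supplies the AES-256 key on every request and the provider stores only ciphertext. Whether Tigris supports SSE-C is an assumption here, and the endpoint and bucket names are placeholders:

```python
import os
import boto3

# SSE-C sketch: the caller supplies the key; boto3 handles the
# key-MD5 header. Standard S3 API surface; vendor support is an
# assumption, and names below are placeholders.
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")
key = os.urandom(32)  # in practice, fetched from your KMS

s3.put_object(
    Bucket="models",
    Key="finetune/adapter.bin",
    Body=b"...",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,
)

obj = s3.get_object(
    Bucket="models",
    Key="finetune/adapter.bin",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,  # same key required to read
)
```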
Operational integration
A storage platform like this gets better if the orchestrator can pass locality hints. Kubernetes annotations, Ray integration, batch schedulers, custom placement policies, all of that matters. Without it, you end up managing placement by hand, which defeats much of the point.
That’s also why POSIX support matters less than many enterprise buyers assume. S3-compatible access is the practical default for broad adoption. POSIX gateways help with older pipelines, but they usually bring overhead and complexity with them.
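That default is worth making concrete: against an S3-compatible backend, trying a new storage layer is mostly an endpoint and credential swap rather than a pipeline rewrite. The URL below is a placeholder:

```python
import boto3

# The practical meaning of S3 compatibility: the calling code stays
# identical and only the endpoint changes. URL is a placeholder.

def make_client(backend: str):
    endpoints = {
        "aws": None,                                 # boto3 default
        "candidate": "https://storage.example.com",  # backend under test
    }
    return boto3.client("s3", endpoint_url=endpoints[backend])

for backend in ("aws", "candidate"):
    client = make_client(backend)
    # Same calls either way, e.g.:
    # client.list_objects_v2(Bucket="training-data", Prefix="datasets/")
```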
What engineering teams should take from this
If you run AI workloads in one cloud region and expect that to stay true, Tigris may be overkill. Centralized object storage is still simple, cheap, and boring in the best sense.
If your workloads are already split across providers or regions, this category is worth watching.
A few practical points stand out:
- Pre-stage data before GPU jobs start. Idle accelerators cost more than extra storage copies.
- Version datasets like code. Immutable snapshots and content-addressed manifests matter for reproducibility, audits, and rollback (see the sketch after this list).
- Pack tiny artifacts where possible. Even smart storage backends pay for pathological small-object workloads.
- Set placement policies by workload. Training wants cheap nearby GPUs. User-facing agents want proximity to users and tight tail latency. Those are different rules.
- Test failure modes. Regional failover, stale reads, key rotation, and noisy-neighbor behavior matter more than benchmark charts.
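Here's what a content-addressed manifest can look like in miniature. Each file is named by its hash, and the manifest itself hashes to a single version ID you can pin in a training config; the file names are illustrative:

```python
import hashlib
import json

def manifest_for(files: dict[str, bytes]) -> tuple[str, dict[str, str]]:
    """Hash each file, then hash the sorted manifest into one version ID."""
    entries = {
        path: hashlib.sha256(data).hexdigest() for path, data in files.items()
    }
    canonical = json.dumps(entries, sort_keys=True).encode()
    version_id = hashlib.sha256(canonical).hexdigest()
    return version_id, entries

version, entries = manifest_for({
    "shards/000.tar": b"...",
    "shards/001.tar": b"...",
})
print(version[:12], len(entries))  # pin this ID in configs, not "latest"
```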
There’s a broader shift under this funding round. Storage is starting to look like a scheduler problem. The vendors that win here probably won’t be the cheapest S3 clones. They’ll be the ones that understand where jobs land, where data is allowed to live, and how much inconsistency a workload can tolerate.
That’s a harder business than selling buckets.
Tigris is making a credible bet that AI teams are done treating centralized storage as a fixed constraint. The company still has plenty to prove, especially on consistency guarantees, real-world performance, and cost transparency. But the thesis is strong because the pain is real. Once compute goes multi-cloud, storage has to follow, or it becomes the bottleneck you pay for twice.