Artificial Intelligence · October 16, 2025

Microsoft taps Nscale for 200,000 Nvidia GB300 GPUs across four sites

Microsoft’s Nscale deal shows where AI infrastructure is headed: fewer regions, bigger power, nastier bottlenecks

Microsoft has signed a large capacity deal with Nscale, the AI cloud and infrastructure company founded in 2024, to deploy about 200,000 Nvidia GB300-class GPUs across four sites in the US and Europe.

The topline is huge. The site list is what gives it shape:

  • Texas: roughly 104,000 GPUs over the next 12 to 18 months at a site leased from Ionic Digital, with a stated target of 1.2 GW at that location
  • Sines, Portugal: 12,600 GPUs starting in Q1 2026
  • Loughton, England: 23,000 GPUs beginning in 2027
  • Narvik, Norway: 52,000 GPUs at Microsoft’s AI campus

Nscale says the buildout combines its own deployments with a joint venture tied to investor Aker. The company is young, but it already has over $1.7 billion in backing from partners including Aker, Nokia, and Nvidia.

This is where AI capacity planning has gone. The question used to be how many H100s you could get. Now it's power procurement, cooling, network fabric, and whether the software stack can keep a giant cluster busy enough to justify the cost.

Start with the power

The 200,000-GPU figure grabs attention. The number that matters more is 1.2 gigawatts in Texas.

That tells you what these projects have become. They look a lot like utility infrastructure with GPUs layered on top. Buying Nvidia hardware is only one part of it. You also need grid interconnects, substations, liquid cooling, backup systems, fiber, and an ops team that can keep the whole thing from stalling because one layer of the stack slips.

The rough math is enough to make the point. If GB300-class GPUs draw somewhere around 1 to 1.2 kW each, accelerator TDP alone lands around 200 to 240 MW for 200,000 GPUs. That excludes CPUs, memory, switches, storage, power conversion losses, and cooling overhead. A 1.2 GW campus footprint starts to look less extravagant once you account for growth, redundancy, and the usual inefficiencies.
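As a sanity check on that math, here is a small back-of-the-envelope sketch. The per-GPU draw, host overhead factor, and PUE are assumptions for illustration, not published figures:

```python
# Back-of-the-envelope power estimate for a 200,000-GPU deployment.
# All inputs are assumptions for illustration, not vendor-published numbers.

NUM_GPUS = 200_000
GPU_TDP_KW = 1.1          # assumed per-accelerator draw, roughly 1.0 to 1.2 kW
HOST_OVERHEAD = 1.5       # CPUs, memory, NICs, switches, storage per GPU-equivalent
PUE = 1.3                 # power conversion losses plus cooling overhead

accelerator_mw = NUM_GPUS * GPU_TDP_KW / 1_000          # kW -> MW
it_load_mw = accelerator_mw * HOST_OVERHEAD             # whole-server and network load
facility_mw = it_load_mw * PUE                          # add cooling and conversion losses

print(f"accelerators only: {accelerator_mw:,.0f} MW")   # ~220 MW
print(f"IT load:           {it_load_mw:,.0f} MW")       # ~330 MW
print(f"facility estimate: {facility_mw:,.0f} MW")      # ~430 MW, before growth or redundancy
```

Even with generous overhead assumptions, a single campus rated at 1.2 GW leaves a lot of room for expansion, which is the point.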

That changes who wins. The advantage goes to operators that can secure power and turn it into working capacity on schedule.

Four campuses, not one giant training cluster

The geography matters. Texas, Portugal, England, Norway. That helps with sovereignty requirements, resilience, and energy sourcing optics. Norway and Portugal fit the current pattern especially well: renewable-heavy grids, cooler climates, and governments that want this investment.

Still, nobody should read this as a single seamless 200,000-GPU supercluster.

Cross-region distributed training is still a bad fit for serious workloads. The speed of light is still there, being annoying. Large training jobs need RDMA-class low latency and huge east-west bandwidth inside one region or campus fabric. Once traffic starts crossing oceans, or even long terrestrial links, the efficiency hit gets ugly fast. All-reduce overhead will eat you alive.
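A rough ring all-reduce cost model makes the gap concrete. The payload size, link speeds, and latencies below are illustrative assumptions, not measurements from any of these sites:

```python
# Rough ring all-reduce cost model: 2*(N-1) steps, each moving payload/N bytes
# and paying one link latency. All numbers below are illustrative assumptions.

def allreduce_seconds(payload_gb: float, workers: int, link_gbps: float, latency_s: float) -> float:
    payload_bits = payload_gb * 8e9
    steps = 2 * (workers - 1)
    per_step_bits = payload_bits / workers
    return steps * (per_step_bits / (link_gbps * 1e9) + latency_s)

GRAD_GB = 10.0  # assumed gradient payload per synchronization

# Inside one campus fabric: 400 Gbps-class links, ~10 microsecond effective latency
print(f"in-region:    {allreduce_seconds(GRAD_GB, 64, 400, 10e-6):6.2f} s")   # ~0.4 s

# Across regions: far less bandwidth per pair and tens of milliseconds of RTT
print(f"cross-region: {allreduce_seconds(GRAD_GB, 64, 10, 40e-3):6.2f} s")    # ~21 s
```

A sync step that goes from under half a second to tens of seconds is the difference between a busy cluster and an expensive space heater.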

The practical model is familiar:

  • Training stays regional
  • Inference gets distributed
  • Replication follows user demand, energy pricing, and regulatory boundaries

That works well enough for global products. Train where the fabric is dense and the power deal makes sense. Put inference closer to users and inside the right compliance boundary. It also means the global GPU total says less about any single training run than the headline suggests.

The architecture is predictable. Operating it isn't.

Nscale and Microsoft haven't published full system SKUs, but the broad shape is easy to infer because the industry has settled into a standard pattern.

Inside the node or rack domain, expect tightly coupled NVLink and NVSwitch fabrics, likely in 72-GPU NVL-style islands or something close to that. That's where model-parallel workloads want to live. Tensor-parallel groups should stay inside those high-bandwidth domains whenever possible.

Past that, the cluster fabric is likely 400 to 800 Gbps InfiniBand or top-end Ethernet with RoCEv2, ECN, and the usual congestion-control tuning. At this size, topology choices carry real consequences. Fat-tree is common. Dragonfly variants are appealing when operators want to cut cabling and control oversubscription. Fabric design stops being an implementation detail pretty quickly.
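At the job level, that tuning often shows up as a handful of NCCL environment variables set by the launcher. The sketch below uses real NCCL knobs, but the values are placeholders for a hypothetical RoCEv2 fabric and the right settings are entirely site-specific:

```python
import os
import torch.distributed as dist

# Illustrative NCCL settings a launcher might export before process-group init.
# The variable names are real NCCL knobs; the values are placeholders and must
# be validated against the actual NICs, switches, and topology.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")    # bootstrap interface (placeholder)
os.environ.setdefault("NCCL_IB_HCA", "mlx5")           # which RDMA NICs to use (placeholder)
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")        # commonly required for RoCEv2
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "PHB")     # allow GPUDirect RDMA where topology permits
os.environ.setdefault("NCCL_DEBUG", "WARN")            # raise to INFO when validating a new fabric

dist.init_process_group(backend="nccl")
```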

The software stack is equally unsurprising:

  • CUDA
  • NCCL with hierarchical collectives
  • PyTorch distributed stacks such as FSDP, ZeRO, tensor parallelism, sequence parallelism, and pipeline parallelism
  • CUDA Graphs to reduce launch overhead
  • Heavy use of fused kernels, especially around attention and optimizer steps

That's the current AI factory template. Everybody wants another 5 to 15 percent of utilization because, at this scale, a small efficiency gain is real money.
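As one concrete slice of that template, a minimal FSDP setup with bf16 mixed precision might look like the sketch below, launched with torchrun. The model, sizes, and hyperparameters are stand-ins, not a recommended configuration:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Minimal sketch: shard a stand-in transformer across the data-parallel group.
# Dimensions and the flat wrapping policy are placeholders, not a tuned config.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True),
    num_layers=4,
).cuda()

model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,    # compute in bf16
        reduce_dtype=torch.bfloat16,   # gradient reduction in bf16
        buffer_dtype=torch.bfloat16,
    ),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()   # placeholder loss for the sketch
loss.backward()
optimizer.step()
```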

Giant clusters punish sloppy software

A lot of infrastructure coverage still treats compute as the main event and software as a footnote. At this size, that's backwards.

If you can't keep GPUs fed, the economics go sideways. And "fed" covers a lot:

  • the right parallelism strategy for the actual topology
  • stable low-precision training
  • enough checkpoint bandwidth to recover from failures without wrecking utilization
  • orchestration that respects placement and fault domains
  • data pipelines that don't starve expensive accelerators

Storage and I/O are a good example. For large language and multimodal training, a rough sustained demand of 0.5 to 2.0 GB/s per GPU is plausible depending on data format, preprocessing, and caching. Multiply that across thousands of devices and you're dealing with a storage architecture problem, not a side note.

That's why large clusters keep moving toward a layered design: local NVMe for hot access, some form of NVMe-oF or distributed filesystem for burst traffic and checkpoints, then object storage underneath. Checkpointing has to be hierarchical too. Write locally first, spill outward asynchronously. If every checkpoint blocks on shared storage, you've built your own outage machine.
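A minimal shape for that pattern, assuming a local NVMe mount; the paths and the upload routine are hypothetical placeholders for whatever the real storage tiers provide:

```python
import threading
import torch

# Sketch of hierarchical checkpointing: block only on the fast local write,
# push to shared or object storage in the background. Paths and the upload
# function are hypothetical placeholders.
LOCAL_NVME = "/local_nvme/ckpt_step_{step}.pt"

def upload_to_shared_storage(path: str) -> None:
    # Placeholder: a real system would copy to NVMe-oF, a parallel filesystem,
    # or object storage, then verify the write before trimming local copies.
    ...

def save_checkpoint(step: int, model, optimizer) -> None:
    path = LOCAL_NVME.format(step=step)
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, path)                        # fast, local, blocking
    threading.Thread(                              # slow tier stays off the training path
        target=upload_to_shared_storage, args=(path,), daemon=True
    ).start()
```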

Precision is another area where the gains are real but touchy. GB300-class accelerators are designed to make lower-precision paths pay, whether that's FP8 or, in some cases, FP4-style workflows where the stack supports them. Teams still need calibration runs, convergence testing, and guardrails around loss scaling and per-layer behavior. Training a little faster doesn't help if the run quietly degrades.
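One cheap guardrail is a loss-spike monitor wrapped around the low-precision run. The window size and spike threshold below are arbitrary example values:

```python
from collections import deque

# Simple guardrail sketch: flag the run if loss jumps well above its recent
# trend, which is often the first visible symptom of a degrading low-precision
# run. Window size and spike factor are arbitrary example values.
class LossSpikeMonitor:
    def __init__(self, window: int = 200, spike_factor: float = 3.0):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, loss: float) -> bool:
        """Return True if this step's loss looks like a spike."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > baseline * self.spike_factor:
                return True
        self.history.append(loss)
        return False
```

Hooked into the training loop, a flagged step might trigger an alert, a per-layer precision fallback, or a rollback to the last good checkpoint.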

What engineering teams should take from this

If your team ends up targeting Azure-integrated capacity or similar Nscale-operated environments, a few practical points stand out.

Design for topology, not abstract GPU counts

An 8,000-GPU cluster can behave very differently depending on the wiring. Keep your biggest tensor-parallel groups inside a single NVLink island. Use pipeline parallelism and data parallelism to stretch beyond that boundary. Tune NCCL for the fabric you actually have, not the one from a vendor slide.
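A sketch of that mapping using PyTorch's device mesh API, assuming a recent PyTorch release and a hypothetical 8,192-GPU job; the parallelism degrees are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Illustrative layout for a hypothetical 8,192-GPU job. The degrees are example
# values: tensor parallelism sized to fit inside one NVLink island, pipeline and
# data parallelism stretched across the cluster fabric.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

TP = 8     # keep at or below the NVLink domain you actually have
PP = 8
DP = 128   # DP * PP * TP must equal the world size (128 * 8 * 8 = 8192)

# The last mesh dimension varies fastest, so with standard torchrun rank
# assignment the "tp" group lands on consecutive local ranks in one island.
mesh = init_device_mesh("cuda", (DP, PP, TP), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()   # confined to the high-bandwidth domain
dp_group = mesh["dp"].get_group()   # crosses the cluster fabric
```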

Treat data and checkpoints like core infrastructure

Teams love benchmarking flops and ignoring everything before and after each step. That gets expensive fast. Use sharded indexed datasets, aggressive prefetching, and asynchronous checkpoint upload. Test resume times under failure. Rack-level faults are normal at this scale.
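A small sketch of the ingest side, with per-rank sharding, worker splitting, and prefetch; the shard paths and decode step are placeholders:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

# Shard-aware streaming dataset: each data-parallel rank reads a disjoint subset
# of shard files, each DataLoader worker reads a disjoint subset of those, and
# prefetching keeps batches queued ahead of the GPU. Paths and decoding are placeholders.
class ShardedTokenDataset(IterableDataset):
    def __init__(self, shard_paths, rank: int, world_size: int):
        self.shards = shard_paths[rank::world_size]   # disjoint shards per rank

    def __iter__(self):
        info = get_worker_info()
        # Split again across DataLoader workers so samples are not duplicated.
        shards = self.shards if info is None else self.shards[info.id::info.num_workers]
        for path in shards:
            yield from self._read_shard(path)

    def _read_shard(self, path):
        # Placeholder decode: a real pipeline would mmap or stream pre-tokenized records.
        yield torch.zeros(2048, dtype=torch.long)

loader = DataLoader(
    ShardedTokenDataset([f"shard_{i:05d}.bin" for i in range(1024)], rank=0, world_size=64),
    batch_size=8,
    num_workers=4,        # decode off the training process
    prefetch_factor=4,    # keep several batches in flight per worker
    pin_memory=True,
)
```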

Orchestration turns into internal politics

When GPU hours cost this much, queueing policy stops being an admin detail. Long-running training jobs, bursty fine-tuning work, and latency-sensitive inference all want the same hardware. Whether the environment lands on Kubernetes, Slurm, or a hybrid, placement and preemption policy need to be settled early.

Security and sovereignty carry more weight in Europe

The European sites are partly about green power, but they're also about data locality and regulation. That matters for enterprise AI workloads that can't move sensitive datasets across regions casually. Training will often stay local when policy says it must, even if another campus has cheaper power that week.

Why Microsoft wants this capacity

The deal also says something about Microsoft's position.

It already has one of the deepest AI infrastructure footprints in the market, and demand still seems to be outpacing comfortable supply. This gives Microsoft extra GPUs, broader geographic coverage, more energy diversification, and another route to next-gen Nvidia capacity without leaning on one buildout path.

It also lines up with the broader shift in the market. OpenAI's giant chip and power commitments helped move the industry's planning unit from servers and racks to gigawatts.

A lot of companies won't be able to play at this level. Capital matters, but it doesn't get you transmission access, liquid cooling expertise, supply chain priority, or a software platform that can absorb this much hardware without bleeding efficiency.

That's why these announcements keep pulling together cloud vendors, utilities, infrastructure funds, energy-rich regions, and AI companies. The stack has become stubbornly physical.

For developers, the takeaway is simple: the next wave of AI systems will be shaped as much by network design, storage throughput, and power envelopes as by model architecture. Code that assumes generic "GPU cloud" behavior is already behind the hardware.

What to watch

The harder part is not the headline capacity number. It is whether the economics, supply chain, power availability, and operational reliability hold up once teams try to use this at production scale. Buyers should treat the announcement as a signal of direction, not proof that cost, latency, or availability problems are solved.
