Artificial Intelligence · August 28, 2025

Nvidia Q2 revenue hits $46.7B as data center sales reach $41.1B

Nvidia’s Blackwell quarter sets a new floor for AI infrastructure

Nvidia reported $46.7 billion in revenue for the quarter, up 56% year over year, with $41.1 billion of that coming from the data center segment. Net income reached $26.4 billion.

The number that stands out for infrastructure teams is $27 billion of data center revenue from Blackwell. That’s enough to say the market has already moved. Blackwell is no longer a high-end option for a handful of labs. It’s becoming the default target for serious AI deployments.

Nvidia also highlighted a performance figure that matters more than the earnings beat: OpenAI’s open-source gpt-oss models ran at 1.5 million tokens per second on a single GB200 NVL72 rack-scale system.

That number says a lot about where inference is heading. Baseline assumptions from the Hopper era are aging fast.

A new operating baseline

Quarterly chip coverage tends to get pulled into stock chatter. For engineers, the useful signal is simpler.

If Blackwell generated $27 billion in one quarter, hyperscalers, cloud providers, and large model operators are already buying around a new baseline. That changes cloud SKUs, reference architectures, managed inference products, scheduling policy, software tuning, and procurement planning. The hardware that ships in volume ends up shaping everyone else’s stack.

Teams still estimating 2025 and 2026 capacity around Hopper-class limits are probably behind. They’re likely underestimating what large providers can serve per rack and overestimating the cost per token on older clusters.

Not everyone gets Blackwell economics right away. But the benchmark now comes from the buyers who do.

What GB200 NVL72 changes

The GB200 NVL72 is built to keep large-model inference inside a fast communication domain for as long as possible.

A simplified view:

  • A GB200 superchip pairs a Grace CPU with two Blackwell B200 GPUs using NVLink-C2C, reducing CPU-GPU movement overhead and tightening memory interaction.
  • NVL72 ties 72 GPUs into one NVLink domain through an NVLink switch fabric.
  • Fifth-generation NVLink gives each GPU roughly 1.8 TB/s of bandwidth into the rack fabric.
  • At the rack level, 72 GPUs work out to roughly 14 TB of HBM if you assume about 192 GB of HBM3e per GPU; the sketch after this list runs the arithmetic.
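
A quick sketch of the arithmetic behind those two bullets. The per-GPU figures are the assumptions stated above, so treat the outputs as back-of-envelope values rather than spec-sheet numbers:

    # Back-of-envelope rack math using the figures in the list above.
    GPUS_PER_RACK = 72
    HBM_PER_GPU_GB = 192          # assumed HBM3e capacity per GPU
    NVLINK_PER_GPU_TBPS = 1.8     # fifth-generation NVLink, per GPU

    rack_hbm_tb = GPUS_PER_RACK * HBM_PER_GPU_GB / 1000      # ~13.8 TB, "roughly 14 TB"
    rack_fabric_tbps = GPUS_PER_RACK * NVLINK_PER_GPU_TBPS   # ~130 TB/s of NVLink bandwidth

    print(f"Rack HBM: ~{rack_hbm_tb:.1f} TB")
    print(f"Aggregate NVLink bandwidth: ~{rack_fabric_tbps:.0f} TB/s")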

That matters because inference bottlenecks often come down to feeding the model, not raw FLOPS. Once context windows get large, KV cache becomes a memory problem first.
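
For a sense of scale, here is a rough KV-cache sizing sketch. The formula is the standard footprint for attention with grouped KV heads (keys and values for every layer and token); the model dimensions are hypothetical placeholders, not any specific model:

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, bytes_per_elem, seq_len, batch):
        """Rough KV-cache footprint: keys and values for every layer and token."""
        per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token_bytes * seq_len * batch / 1e9

    # Hypothetical large model: 80 layers, 8 KV heads of dim 128, FP8 (1-byte) cache.
    print(kv_cache_gb(80, 8, 128, 1, seq_len=128_000, batch=1))    # ~21 GB per sequence
    print(kv_cache_gb(80, 8, 128, 1, seq_len=128_000, batch=64))   # ~1.3 TB for 64 streams

Against roughly 14 TB of HBM per rack, a few dozen long-context streams already consume a meaningful share of memory before weights are even counted.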

Throughput falls apart when the system keeps paying communication penalties. If activations, KV state, or expert-routing traffic spill onto slower links or bounce back through host memory, performance drops quickly. NVL72 is attractive because it keeps a huge amount of that traffic on a very fast local fabric.

That’s why the 1.5 million tokens per second figure is plausible. Divide it by 72 GPUs and you get a rough average of about 20,800 tokens/s per GPU under highly optimized, high-concurrency conditions. That’s a rack aggregate under load, not a per-request latency number. For production economics, aggregate throughput still matters a lot.

Hitting that number takes a lot of tuning

Nobody gets to 1.5 million tokens per second by dropping a model onto default settings.

You need an aggressively tuned stack:

  • Heavy batching across many streams to keep tensor cores busy
  • KV-cache placement and sharding that keeps hot state in HBM and avoids expensive detours
  • Kernel fusion and graph optimization through TensorRT-LLM
  • Likely some form of speculative decoding or assisted generation if the serving path supports it; the sketch after this list shows why the acceptance rate drives the payoff
  • Topology-aware parallelism so tensor, pipeline, and sequence splits match the rack’s communication fabric
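
On the speculative decoding bullet: the leverage comes almost entirely from the acceptance rate. A minimal sketch of the standard expected-tokens-per-step calculation, assuming independent per-token acceptance (a simplification) and purely illustrative numbers:

    def expected_tokens_per_step(accept_rate, draft_len):
        """Expected tokens emitted per target-model forward pass with speculative
        decoding: 1 + a + a^2 + ... + a^draft_len (geometric series)."""
        return sum(accept_rate ** i for i in range(draft_len + 1))

    for a in (0.5, 0.7, 0.9):
        print(a, round(expected_tokens_per_step(a, draft_len=4), 2))
    # ~1.94, ~2.77, ~4.10: the same hardware serves very different token rates
    # depending on how well the draft model matches the target.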

This is where a lot of loose AI infra talk breaks down. Peak rack throughput is useful, but it doesn’t map neatly to product performance. Interactive workloads have different constraints. p95 and p99 latency still matter. So do request variability and prompt length distribution.

A rack can post great throughput numbers and still perform poorly on low-latency, bursty user traffic if scheduling is sloppy.
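
A toy queueing sketch shows the effect. This is not a model of any real serving stack; it is just a batched server with Poisson arrivals, and it shows throughput staying healthy while tail latency degrades as load rises:

    import random

    def simulate(arrival_rps, max_batch, batch_service_s, duration_s=600.0, seed=0):
        """Toy batched server: whatever is queued when the server frees up (capped
        at max_batch) runs as one batch taking batch_service_s, full or not.
        Returns (achieved rps, p95 latency s, p99 latency s). Illustrative only."""
        rng = random.Random(seed)
        arrivals, t = [], 0.0
        while t < duration_s:                      # Poisson arrivals
            t += rng.expovariate(arrival_rps)
            arrivals.append(t)
        latencies, free_at, i = [], 0.0, 0
        while i < len(arrivals):
            start = max(free_at, arrivals[i])      # wait for the server and for work
            j = i
            while j < len(arrivals) and arrivals[j] <= start and j - i < max_batch:
                j += 1
            done = start + batch_service_s
            latencies += [done - a for a in arrivals[i:j]]
            free_at, i = done, j
        latencies.sort()
        pick = lambda q: latencies[int(q * (len(latencies) - 1))]
        return len(latencies) / duration_s, pick(0.95), pick(0.99)

    # Capacity here is 64 rps (32-request batches every 0.5 s). As load approaches
    # it, throughput keeps climbing but the tails do not stay flat.
    for rps in (20, 45, 60):
        print(rps, simulate(arrival_rps=rps, max_batch=32, batch_service_s=0.5))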

Engineers should read Nvidia’s inference claim as a new upper bound. Not a default expectation.

Lower precision helps, with the usual caveats

Blackwell pushes lower precision hard, especially FP4 for inference and improved FP8 handling for training through Nvidia’s updated Transformer Engine.

The upside is straightforward. Lower precision cuts memory use and raises throughput. For serving, that goes straight to cost. For very large models, it can be the difference between fitting cleanly in HBM and building around awkward workarounds.

The downside is familiar: quality drift, calibration work, and layer-specific exceptions. Some models quantize cleanly. Some don’t. A lot of teams will end up running most of the graph in low precision while keeping sensitive layers in BF16 or FP16.
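
In practice that usually becomes a per-layer precision policy rather than a single switch. A rough sketch of the bookkeeping; the layer names, keywords, and precision labels are illustrative, not tied to any particular model or toolkit:

    # Illustrative per-layer precision plan: quantize the bulk of the network,
    # keep numerically sensitive pieces (embeddings, norms, final projection) higher.
    SENSITIVE_KEYWORDS = ("embed", "norm", "lm_head")

    def precision_plan(layer_names, low_precision="fp4", fallback="bf16"):
        plan = {}
        for name in layer_names:
            sensitive = any(k in name for k in SENSITIVE_KEYWORDS)
            plan[name] = fallback if sensitive else low_precision
        return plan

    layers = ["embed_tokens", "layers.0.attn.qkv", "layers.0.mlp.up",
              "layers.0.input_norm", "lm_head"]
    print(precision_plan(layers))
    # {'embed_tokens': 'bf16', 'layers.0.attn.qkv': 'fp4', 'layers.0.mlp.up': 'fp4',
    #  'layers.0.input_norm': 'bf16', 'lm_head': 'bf16'}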

That’s manageable. It also means “supports FP4” does not mean “run everything in FP4.” Teams that skip proper evals will ship regressions and find out from users.

Better inference economics depend on utilization

A number like 1.5 million tokens per second changes expectations for cost per million tokens, at least where utilization stays high.

That qualifier matters.

If you can keep the rack full, manage memory well, and keep the serving path stable, unit economics improve sharply. If your workload is spiky, prompts are huge, tenants are noisy, and the serving layer is messy, the gap between theoretical and actual performance closes fast.
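
The unit-economics arithmetic fits in a few lines. Every number below is a placeholder; the point is how directly sustained utilization sets the cost per million tokens:

    def cost_per_million_tokens(rack_cost_per_hour, peak_tokens_per_s, utilization):
        """Effective serving cost when the rack only sustains a fraction of peak."""
        tokens_per_hour = peak_tokens_per_s * utilization * 3600
        return rack_cost_per_hour / tokens_per_hour * 1_000_000

    # Placeholder numbers: a $300/hr rack rated at 1.5M tokens/s aggregate.
    for util in (0.9, 0.5, 0.2):
        print(util, round(cost_per_million_tokens(300.0, 1_500_000, util), 4))
    # 0.9 -> ~$0.0617, 0.5 -> ~$0.1111, 0.2 -> ~$0.2778 per million tokens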

A lot of enterprise teams are about to run into the dull version of this problem: buying powerful hardware is easier than operating it well.

TensorRT-LLM, Triton, NCCL, CUDA, and Nvidia’s surrounding tooling are a big reason Blackwell is moving into production so quickly. The hardware is strong. The software moat is stronger.

Power and cooling are now hard limits

There’s a blunt physical constraint here.

An NVL72-class rack can draw roughly 100 to 120 kW or more and usually assumes liquid cooling. At that point, the bottleneck isn’t just access to chips. It’s power delivery, thermal design, water loops, floor planning, and whether the facility can handle that density without months of rework.
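
The facility math is just as blunt. A rough sketch with placeholder assumptions for rack power, cooling overhead (PUE), and the available power budget:

    def racks_supported(facility_mw, rack_kw, pue):
        """How many racks a facility power budget supports once cooling and other
        overhead (PUE) is taken out of the power available for IT load."""
        it_power_kw = facility_mw * 1000 / pue
        return int(it_power_kw // rack_kw)

    # Placeholder assumptions: a 10 MW hall, ~120 kW per NVL72-class rack, PUE 1.3.
    print(racks_supported(facility_mw=10, rack_kw=120, pue=1.3))   # ~64 racks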

That changes who gets to deploy at the frontier. Budget helps, but so does having data center capacity ready for this class of hardware.

For smaller buyers, cloud access to Blackwell matters more than ownership. Renting the architecture is often the practical move.

China is still constrained

Nvidia’s China situation remains messy. The company said there were zero H20 shipments into China last quarter. The reported arrangement to hand 15% of China chip revenue to the U.S. government still lacks final regulatory certainty, and Chinese authorities have reportedly discouraged use of Nvidia chips.

So demand from China is still limited. Near term, that doesn’t look catastrophic because demand elsewhere is strong enough to absorb supply. Longer term, it pushes the market further toward split ecosystems: CUDA-first infrastructure on one side, domestic or alternate accelerator stacks on the other.

That’s bad for software portability. Multinational teams should expect more pressure to keep models, runtimes, and deployment tooling adaptable across different hardware targets.

What engineering teams should do now

A few implications are already clear.

First, revisit planning assumptions. Hopper-era models for throughput, memory pressure, and rack design are aging quickly.

Second, treat KV cache as a first-class systems problem. Long-context inference often becomes memory-bound before it becomes compute-bound. Paged KV cache, smart sharding, and careful placement across NVLink-connected GPUs matter.

Third, measure latency under realistic load, not just peak tokens per second. Aggregate throughput looks great on a slide. Production traffic is usually uglier.

Fourth, plan around topology, especially in mixed-generation fleets. Blackwell and Hopper in the same environment can create nasty scheduling variance if your orchestrator treats them as interchangeable.

Fifth, get serious about observability. GPU utilization alone is not enough. Track tokens per second, cache hit rates, acceptance rates for speculative decoding, p95 and p99 latency, and interconnect-aware placement. DCGM plus app-level telemetry should be standard.
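
Concretely, that means emitting a few serving-level numbers next to the GPU counters. A minimal sketch of the kind of per-window record worth keeping; the field names are illustrative, not any particular tool's schema:

    from dataclasses import dataclass

    @dataclass
    class ServingWindowMetrics:
        """Per-window serving metrics to sit alongside DCGM-style GPU counters."""
        tokens_per_s: float            # aggregate decode throughput
        kv_cache_hit_rate: float       # fraction of requests reusing cached prefixes
        spec_decode_acceptance: float  # accepted draft tokens / proposed draft tokens
        p95_latency_ms: float
        p99_latency_ms: float
        gpu_util: float                # still useful, just not sufficient on its own

        def breaches_slo(self, p99_budget_ms: float) -> bool:
            return self.p99_latency_ms > p99_budget_ms

    window = ServingWindowMetrics(21_000, 0.42, 0.71, 380.0, 900.0, 0.83)
    print(window.breaches_slo(p99_budget_ms=750.0))   # True: tail budget blown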

Blackwell has already moved past the launch phase. It’s becoming an operating assumption. Nvidia’s revenue was huge. The infrastructure shift matters more.
