LLM · September 12, 2025


Thinking Machines Lab targets reproducible LLM inference under Mira Murati

Thinking Machines Lab is chasing deterministic LLMs, and that could fix a real production headache

Thinking Machines Lab, the startup led by former OpenAI CTO Mira Murati, has laid out an early but serious technical goal: make LLM inference reproducible.

That sounds narrow compared with the usual model-company pitch. It’s also one of the more practical ideas you can chase right now.

In a new research post, “Defeating Nondeterminism in LLM Inference,” Horace He argues that a lot of the variance people blame on “LLMs being random” actually comes from lower in the stack. Not just decoding settings or seeds. GPU kernels, reduction order, collective ops, mixed-precision paths, runtime scheduling. The systems mess most product demos politely ignore.

If that layer can be controlled tightly enough, identical prompts under identical conditions should produce identical outputs. For teams putting AI into regulated, customer-facing, or production-critical systems, that matters.

Why this matters

A lot of AI software still treats variability as background noise. Sometimes that’s fine. If you sample with temperature=0.8, variation is the point.

But many production uses of LLMs want repeatability, not creativity.

If your support copilot gives slightly different policy answers for the same internal question, you have a problem. If your legal review assistant can’t reproduce the exact output that triggered an escalation last week, that’s a problem too. If your evaluation pipeline shifts because the runtime chose a different kernel path, that’s a quieter failure, but still a failure.

Teams have learned to live with the fact that even with temperature=0, the same model can still produce different tokens across runs. People often treat that as inherent. Thinking Machines is arguing that a lot of it is inference-stack slop.

That’s believable. Anyone who’s spent time in ML infrastructure knows “seeded” and “reproducible” stop meaning the same thing once you’re deep in GPU execution, distributed math, and optimized kernels.

Where the nondeterminism comes from

There are two different sources of variation, and they get mixed together constantly.

One is intentional randomness. Sampling controls such as temperature, top_p, and top_k introduce variation by design; beam-search tie-breaking and sampled speculative decoding can add more on top.

The other is unintentional nondeterminism in the numerical and systems layer. That’s the part Thinking Machines is targeting.

Floating-point math is the first culprit. Addition isn’t associative in finite precision, so if parallel workers sum partial values in a different order, you can get slightly different answers. In transformer inference, those tiny shifts can cascade. A small change in attention scores or normalization can alter which token barely wins. After that, generation diverges.
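
A toy example makes the point concrete. This is nothing but float32 arithmetic in PyTorch, summing the same values in two different orders:

import torch

torch.manual_seed(0)
vals = torch.randn(100_000, dtype=torch.float32)
s_forward = vals.sum()                                   # one reduction order
s_chunked = sum(chunk.sum() for chunk in vals.chunk(7))  # another order
print((s_forward - s_chunked).item())  # usually small but nonzero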

Then there’s the GPU runtime:

  • Libraries such as cuBLAS, cuDNN, and FlashAttention variants may choose different algorithms depending on heuristics, workspace, or hardware state.
  • Atomics and unsynchronized writes can produce run-to-run differences.
  • Multi-GPU collectives like all_reduce and reduce_scatter can vary based on rank ordering, topology, and runtime decisions.
  • Fused kernels change evaluation order.
  • Dynamic batching and KV-cache paging can alter execution timing and scheduling.
  • TF32, FP16, and BF16 paths can produce architecture-specific rounding behavior.
  • Driver or CUDA version drift can change numerics without any model weight changes.

So no, reproducibility in LLMs is not a matter of setting torch.manual_seed(42) and moving on.
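
One concrete way to see it: score the same prompt alone and then inside a batch, with sampling out of the picture entirely. A sketch, assuming a local Hugging Face-style model (gpt2 here as a stand-in):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

solo = tok(["the same prompt"], return_tensors="pt")
pair = tok(["the same prompt", "an unrelated neighbor"],
           return_tensors="pt", padding=True)
n = solo.input_ids.shape[1]

with torch.no_grad():
    logits_solo = model(**solo).logits[0]
    logits_pair = model(**pair).logits[0, :n]

# May be exactly zero on some backends, small but nonzero on others;
# nothing in the model's math promises it is zero.
print((logits_solo - logits_pair).abs().max().item())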

The hard part is keeping the math path stable across kernels, devices, libraries, and time.

That’s systems work. It’s also where a lot of production reliability gets won or lost.

What Thinking Machines appears to be building

The company hasn’t announced a full product yet, but it says a first release for researchers and startups is coming in the next few months. The direction is clear enough.

A deterministic inference stack would need control over several layers at once:

  • fixed kernel selection
  • fixed execution ordering
  • deterministic reduction strategies (a toy sketch follows this list)
  • locked collective behavior across GPUs
  • stable precision settings
  • deterministic decoding on top
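
On the reduction bullet, here is a toy version of a deterministic strategy. The tree shape depends only on input length, never on thread timing or hardware state; a deterministic GPU kernel would do the equivalent with fixed splits across thread blocks:

def tree_sum(xs):
    # Pairwise reduction whose shape is fixed by len(xs) alone.
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            paired.append(xs[-1])
        xs = paired
    return xs[0]

Two runs over the same inputs add in the same order and produce bit-identical results, which timing-dependent atomic accumulation cannot promise.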

In practice, that probably means giving up some of the “fastest possible” behavior modern inference stacks chase.

Autotuning exists for good reasons. Dynamic batching exists for good reasons. Aggressive fusion exists for good reasons. They improve throughput and hardware utilization. They also make execution less stable and harder to inspect.

A deterministic stack likely has to trade some of that away.

That’s what makes this interesting as a company strategy. Thinking Machines is betting that inference architecture itself can be a product differentiator. If it can offer reproducible responses under controlled conditions, that becomes a concrete selling point for enterprise buyers that care about audit trails, regression testing, and contract-level guarantees.

You can easily imagine “deterministic inference profile” showing up in vendor evaluations.

Determinism usually costs speed

This isn’t free.

Deterministic reductions can be slower. Fixed algorithms can lag autotuned ones. Pinned collective order gets harder as you scale across more GPUs. Cross-node determinism is worse. Add dynamic request traffic from real users and stable scheduling gets harder than any lab setup suggests.

There’s also a ceiling here. Determinism under tightly controlled conditions is one thing. Determinism in a large multi-tenant inference service with elastic scaling, rolling upgrades, mixed GPU fleets, and latency targets is another.

So if Thinking Machines gets this working in a clean product, that’s impressive. If the guarantee only holds on a frozen hardware and software stack, it’s still useful, but narrower than the headline promise suggests.

That distinction matters. A claim of “reproducible inference” needs fine print:

  • same model weights
  • same prompt and decoding config
  • same hardware class
  • same driver and CUDA stack
  • same distributed topology
  • same runtime build
  • same batch behavior, if batching affects execution order

Without those constraints, determinism turns into marketing pretty quickly.

Why RL and customization may benefit even more

One of the sharper points in the Thinking Machines write-up is the claim that deterministic inference can make reinforcement learning smoother.

That’s worth paying attention to.

RL pipelines already have enough variance. If the model itself jitters because the inference path changes underneath it, reward estimates get noisier, run-to-run comparisons get weaker, and debugging gets ugly fast. Stable inference won’t fix RL’s underlying mess, but it can remove one source of entropy teams don’t want.

That matters if Thinking Machines plans to use RL to customize models for businesses, which the company's public materials suggest. Enterprise customization depends on iteration speed and predictability. If every training or evaluation pass includes extra numerical wobble from the serving stack, you burn compute trying to separate actual model changes from runtime noise.

Reducing that noise is boring work in the best sense. It makes the rest of the pipeline easier to trust.

What developers can do now

Most teams can’t control kernels at the level Thinking Machines is describing. Still, if reproducibility matters, there are obvious steps worth taking.

Start with decoding. Use greedy generation when you need stable outputs: temperature=0, top_p=1.0, do_sample=False. If you use beam search, lock down tie-breaking and avoid stochastic penalties.
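
As a concrete starting point, here is what that looks like with Hugging Face transformers (gpt2 as a stand-in model); with do_sample=False the sampler is bypassed entirely, so temperature and top_p stop mattering, but pinning them keeps configs honest:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("Refund policy for damaged items:", return_tensors="pt").input_ids

# Greedy decoding: argmax at every step, no sampling path involved.
out = model.generate(ids, do_sample=False, num_beams=1, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))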

Then freeze your environment harder than you probably do today. Containerize. Pin driver versions, CUDA, cuDNN, cuBLAS, framework versions, compiler settings, and GPU SKU. Record build hashes. Treat environment drift as a real change, not background noise.

Framework-level deterministic switches can help, though they won’t save you on their own. In PyTorch, settings like:

import torch
torch.use_deterministic_algorithms(True)   # error on ops with no deterministic path
torch.backends.cudnn.deterministic = True  # pin cuDNN to deterministic kernels

can reduce nondeterministic behavior, with the usual downside that some ops get slower and others simply raise errors because no deterministic implementation exists.
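
Two adjacent knobs from PyTorch's determinism documentation are worth setting at the same time: deterministic cuBLAS GEMMs on CUDA need a fixed workspace, configured through an environment variable before CUDA initializes, and TF32 can be switched off explicitly.

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # set before CUDA starts

import torch
torch.backends.cudnn.benchmark = False         # no autotuned kernel selection
torch.backends.cuda.matmul.allow_tf32 = False  # keep matmuls in strict FP32
torch.backends.cudnn.allow_tf32 = False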

Also worth doing:

  • disable opportunistic precision changes like TF32 if you need tighter reproducibility
  • keep inference topology stable across runs
  • avoid hidden batching differences during evaluation
  • log seeds, prompt templates, model IDs, build metadata, and hardware details (a manifest sketch follows this list)
  • test on the exact deployment configuration, not a roughly similar dev box
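
For the logging item, a minimal run manifest goes a long way. The field names here are illustrative, not a standard; capture whatever your audits actually need:

import json, platform, torch

manifest = {
    "model_id": "my-org/model-v3",  # hypothetical identifier
    "decoding": {"temperature": 0, "top_p": 1.0, "do_sample": False},
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "cudnn": torch.backends.cudnn.version(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "python": platform.python_version(),
}
print(json.dumps(manifest, indent=2))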

If you care about audits or regression tests, reproducibility is a systems property. Prompt engineering won’t fix it.

The broader industry angle

This sounds like a small problem until it bites you.

The AI industry loves benchmark numbers, eval dashboards, and claims of reliable model behavior. Underneath that, plenty of those comparisons are noisier than anyone admits. If deterministic inference becomes practical, benchmarking gets cleaner, incident review gets easier, and regressions get easier to pinpoint.

It would also put pressure on open-source tooling. If Thinking Machines publishes techniques or code that move the needle, frameworks and inference engines will have to respond. Some already expose deterministic modes, but the current state is fragmented and incomplete. There’s room for a real standard here.

For regulated sectors, this could matter even sooner than it does for general-purpose AI apps. Finance, healthcare, insurance, government. Any setting where you may need to reproduce an output months later for an audit or dispute. Today, many teams can log prompts and outputs, but reproducing the generation path exactly is often shaky. That’s a weak foundation for serious systems.

Thinking Machines is going after a hard problem that plenty of companies quietly care about and few talk about publicly. That alone makes it worth watching.

If it can turn deterministic inference from a research goal into a product guarantee, that would be more useful than another model launch. It would be infrastructure people can build on without wondering whether the same prompt will quietly change its mind next week.

What to watch

The risk is overreading early technical progress as operational proof. In scientific or health-adjacent settings, reliability, validation, data quality, and expert review matter more than a clean product story. The useful question is where the system reduces friction without weakening accountability.
