LLM · August 6, 2025

Windows 11 AI Foundry adds GPT-OSS-20B for local inference on PC

Microsoft puts GPT-OSS-20B on Windows 11, and that changes who gets to build local AI

Microsoft has added OpenAI’s GPT-OSS-20B to Windows AI Foundry on Windows 11. For developers, that means a 20B-parameter reasoning model can now run locally on a Windows box with a decent GPU instead of sitting behind an API call.

That changes the practical options for teams building internal AI tools.

Local inference has been stuck in the middle for a while. Small models run anywhere but struggle once the task gets messy. Large models bring cloud cost, network latency, and the usual security and legal review. GPT-OSS-20B lands in a more usable range. It’s small enough for high-end consumer hardware and still aimed at tool-using workflows like web search, Python execution, and chained task completion.

That matters more than another benchmark screenshot.

Why Windows matters here

The interesting part isn’t that OpenAI has a model that runs locally. Plenty of models already do. The interesting part is that Microsoft is putting it on a native Windows deployment path through Windows AI Foundry, backed by ONNX Runtime and DirectML.

That cuts setup friction.

A lot of internal software still lives on Windows. Same for desktop assistants, support tools, IDE helpers, and line-of-business apps that need to stay inside a company network. A model that runs locally on Windows 11 without a custom inference stack is immediately more useful than a lot of open model demos.

Microsoft is also giving teams two lanes. The same model family is available through Azure AI Foundry for hosted use, with AWS Marketplace mirrors as well. The pitch is straightforward: prototype locally, keep sensitive work on-device, move heavier jobs to the cloud when scale matters.
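The two-lane pitch boils down to a routing decision an app makes per request. A minimal sketch of that policy, with hypothetical backend labels and an environment variable that are illustrative only, not part of Windows AI Foundry's API:

```python
import os

# Hypothetical routing policy: backend names and the DEFAULT_LLM_BACKEND
# variable are illustrative, not a real Foundry interface.
def pick_backend(contains_sensitive_data: bool, needs_scale: bool) -> str:
    """Decide where a request runs under a simple precedence rule."""
    if contains_sensitive_data:
        return "local"   # regulated data stays on-device
    if needs_scale:
        return "azure"   # heavier jobs go to the hosted lane
    return os.environ.get("DEFAULT_LLM_BACKEND", "local")

print(pick_backend(contains_sensitive_data=True, needs_scale=True))   # local
print(pick_backend(contains_sensitive_data=False, needs_scale=True))  # azure
```

Sensitive data wins over scale here on purpose: the whole argument for local inference collapses if the router ever sends regulated data to the hosted lane.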

That’s a sensible lifecycle. It’s also a tidy way to keep developers inside Microsoft’s stack.

GPT-OSS-20B by the numbers

Per the release details, GPT-OSS-20B is a text-only transformer with:

  • 20 billion parameters
  • 64 transformer layers
  • 16 attention heads per layer
  • 2,048-token context window
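
Those numbers imply a memory footprint worth sanity-checking before buying hardware. A rough back-of-envelope calculation (weights only; activations and KV cache add more on top):

```python
def weight_gib(params: float, bits: int) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return params * bits / 8 / 2**30

# 20B parameters at common precisions
for bits, label in [(16, "fp16"), (8, "int8"), (4, "4-bit")]:
    print(f"{label}: {weight_gib(20e9, bits):.1f} GiB")
# fp16: 37.3 GiB, int8: 18.6 GiB, 4-bit: 9.3 GiB
```

At full fp16 the weights alone overflow a 16 GB card, so the 16 GB VRAM floor Microsoft quotes only makes sense with aggressive quantization of the shipped model.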

OpenAI says it was trained with standard pretraining plus RLHF to improve reasoning and tool-use behavior. The practical claim is simple: the model is supposed to handle multi-step tasks, pick the right tool, and carry the result forward with some consistency.

That’s the part worth watching. Local models often look fine in a chat window and then break once they have to do actual work inside a workflow. GPT-OSS-20B is clearly aimed at agentic use, which usually means some mix of:

  • deciding when to call a tool
  • formatting the tool request correctly
  • reading the result
  • using it in a follow-up step
  • stopping at the right time
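
That loop can be sketched in a few lines. Everything below is a stand-in: the tool registry, the JSON message shape, and the scripted model are illustrative, not a real GPT-OSS or Foundry interface.

```python
import json

# Toy tool registry. eval() stands in for a Python sandbox and is
# NOT safe for real model output; see the guardrails discussion below.
TOOLS = {
    "python": lambda code: str(eval(code)),
    "search": lambda query: f"results for {query!r}",
}

def run_agent(model_step, task, max_steps=5):
    """Feed tool results back to the model until it stops or hits the cap."""
    context = [task]
    for _ in range(max_steps):
        reply = model_step(context)        # model decides: tool call or answer
        msg = json.loads(reply)
        if msg.get("tool") in TOOLS:
            result = TOOLS[msg["tool"]](msg["args"])
            context.append(result)         # carry the result into the next step
        else:
            return msg["answer"]           # model chose to stop
    return None                            # step cap: "stopping at the right time"

def fake_model(context):
    # Scripted stand-in for the model: one tool call, then a final answer.
    if len(context) == 1:
        return json.dumps({"tool": "python", "args": "2 + 2"})
    return json.dumps({"answer": context[-1]})

print(run_agent(fake_model, "What is 2 + 2?"))  # -> 4
```

The `max_steps` cap is the unglamorous part that matters: without it, a model that never emits a final answer loops forever.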

Windows AI Foundry exposes it through familiar plumbing. The reference path uses ONNX plus DmlExecutionProvider, so GPU inference runs through DirectML instead of a CUDA-only setup.

import onnxruntime as ort

# DmlExecutionProvider routes GPU inference through DirectML (DirectX 12),
# so this works on NVIDIA and AMD cards alike, not just CUDA setups.
session = ort.InferenceSession(
    "gpt_oss_20b.onnx",
    providers=["DmlExecutionProvider"],
)

# token_ids, mask, and decode() come from the surrounding app's
# tokenizer; they are placeholders here, not part of onnxruntime.
inputs = {"input_ids": token_ids, "attention_mask": mask}
outputs = session.run(None, inputs)
generated = decode(outputs[0])

That broadens the hardware story. Microsoft says the model runs on Windows 11 systems with at least 16 GB of VRAM, including cards in the NVIDIA RTX 30-series and AMD Radeon RX 6000 class. That’s still a real hardware floor, but it’s a much larger audience than the usual data center-only setup.
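Before loading a 20B model, it's worth checking that the DirectML provider is actually present; `ort.get_available_providers()` in ONNX Runtime reports what the installed build supports. A small check, written as a function so the decision logic is testable without a GPU:

```python
def dml_available(providers) -> bool:
    """True when ONNX Runtime reports the DirectML execution provider."""
    return "DmlExecutionProvider" in providers

# At runtime you would pass ort.get_available_providers() here;
# the literal lists below just illustrate both outcomes.
print(dml_available(["DmlExecutionProvider", "CPUExecutionProvider"]))  # True
print(dml_available(["CPUExecutionProvider"]))                          # False
```

Apps that skip this check tend to fail with an opaque session-creation error on machines without a suitable GPU instead of falling back gracefully.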

Tool orchestration is the real use case

Treating this as a local ChatGPT replacement misses the point.

The better fit is a bounded workflow where the model acts as a controller over tools. Think:

  • an internal assistant that queries a local SQLite store
  • a support tool that searches docs, extracts the answer, and drafts a reply
  • a dev utility that writes Python, runs it in a sandbox, and summarizes the output
  • an offline field app in a low-bandwidth environment

The Microsoft example is basic and useful: fetch an exchange rate through web search, then run Python to calculate the result. That’s ordinary agent behavior. Ordinary is fine. Most production automation doesn’t need a general-purpose oracle. It needs a model that can call the right thing, parse the result, and stay inside the rails.
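The shape of that two-step flow is simple enough to sketch. `fetch_rate` below is a hardcoded stand-in for the model's web-search tool, and the pair and rate are made-up illustration values:

```python
def fetch_rate(pair: str) -> float:
    # Placeholder for the web-search tool call; hardcoded for illustration.
    return {"USD/EUR": 0.92}[pair]

def convert(amount: float, pair: str) -> float:
    rate = fetch_rate(pair)          # step 1: tool call fetches live data
    return round(amount * rate, 2)   # step 2: deterministic local math

print(convert(250, "USD/EUR"))  # -> 230.0
```

The division of labor is the point: the model orchestrates, the tool fetches, and the arithmetic runs as real code instead of being hallucinated in-context.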

Local deployment helps for obvious reasons. If the workflow depends on company data, proprietary scripts, internal databases, or systems that shouldn’t touch the public internet, cloud APIs can be a non-starter. Running inference on the device or inside the org perimeter cuts a lot of the pain around data residency and auditability.

Not all of it. Still a lot.

The limits are real

This setup comes with hard constraints.

The model is text-only. The context window is 2,048 tokens, which is tight by current standards. And the local hardware requirement is still substantial at 16 GB of VRAM. That rules out a lot of laptops and lower-end desktops.

There’s also the usual warning on tool-using models: a demo that can search the web and run Python is not the same thing as a production-ready agent. Reliability, sandboxing, logging, and guardrails still matter. Local deployment helps with privacy and control, but it doesn’t remove the need for careful system design.
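A floor for the sandboxing piece, sketched under the assumption that model-written Python runs as a separate process with a timeout. This only bounds runtime and strips site packages via the interpreter's `-I` isolated mode; real deployments need stronger isolation (containers, restricted tokens, network cutoff):

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    """Run model-generated Python out-of-process with a hard time limit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, no user site dirs
            capture_output=True,
            text=True,
            timeout=timeout,               # kills runaway loops
        )
        return proc.stdout.strip()
    finally:
        os.unlink(path)

print(run_untrusted("print(6 * 7)"))  # -> 42
```

Logging the code, the exit status, and stderr alongside stdout is the other half of making these workflows auditable; this sketch returns only stdout for brevity.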

Even with those limits, this is a meaningful release. A 20B OpenAI model with a first-party Windows path lowers the barrier for teams that want local AI without building the stack from scratch. For a lot of enterprise developers, that’s the difference between an experiment and something they can actually ship.

What to watch

The main caveat is that an announcement does not prove durable production value. The practical test is whether teams can use this reliably, measure the benefit, control the failure modes, and justify the cost once the initial novelty wears off.
