Thomas Wolf’s open AI argument still holds: the stack matters more than the model
TechCrunch Disrupt 2025 put Thomas Wolf onstage because he’s spent years turning "open AI" into actual software that developers use.
Wolf, Hugging Face’s co-founder and chief science officer, has been tied to some of the most important infrastructure in modern machine learning: transformers, datasets, safetensors, model cards, and the BigScience project behind BLOOM. Those projects changed how teams build with models, and who gets to build with them.
The timing matters. AI has concentrated around a small number of labs, clusters, and billing relationships. At the same time, open models have improved, serving stacks have matured, and the tooling gap has narrowed enough that serious teams can choose control over dependence. For plenty of production workloads, that’s a practical decision now, not an ideological one.
Why engineers care
Most developers don’t care much about "openness" as a slogan. They care whether they can inspect a system, reproduce it, tune it, secure it, and ship it without waiting on a vendor.
That has been Wolf’s argument for years.
An open AI stack gives teams control over the path from dataset ingestion to inference endpoints. That includes provenance, tokenization, training configs, weight formats, evaluation harnesses, guardrails, and deployment. Leave any one of those layers opaque and you inherit somebody else’s assumptions, and their failure modes.
That’s why Hugging Face’s biggest wins weren’t limited to model releases. The important work was in the boring infrastructure that removed friction across the stack.
- transformers made research architectures usable for production teams
- datasets made large-scale data ingestion and preprocessing less painful
- safetensors addressed a real supply-chain risk with a safer, faster format than Python pickle-based loading (a short loading sketch follows below)
- model cards and licensing conventions pushed documentation and governance into the release process
That mix matters more than another benchmark screenshot.
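The safetensors point is easy to make concrete. Here is a minimal round-trip sketch; the file name and tensor names are placeholders, not anything from a real checkpoint.

```python
# Minimal safetensors round trip: plain tensor data, no pickle, no arbitrary code execution.
# File and tensor names are illustrative placeholders.
import torch
from safetensors.torch import save_file, load_file

weights = {
    "embedding.weight": torch.randn(1000, 64),
    "lm_head.weight": torch.randn(1000, 64),
}

save_file(weights, "model.safetensors")      # write a flat, typed tensor archive
restored = load_file("model.safetensors")    # fast to open, and safe to load from untrusted sources
assert torch.equal(weights["lm_head.weight"], restored["lm_head.weight"])
```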
Open models are good enough for real work
For a while, the open-model debate kept collapsing into one question: can they match the best closed models? Sometimes yes, often no, depending on the task.
For many teams, that’s still the wrong starting point.
If you’re building code assistants, internal copilots, document extraction, multilingual search, compliance workflows, or domain-specific agents, the useful question is whether an open base model plus your own adaptation pipeline gets you to acceptable quality at acceptable cost. More often now, it does.
That gets clearer once you factor in:
- on-prem or VPC deployment
- lower steady-state inference costs
- domain tuning with LoRA or QLoRA
- tighter data handling in regulated environments
- the ability to swap serving engines or model families without rewriting your app contract
OpenAI-compatible REST APIs across open serving stacks have helped a lot. Teams can put multiple backends behind roughly the same interface and route requests by latency, cost, or task fit. That weakens vendor lock-in in a very literal way.
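Here is a minimal sketch of what that routing looks like, assuming two OpenAI-compatible backends; the URLs, keys, and model name are placeholders.

```python
# Route requests across OpenAI-compatible backends; URLs, keys, and model names are placeholders.
from openai import OpenAI

BACKENDS = {
    "cheap": OpenAI(base_url="http://vllm-internal:8000/v1", api_key="unused"),
    "frontier": OpenAI(base_url="https://api.example.com/v1", api_key="sk-..."),
}

def complete(prompt: str, tier: str = "cheap") -> str:
    client = BACKENDS[tier]
    resp = client.chat.completions.create(
        model="my-served-model",  # whatever name the chosen backend exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Summarize our refund policy in one sentence."))
```

The application contract stays the same; only the routing rule and the base URL change per backend.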
There’s still a quality gap at the high end. Closed frontier labs usually do better on broad reasoning, multimodal polish, and operational scale. But the gap is no longer wide enough to rule out open systems by default. That’s the change.
The hard parts are lower in the stack
People argue about model weights because they’re visible. The nastier engineering problems sit underneath.
Data is still the messiest layer
Any team building a serious open stack runs into data quality and governance first. That means deduplication beyond exact string matches, PII scrubbing, provenance tracking, and a sane way to preserve raw versus filtered datasets so auditability doesn’t disappear.
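Deduplication beyond exact matches usually means some form of fuzzy hashing. Here is a minimal near-duplicate sketch using MinHash and LSH from the datasketch library; the threshold and shingle size are illustrative, not recommendations.

```python
# Near-duplicate detection with MinHash + LSH; parameters are illustrative only.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumped over the lazy dog!",  # near duplicate of "a"
    "c": "Completely unrelated sentence about GPUs.",
}

kept = []
for doc_id, text in docs.items():
    mh = minhash(text)
    if lsh.query(mh):        # an existing near duplicate was found, so skip this document
        continue
    lsh.insert(doc_id, mh)
    kept.append(doc_id)

print(kept)  # expected: ["a", "c"]
```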
The datasets ecosystem helps, especially with Arrow-backed pipelines and streaming from Parquet or WebDataset. That avoids some ugly I/O bottlenecks once your corpora stop fitting neatly on local disks. But the tooling doesn’t answer the hard questions. Engineers still have to decide how aggressive toxicity filters should be, which sources are legally usable, and how to document takedown or opt-out paths.
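The streaming part of that is a few lines in practice. This is a minimal sketch assuming a local Parquet corpus; the path and column name are placeholders.

```python
# Stream a large Parquet corpus without materializing it on local disk.
# The data_files glob and the "text" column are placeholders.
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files="data/corpus/*.parquet",
    split="train",
    streaming=True,
)

# Filters are applied lazily; nothing is read until iteration starts.
ds = ds.filter(lambda ex: len(ex["text"]) > 200)

for example in ds.take(3):
    print(example["text"][:80])
```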
Those governance questions sound administrative right up until legal or security gets involved. Then they become architecture.
Training efficiency matters
The training recipe is clearer now: start with a solid open base model, fine-tune cheaply where you can, and save full retrains for the cases that actually justify them.
That’s why parameter-efficient methods like LoRA and QLoRA matter. They often recover most of the value of full fine-tuning at a fraction of the compute and memory cost. For enterprise teams, that can be the difference between experimentation and a real deployment pipeline.
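A minimal LoRA setup with peft looks like the sketch below. The model name and target modules are placeholders; which projection layers to target varies by architecture.

```python
# Attach LoRA adapters to a causal LM; the model name and target_modules are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-org/some-7b-model")

lora_cfg = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # depends on the architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of total parameters
```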
For larger runs, the stack is getting pretty standard:
- FSDP or ZeRO-3 for memory sharding (a minimal wrapping sketch follows this list)
- tensor or pipeline parallelism where needed
- fused kernels such as FlashAttention-3 to keep GPUs busy
- FP16 or FP8 on Hopper- and Blackwell-class hardware
- safetensors checkpoints for safer and faster weight handling
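As a sketch of the FSDP item, here is a minimal single-node setup, assuming a launch via torchrun; the toy model stands in for a real transformer and every dimension is illustrative.

```python
# Minimal FSDP sketch; run with `torchrun --nproc_per_node=N this_file.py`.
# The Sequential model is a stand-in for a real transformer.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())  # single-node assumption

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks; keep compute in bf16.
    fsdp_model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    )

    optim = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")
    loss = fsdp_model(x).pow(2).mean()
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```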
None of this is glamorous. All of it decides whether a training job finishes on time or burns money while stalling on memory pressure and Python overhead.
Alignment has shifted too. A lot of teams now prefer DPO over a full RLHF pipeline when the goal is steerability and preference shaping without building out a large reinforcement learning stack. That trade-off is sensible. RLHF still has its place, but it’s expensive, fiddly, and often too much for product teams that mostly want better refusal behavior and response tone.
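To make that trade-off concrete, here is the core of the DPO objective as a standalone loss over per-sequence log-probabilities. Libraries such as trl wrap this in data handling and a training loop; treat this as an illustration of the objective, not a full pipeline.

```python
# Core DPO loss over summed per-sequence log-probs; an illustration, not a training pipeline.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log p_theta(chosen) per example
    policy_rejected_logp: torch.Tensor,  # log p_theta(rejected) per example
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(
    torch.tensor([-12.3]), torch.tensor([-15.1]),
    torch.tensor([-13.0]), torch.tensor([-14.8]),
)
print(loss.item())
```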
Inference is where open AI gets practical
Serving is the part of the open stack that’s become hard to dismiss.
A few years ago, deploying large open models usually meant custom engineering, fragile performance, or both. Now there are several competent options, each with clear trade-offs:
- vLLM for high-throughput generation, especially with PagedAttention (a short sketch follows this list)
- Hugging Face TGI for production-friendly text generation serving
- TensorRT-LLM when you want maximum GPU efficiency and you're willing to tune for Nvidia's stack
- llama.cpp and GGUF for CPU and edge deployments that would have sounded unrealistic not long ago
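A minimal offline vLLM sketch, assuming the model fits on the local GPU; the model name is a placeholder.

```python
# Offline batch generation with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="some-org/some-7b-instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(
    [
        "Explain PagedAttention in two sentences.",
        "List three risks of loading pickle-based checkpoints.",
    ],
    params,
)

for out in outputs:
    print(out.outputs[0].text)
```

The same model can also be exposed as an OpenAI-compatible HTTP endpoint, which is what makes the backend-swapping pattern above practical.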
Quantization matters here too. Techniques like AWQ and GPTQ make 4-bit and 8-bit serving viable for many workloads, though teams still need to test task accuracy instead of trusting a generic perplexity number. Some layers quantize badly. Long-context behavior can degrade in ways that only show up under real traffic.
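The "some layers quantize badly" point is easy to see with a toy round trip: symmetric 4-bit quantization of a weight vector with a single outlier loses far more precision than a well-behaved one. This is a plain-PyTorch illustration, not the AWQ or GPTQ algorithms themselves.

```python
# Toy symmetric int4 round trip showing why outlier-heavy layers quantize badly.
# This is an illustration, not AWQ or GPTQ.
import torch

def int4_roundtrip_error(w: torch.Tensor) -> float:
    scale = w.abs().max() / 7.0                  # 4-bit signed range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (w - q * scale).abs().mean().item()

well_behaved = torch.randn(4096)
with_outlier = well_behaved.clone()
with_outlier[0] = 60.0                           # a single large outlier stretches the scale

print(int4_roundtrip_error(well_behaved))        # small reconstruction error
print(int4_roundtrip_error(with_outlier))        # much larger error for the same layer
```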
Speculative decoding is another meaningful shift. It cuts latency when quick first-token response matters, and it fits well with bursty production workloads. Combined with KV cache reuse and decent observability, it changes the economics of interactive applications.
That last part gets ignored too often. If you’re running open inference in production, you want token-level latency, sequence lengths, cache hit rates, refusal rates, and per-request cost attribution. Without that, you’re operating blind.
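One minimal shape for that telemetry is sketched below; the prices and field names are placeholders, and the point is capturing per-request cost and latency in one record, not this exact schema.

```python
# Minimal per-request record for inference telemetry; prices and fields are placeholders.
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    prompt_tokens: int
    completion_tokens: int
    first_token_ms: float
    total_ms: float
    cache_hit: bool
    refused: bool

    def cost_usd(self, in_per_1k: float = 0.0002, out_per_1k: float = 0.0006) -> float:
        return (self.prompt_tokens * in_per_1k + self.completion_tokens * out_per_1k) / 1000

start = time.perf_counter()
# ... call the serving backend here, recording first-token time as the response streams ...
m = RequestMetrics(
    prompt_tokens=412, completion_tokens=96,
    first_token_ms=180.0, total_ms=(time.perf_counter() - start) * 1000,
    cache_hit=True, refused=False,
)
print(f"${m.cost_usd():.6f} total, {m.first_token_ms:.0f} ms to first token")
```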
Openness creates work, and options
There’s a lazy version of the open-model pitch that treats openness as free performance or free independence. It isn’t.
If you own more of the system, you own more of the mess. Security hardening, artifact signing, weight provenance, abuse monitoring, evaluation drift, model updates, infrastructure tuning, and compliance mapping all land on your team. Closed APIs abstract a lot of that away. They also obscure trade-offs you may care about later.
That’s the split technical leaders have to think through.
If you have low volume, limited infrastructure support, or no appetite for operating model serving, closed APIs are often the sensible choice. They’re faster to adopt and easier to defend internally.
If you have steady usage, sensitive data, strict latency targets, or deep customization needs, the economics can swing toward open systems pretty fast, especially once egress, token pricing, and vendor roadmap risk show up.
Wolf’s argument has held up well. The durable advantage isn’t a single checkpoint. It’s the stack around it, the standards around that stack, and the fact that thousands of engineers can inspect, improve, and replace parts of it without asking permission.
Senior teams should take that seriously. Pick models carefully. Spend at least as much time on your data pipeline, serving layer, checkpoint format, eval setup, and governance model.
Those decisions will still matter after the benchmark war has moved on.