Thomas Wolf’s open AI argument still holds: the stack matters more than the model
TechCrunch Disrupt 2025 put Thomas Wolf onstage because he’s spent years turning "open AI" into actual software that developers use.
Wolf, Hugging Face’s co-founder and chief science officer, has been tied to some of the most important infrastructure in modern machine learning: transformers, datasets, safetensors, model cards, and the BigScience project behind BLOOM. Those projects changed how teams build with models, and who gets to build with them.
The timing matters. AI has concentrated around a small number of labs, clusters, and billing relationships. At the same time, open models have improved, serving stacks have matured, and the tooling gap has narrowed enough that serious teams can choose control over dependence. For plenty of production workloads, that’s a practical decision now, not an ideological one.
Why engineers care
Most developers don’t care much about "openness" as a slogan. They care whether they can inspect a system, reproduce it, tune it, secure it, and ship it without waiting on a vendor.
That has been Wolf’s argument for years.
An open AI stack gives teams control over the path from dataset ingestion to inference endpoints. That includes provenance, tokenization, training configs, weight formats, evaluation harnesses, guardrails, and deployment. Leave any one of those layers opaque and you inherit somebody else’s assumptions, and their failure modes.
That’s why Hugging Face’s biggest wins weren’t limited to model releases. The important work was in the boring infrastructure that removed friction across the stack.
- transformers made research architectures usable for production teams
- datasets made large-scale data ingestion and preprocessing less painful
- safetensors addressed a real supply-chain risk with a safer, faster format than Python pickle-based loading (a short loading sketch follows below)
- model cards and licensing conventions pushed documentation and governance into the release process
That mix matters more than another benchmark screenshot.
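The safetensors point is easy to make concrete. Here is a minimal round-trip sketch; the file name and tensor names are placeholders, not anything from a real checkpoint.

```python
# Minimal safetensors round trip: plain tensor data, no pickle, no arbitrary code execution.
# File and tensor names are illustrative placeholders.
import torch
from safetensors.torch import save_file, load_file

weights = {
    "embedding.weight": torch.randn(1000, 64),
    "lm_head.weight": torch.randn(1000, 64),
}

save_file(weights, "model.safetensors")      # write a flat, typed tensor archive
restored = load_file("model.safetensors")    # fast to open, and safe to load from untrusted sources
assert torch.equal(weights["lm_head.weight"], restored["lm_head.weight"])
```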
Open models are good enough for real work
For a while, the open-model debate kept collapsing into one question: can they match the best closed models? Sometimes yes, often no, depending on the task.
For many teams, that’s still the wrong starting point.
If you’re building code assistants, internal copilots, document extraction, multilingual search, compliance workflows, or domain-specific agents, the useful question is whether an open base model plus your own adaptation pipeline gets you to acceptable quality at acceptable cost. More often now, it does.
That gets clearer once you factor in:
- on-prem or VPC deployment
- lower steady-state inference costs
- domain tuning with LoRA or QLoRA
- tighter data handling in regulated environments
- the ability to swap serving engines or model families without rewriting your app contract
OpenAI-compatible REST APIs across open serving stacks have helped a lot. Teams can put multiple backends behind roughly the same interface and route requests by latency, cost, or task fit. That weakens vendor lock-in in a very literal way.
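Here is a minimal sketch of what that routing looks like, assuming two OpenAI-compatible backends; the URLs, keys, and model name are placeholders.

```python
# Route requests across OpenAI-compatible backends; URLs, keys, and model names are placeholders.
from openai import OpenAI

BACKENDS = {
    "cheap": OpenAI(base_url="http://vllm-internal:8000/v1", api_key="unused"),
    "frontier": OpenAI(base_url="https://api.example.com/v1", api_key="sk-..."),
}

def complete(prompt: str, tier: str = "cheap") -> str:
    client = BACKENDS[tier]
    resp = client.chat.completions.create(
        model="my-served-model",  # whatever name the chosen backend exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Summarize our refund policy in one sentence."))
```

The application contract stays the same; only the routing rule and the base URL change per backend.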
There’s still a quality gap at the high end. Closed frontier labs usually do better on broad reasoning, multimodal polish, and operational scale. But the gap is no longer wide enough to rule out open systems by default. That’s the change.
The hard parts are lower in the stack
People argue about model weights because they’re visible. The nastier engineering problems sit underneath.
Data is still the messiest layer
Any team building a serious open stack runs into data quality and governance first. That means deduplication beyond exact string matches, PII scrubbing, provenance tracking, and a sane way to preserve raw versus filtered datasets so auditability doesn’t disappear.
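Deduplication beyond exact matches usually means some form of fuzzy hashing. Here is a minimal near-duplicate sketch using MinHash and LSH from the datasketch library; the threshold and shingle size are illustrative, not recommendations.

```python
# Near-duplicate detection with MinHash + LSH; parameters are illustrative only.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumped over the lazy dog!",  # near duplicate of "a"
    "c": "Completely unrelated sentence about GPUs.",
}

kept = []
for doc_id, text in docs.items():
    mh = minhash(text)
    if lsh.query(mh):        # an existing near duplicate was found, so skip this document
        continue
    lsh.insert(doc_id, mh)
    kept.append(doc_id)

print(kept)  # expected: ["a", "c"]
```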
The datasets ecosystem helps, especially with Arrow-backed pipelines and streaming from Parquet or WebDataset. That avoids some ugly I/O bottlenecks once your corpora stop fitting neatly on local disks. But the tooling doesn’t answer the hard questions. Engineers still have to decide how aggressive toxicity filters should be, which sources are legally usable, and how to document takedown or opt-out paths.
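The streaming part of that is a few lines in practice. This is a minimal sketch assuming a local Parquet corpus; the path and column name are placeholders.

```python
# Stream a large Parquet corpus without materializing it on local disk.
# The data_files glob and the "text" column are placeholders.
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files="data/corpus/*.parquet",
    split="train",
    streaming=True,
)

# Filters are applied lazily; nothing is read until iteration starts.
ds = ds.filter(lambda ex: len(ex["text"]) > 200)

for example in ds.take(3):
    print(example["text"][:80])
```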
Those governance questions sound administrative right up until legal or security gets involved. Then they become architecture.
Training efficiency matters
The training recipe is clearer now: start with a solid open base model, fine-tune cheaply where you can, and save full retrains for the cases that actually justify them.
That’s why parameter-efficient methods like LoRA and QLoRA matter. They often recover most of the value of full fine-tuning at a fraction of the compute and memory cost. For enterprise teams, that can be the difference between experimentation and a real deployment pipeline.
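A minimal LoRA setup with peft looks like the sketch below. The model name and target modules are placeholders; which projection layers to target varies by architecture.

```python
# Attach LoRA adapters to a causal LM; the model name and target_modules are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-org/some-7b-model")

lora_cfg = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # depends on the architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of total parameters
```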
For larger runs, the stack is getting pretty standard:
- FSDP or ZeRO-3 for memory sharding (a minimal wrapping sketch follows this list)
- tensor or pipeline parallelism where needed
- fused kernels such as FlashAttention-3 to keep GPUs busy
- FP16 or FP8 on Hopper- and Blackwell-class hardware
- safetensors checkpoints for safer and faster weight handling
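As a sketch of the FSDP item, here is a minimal single-node setup, assuming a launch via torchrun; the toy model stands in for a real transformer and every dimension is illustrative.

```python
# Minimal FSDP sketch; run with `torchrun --nproc_per_node=N this_file.py`.
# The Sequential model is a stand-in for a real transformer.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())  # single-node assumption

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks; keep compute in bf16.
    fsdp_model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    )

    optim = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")
    loss = fsdp_model(x).pow(2).mean()
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```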
None of this is glamorous. All of it decides whether a training job finishes on time or burns money while stalling on memory pressure and Python overhead.
Alignment has shifted too. A lot of teams now prefer DPO over a full RLHF pipeline when the goal is steerability and preference shaping without building out a large reinforcement learning stack. That trade-off is sensible. RLHF still has its place, but it’s expensive, fiddly, and often too much for product teams that mostly want better refusal behavior and response tone.
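To make that trade-off concrete, here is the core of the DPO objective as a standalone loss over per-sequence log-probabilities. Libraries such as trl wrap this in data handling and a training loop; treat this as an illustration of the objective, not a full pipeline.

```python
# Core DPO loss over summed per-sequence log-probs; an illustration, not a training pipeline.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log p_theta(chosen) per example
    policy_rejected_logp: torch.Tensor,  # log p_theta(rejected) per example
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(
    torch.tensor([-12.3]), torch.tensor([-15.1]),
    torch.tensor([-13.0]), torch.tensor([-14.8]),
)
print(loss.item())
```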
Inference is where open AI gets practical
Serving is the part of the open stack that’s become hard to dismiss.
A few years ago, deploying large open models usually meant custom engineering, fragile performance, or both. Now there are several competent options, each with clear trade-offs:
- vLLM for high-throughput generation, especially with PagedAttention (a short sketch follows this list)
- Hugging Face TGI for production-friendly text generation serving
- TensorRT-LLM when you want maximum GPU efficiency and you're willing to tune for Nvidia's stack
- llama.cpp and GGUF for CPU and edge deployments that would have sounded unrealistic not long ago
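A minimal offline vLLM sketch, assuming the model fits on the local GPU; the model name is a placeholder.

```python
# Offline batch generation with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="some-org/some-7b-instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(
    [
        "Explain PagedAttention in two sentences.",
        "List three risks of loading pickle-based checkpoints.",
    ],
    params,
)

for out in outputs:
    print(out.outputs[0].text)
```

The same model can also be exposed as an OpenAI-compatible HTTP endpoint, which is what makes the backend-swapping pattern above practical.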
Quantization matters here too. Techniques like AWQ and GPTQ make 4-bit and 8-bit serving viable for many workloads, though teams still need to test task accuracy instead of trusting a generic perplexity number. Some layers quantize badly. Long-context behavior can degrade in ways that only show up under real traffic.
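The "some layers quantize badly" point is easy to see with a toy round trip: symmetric 4-bit quantization of a weight vector with a single outlier loses far more precision than a well-behaved one. This is a plain-PyTorch illustration, not the AWQ or GPTQ algorithms themselves.

```python
# Toy symmetric int4 round trip showing why outlier-heavy layers quantize badly.
# This is an illustration, not AWQ or GPTQ.
import torch

def int4_roundtrip_error(w: torch.Tensor) -> float:
    scale = w.abs().max() / 7.0                  # 4-bit signed range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return (w - q * scale).abs().mean().item()

well_behaved = torch.randn(4096)
with_outlier = well_behaved.clone()
with_outlier[0] = 60.0                           # a single large outlier stretches the scale

print(int4_roundtrip_error(well_behaved))        # small reconstruction error
print(int4_roundtrip_error(with_outlier))        # much larger error for the same layer
```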
Speculative decoding is another meaningful shift. It cuts latency when quick first-token response matters, and it fits well with bursty production workloads. Combined with KV cache reuse and decent observability, it changes the economics of interactive applications.
That last part gets ignored too often. If you’re running open inference in production, you want token-level latency, sequence lengths, cache hit rates, refusal rates, and per-request cost attribution. Without that, you’re operating blind.
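One minimal shape for that telemetry is sketched below; the prices and field names are placeholders, and the point is capturing per-request cost and latency in one record, not this exact schema.

```python
# Minimal per-request record for inference telemetry; prices and fields are placeholders.
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    prompt_tokens: int
    completion_tokens: int
    first_token_ms: float
    total_ms: float
    cache_hit: bool
    refused: bool

    def cost_usd(self, in_per_1k: float = 0.0002, out_per_1k: float = 0.0006) -> float:
        return (self.prompt_tokens * in_per_1k + self.completion_tokens * out_per_1k) / 1000

start = time.perf_counter()
# ... call the serving backend here, recording first-token time as the response streams ...
m = RequestMetrics(
    prompt_tokens=412, completion_tokens=96,
    first_token_ms=180.0, total_ms=(time.perf_counter() - start) * 1000,
    cache_hit=True, refused=False,
)
print(f"${m.cost_usd():.6f} total, {m.first_token_ms:.0f} ms to first token")
```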
Openness creates work, and options
There’s a lazy version of the open-model pitch that treats openness as free performance or free independence. It isn’t.
If you own more of the system, you own more of the mess. Security hardening, artifact signing, weight provenance, abuse monitoring, evaluation drift, model updates, infrastructure tuning, and compliance mapping all land on your team. Closed APIs abstract a lot of that away. They also obscure trade-offs you may care about later.
That’s the split technical leaders have to think through.
If you have low volume, limited infrastructure support, or no appetite for operating model serving, closed APIs are often the sensible choice. They’re faster to adopt and easier to defend internally.
If you have steady usage, sensitive data, strict latency targets, or deep customization needs, the economics can swing toward open systems pretty fast, especially once egress, token pricing, and vendor roadmap risk show up.
Wolf’s argument has held up well. The durable advantage isn’t a single checkpoint. It’s the stack around it, the standards around that stack, and the fact that thousands of engineers can inspect, improve, and replace parts of it without asking permission.
Senior teams should take that seriously. Pick models carefully. Spend at least as much time on your data pipeline, serving layer, checkpoint format, eval setup, and governance model.
Those decisions will still matter after the benchmark war has moved on.