LLM · November 19, 2025


Hugging Face CEO on the LLM bubble and why AI may hold up better

Hugging Face’s CEO says the bubble is in LLMs. He’s probably right.

Clem Delangue, the CEO of Hugging Face, said this week that we’re in an LLM bubble, not an AI bubble, and that he expects it to start deflating next year.

The distinction matters.

If he’s right, the damage won’t spread evenly across AI. It’ll hit the part of the market built around giant, expensive, general-purpose chat models with weak margins and a pile of copycat products. The rest of the stack keeps moving, especially where smaller models, retrieval systems, vision, audio, and domain tooling solve concrete problems faster and for less money.

For developers, this reads less like a market prediction and more like an architecture warning. Too many teams are still building as if every problem needs one huge model behind an API. That already looks like a bad default.

Where the bubble is

Delangue’s point is simple enough. Large language models pulled in too much money and attention because they demo well, grab headlines, and look like a universal interface. But a lot of production work doesn’t need a model that can riff on philosophy, write sonnets, and explain quantum mechanics.

It needs to classify a support request, extract fields from a document, route a claim, summarize a call, or answer a policy question without inventing facts.

That usually favors a narrower stack:

  • a smaller model
  • retrieval over current internal data
  • constrained outputs
  • strong evaluation
  • predictable latency

The banking chatbot example Delangue used gets to the point. If the job is answering account and policy questions, the hard parts are compliance, consistency, and speed. Broad world knowledge mostly gets in the way. A giant chat model can do the job. Often it does it at the wrong price, with the wrong failure modes.

That’s one weak spot in the current LLM market. General models are impressive. Many are also expensive in ways that stop making sense once the demo ends.

Demos flatter big models. Production doesn’t.

Anyone who’s had to run these systems at scale already knows the main problem: inference cost stays stubborn.

A 7B parameter model quantized to int4 needs roughly 3.5 GB just for weights before KV cache and serving overhead. That’s workable on a single 16 GB or 24 GB GPU. You can batch requests, keep latency reasonable, and avoid turning inference spend into a board meeting topic.

A 70B model is another story. Now you’re usually dealing with multi-GPU serving, tensor parallelism, more complex scheduling, and uglier economics when traffic spikes. Long context makes it worse. Teams love large context windows because they hide bad system design for a while, but KV cache memory grows fast. At 32K context, memory pressure can hammer throughput even with paged attention and decent serving infrastructure.
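For a rough sense of where those numbers come from, here is a back-of-envelope sketch. The layer count, KV-head count, and head dimension are assumptions modeled on a typical 7B Llama-style configuration, not figures for any specific model.

```python
# Back-of-envelope memory math for serving a decoder-only LLM.
# Illustrative only: the config values below are assumptions for a
# generic 7B Llama-style model with grouped-query attention.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for weights alone, before KV cache and serving overhead."""
    return n_params * (bits_per_param / 8) / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per sequence, in fp16."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 1e9

# 7B parameters quantized to int4: roughly 3.5 GB of weights.
print(f"7B @ int4 weights: {weight_memory_gb(7e9, 4):.1f} GB")

# Assumed 7B-class config (32 layers, 8 KV heads, head_dim 128) at 32K context.
print(f"KV cache, 32K context, batch 8: "
      f"{kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768, batch=8):.1f} GB")
```

Even with these optimistic assumptions, the KV cache at long context dwarfs the quantized weights, which is exactly why long prompts drag throughput down.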

Tools like vLLM, Hugging Face TGI, TensorRT-LLM, and ONNX Runtime have improved the picture. Continuous batching, speculative decoding, and smarter cache handling matter. They don’t change the math. Bigger models cost more to run, and long prompts still drag down throughput.

That’s why retrieval-heavy designs keep doing well in real systems. A smaller model with RAG, tight prompts, and function calls often beats a much larger model forced to chew through bloated prompt context full of stale text.

Smaller models are getting the job done

The case for smaller language models, or at least more targeted ones, has gotten stronger over the past year.

Models in the 1B to 13B range are now good enough for plenty of enterprise work, especially after fine-tuning or distillation. With LoRA or QLoRA, teams can adapt a base model to a domain without retraining the whole thing. They can keep multiple adapters, switch them at inference time, and keep the serving footprint under control.

That matters because most enterprise tasks are narrow. You don’t need broad open-ended reasoning to classify loan servicing intents or normalize medical billing codes. You need stable behavior under messy inputs.

Smaller tuned models are often easier to test, too. If the output is schema-bound, tool-routed, and constrained by retrieval, you can evaluate it in ways product and compliance teams will actually trust. That gets much harder when a giant chat model is improvising.

There are limits. Small models still break on edge cases, multilingual nuance, longer reasoning chains, and tasks that combine several kinds of inference at once. Plenty of teams will still need larger models for coding help, research workflows, agentic planning, or customer-facing interactions where ambiguity is part of the job.

Still, the market got ahead of itself by treating the biggest model as the safest choice. In a lot of organizations, it’s just the easiest one to buy.

The stack is getting modular again

If the LLM bubble cools, the next phase will probably look less monolithic.

Instead of one giant model doing everything badly and expensively, teams will build systems from components:

  • a retriever with hybrid search, often sparse plus dense
  • a reranker to cut junk from top results
  • a smaller generation model
  • task-specific classifiers and extractors
  • function calling into actual business logic
  • observability and evals wired into the pipeline

That’s less glamorous, but it fits how production software usually wins. Piece by piece. Measured. Replaceable.

Retrieval matters a lot here. Bad RAG has given the pattern a bad reputation, but good RAG works. Hybrid search with BM25 plus embeddings catches exact terms and semantic matches. A cross-encoder reranker over the top 20 to 50 hits can improve relevance dramatically. Prompt templates with citations make outputs easier to audit. None of this is new, which is partly why it works. Mature systems usually beat expensive magic tricks.

Mixture-of-Experts models could complicate the picture. MoE offers a way to increase capacity without paying full dense-model compute on every token. But serving and scaling MoE cleanly is still messy, and for most buyers it doesn’t solve the bigger issue: paying for capability they don’t need.

An LLM reset would help open-weight models

A cooler LLM market would also strengthen the case for open-weight models.

If general chat endpoints start to look interchangeable, buyers will care more about price per million tokens, data controls, licensing clarity, and deployment flexibility. That’s where open models from families like Llama, Mistral, Phi, and Gemma have room to gain, especially in VPC or on-prem environments where governance matters more than squeezing out one more benchmark point.

Hugging Face is in a good position if that shift happens. Delangue said the company still has about half of the $400 million it raised sitting in the bank. That gives Hugging Face room to keep investing across tooling, infrastructure, and multimodal work instead of chasing a single giant-model strategy.

That seems sensible. A lot of durable value in AI may end up sitting around the model rather than inside the biggest frontier model: selection, distribution, evaluation, security, and deployment.

What technical teams should do now

If you’re building AI features right now, the practical takeaway is blunt: stop treating model size as product strategy.

A few priorities stand out.

Start with evals and unit economics

Measure quality against your task, not a leaderboard screenshot. If you run support automation, track cost per resolved ticket at a target latency and error rate. If you process documents, measure extraction accuracy against real document variance, not curated samples.

Benchmarks like MMLU, GSM8K, HumanEval, or MTEB are useful as rough baselines. They are not deployment criteria.
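Here is what "cost per resolved ticket at a target latency and error rate" can look like as an actual metric. All prices, thresholds, and field names below are made-up assumptions to show the shape of the calculation, not benchmarks for any real model or deployment.

```python
# Toy unit-economics report for a support-automation workflow.
from dataclasses import dataclass

@dataclass
class TicketRun:
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    resolved: bool        # resolved without human escalation
    correct: bool         # passed the eval check for this ticket

PRICE_PER_M_INPUT = 0.50   # assumed $ per 1M input tokens
PRICE_PER_M_OUTPUT = 1.50  # assumed $ per 1M output tokens
LATENCY_TARGET_S = 2.0
ERROR_BUDGET = 0.02

def report(runs: list[TicketRun]) -> None:
    cost = sum(r.prompt_tokens * PRICE_PER_M_INPUT +
               r.completion_tokens * PRICE_PER_M_OUTPUT for r in runs) / 1e6
    resolved = [r for r in runs if r.resolved]
    latencies = sorted(r.latency_s for r in runs)
    p95 = latencies[max(int(0.95 * len(latencies)) - 1, 0)]
    error_rate = 1 - sum(r.correct for r in runs) / len(runs)
    print(f"cost per resolved ticket: ${cost / max(len(resolved), 1):.4f}")
    print(f"p95 latency: {p95:.2f}s (target {LATENCY_TARGET_S}s)")
    print(f"error rate: {error_rate:.1%} (budget {ERROR_BUDGET:.0%})")
```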

Design retrieval before you pick the model

A lot of teams still do this backward. They pick a big model first, then try to patch hallucinations with retrieval later.

Use the smallest model that passes once retrieval, reranking, and structured outputs are in place. FAISS is still a solid option on bare metal. pgvector keeps things simple if you’re already on Postgres. Managed vector stores make sense when ops time is the actual constraint.
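If FAISS is the starting point, the baseline can stay very small. A minimal sketch with exact inner-product search over normalized embeddings; the dimension and the random vectors are placeholders for whatever embedding model passed your retrieval evals.

```python
# Tiny FAISS baseline: exact nearest-neighbor search over normalized vectors.
import faiss
import numpy as np

dim = 384
doc_vecs = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(doc_vecs)              # inner product now behaves like cosine similarity

index = faiss.IndexFlatIP(dim)            # exact search; fine well into the millions of vectors
index.add(doc_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)     # top-10 nearest documents
```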

Keep inference boring

Quantize aggressively, but test drift. Keep an fp16 canary around. Use vLLM or TGI if throughput matters. Watch p95 latency, cache behavior, and prompt length. Long context is often a system smell.
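One way to test drift against an fp16 canary is a fixed prompt set run through both the reference and the quantized candidate. The sketch below uses a naive token-overlap score for brevity; real checks should reuse your task evals. The generate_* callables are hypothetical wrappers around your two endpoints.

```python
# Sketch of a quantization drift check against an fp16 reference.

def token_overlap(a: str, b: str) -> float:
    """Crude Jaccard overlap between two outputs; stand-in for a real task eval."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def drift_report(canary_prompts, generate_fp16, generate_quant, threshold=0.85):
    """generate_fp16 / generate_quant are hypothetical callables wrapping the two deployments."""
    flagged = []
    for prompt in canary_prompts:
        ref, cand = generate_fp16(prompt), generate_quant(prompt)
        if token_overlap(ref, cand) < threshold:
            flagged.append(prompt)
    print(f"{len(flagged)}/{len(canary_prompts)} canary prompts drifted past threshold")
    return flagged
```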

Treat security as part of the system

Prompt injection, data exfiltration, and model supply-chain issues are now standard engineering problems. Model artifacts need provenance. Policies need to be enforceable in CI. If your LLM workflow can call tools or query internal data, build controls around the assumption that input is hostile.
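A minimal version of "assume input is hostile" for tool calling is an explicit allowlist plus argument validation before anything reaches business logic. The tool names and schemas below are illustrative assumptions.

```python
# Sketch: validate model-proposed tool calls against an allowlist before execution.
import json

ALLOWED_TOOLS = {
    "get_account_balance": {"account_id"},
    "open_support_ticket": {"summary", "priority"},
}

def validate_tool_call(raw: str) -> dict:
    """Parse an untrusted tool-call payload and reject anything off the allowlist."""
    call = json.loads(raw)                      # model output is untrusted text, never code
    name, args = call.get("name"), call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"tool {name!r} is not allowed")
    unexpected = set(args) - ALLOWED_TOOLS[name]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return {"name": name, "arguments": args}
```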

That work has been underfunded because the market rewarded demos. That won’t hold forever.

Delangue may be early on timing, but the technical direction is hard to dispute. Throwing a giant chat model at every problem is already wearing thin. Teams that build for fit, cost, and control will be in better shape when budgets tighten.

What to watch

The main caveat is timing. A prediction from one CEO does not make a market turn, and calls like "next year" are easy to get wrong. The practical test is whether teams running smaller, more modular systems can do so reliably, measure the benefit, control the failure modes, and justify the cost once the initial novelty wears off.
