June 21, 2025

Large language models in 2026: what has actually settled for developers

LLMs in 2026: what developers still get wrong about scale, deployment, and real-world use

IBM’s latest large language model explainer isn’t groundbreaking, but it does capture where the market has settled after two years of shipping and rework. The basics are stable. Transformers still run the show. Bigger models still help, up to a point. Retrieval, quantization, and deployment discipline matter far more than many teams expected in 2024.

That’s the part worth focusing on.

For most engineering teams, the hard problems now are model selection, fine-tuning versus retrieval, latency, and keeping the whole stack from becoming an expensive hallucination factory.

The model race cooled off. Systems work didn’t.

LLMs still sit on the same foundation: transformer architectures trained on massive text corpora, often mixed with code, images, and audio. IBM runs through the usual model families, including GPT-class systems, Llama variants, PaLM-era Google models, and IBM’s Granite line. None of that is surprising.

What changed is how teams think about deploying them.

A year ago, plenty of people still treated parameter count as the main proxy for model quality. That view aged badly. Chinchilla-style scaling laws already pushed the field away from brute-force size, and the practical lesson is now obvious. Training budgets and serving budgets matter as much as raw parameters. A well-trained midsize model with solid data, decent instruction tuning, and the right retrieval setup often outperforms a larger general-purpose model on real business tasks, especially when cost and latency matter.

Too many projects still start with prestige model shopping. They should start with task shape, latency budget, privacy constraints, and failure tolerance.

If the workload is internal document Q&A, coding help inside a known stack, or summarization of structured business text, a smaller model with retrieval will often produce the better system.

Transformers still matter, even if the architecture lecture is old news

The core architecture hasn’t changed enough to skip the basics. Self-attention still lets a model weigh relationships between tokens across a sequence. Multi-head attention still helps it track different context patterns in parallel. The underlying code is familiar by now: project queries, keys, and values, split into heads, compute attention scores, combine the result.
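That project-split-attend-combine loop fits in a few lines. A toy single-sequence sketch in NumPy (no masking, no batching, not a production kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    """Toy multi-head self-attention over one sequence of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project inputs to queries, keys, values, then split into heads.
    q = (x @ wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: scores are (n_heads, seq_len, seq_len),
    # which is exactly why long sequences hit memory quadratically.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = scores @ v
    # Recombine heads and apply the output projection.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model) @ wo

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 8, 4
w = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(rng.standard_normal((seq_len, d_model)), *w, n_heads)
print(y.shape)  # (8, 16)
```

The `(n_heads, seq_len, seq_len)` score tensor is the operational point: attention cost grows with the square of sequence length.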

The important part is operational.

Attention is expensive. Long sequences hit memory and latency hard. That’s why context windows became a product feature, and why teams keep running into bills they didn’t plan for. Long prompts are convenient in prototyping. In production, they’re a tax.

When vendors pitch giant context windows, read the fine print. Check the token bill. Check latency under realistic concurrency. Check whether the application actually needs a 100k-token prompt, or whether the retrieval layer is just sloppy.
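"Check the token bill" is easy to make concrete with back-of-envelope arithmetic. The rates below are made-up placeholders, not any vendor's real prices:

```python
def monthly_prompt_cost(prompt_tokens, output_tokens, requests_per_day,
                        usd_per_1k_input=0.01, usd_per_1k_output=0.03):
    """Rough monthly spend for a fixed prompt shape. Rates are placeholders."""
    per_request = (prompt_tokens / 1000) * usd_per_1k_input \
                + (output_tokens / 1000) * usd_per_1k_output
    return per_request * requests_per_day * 30

# A 100k-token prompt vs. a retrieval-trimmed 4k-token prompt, same traffic.
big = monthly_prompt_cost(100_000, 500, requests_per_day=10_000)
small = monthly_prompt_cost(4_000, 500, requests_per_day=10_000)
print(f"${big:,.0f}/mo vs ${small:,.0f}/mo")  # $304,500/mo vs $16,500/mo
```

Same model, same traffic, roughly 18x the bill. That gap is usually the retrieval layer's fault, not the vendor's.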

A lot of architecture talk in production comes down to basic systems hygiene.

Fine-tuning helps. Retrieval is often the first thing to fix.

IBM puts fine-tuning, RLHF, and RAG in the same frame, which is fair as a survey. In practice, though, they solve different problems.

Fine-tuning changes model behavior. It helps with tone, format compliance, workflow-specific outputs, and domain tasks where labeled examples actually capture what good behavior looks like.

RAG changes what the model sees at inference time. It’s usually the cleaner answer when the problem is missing knowledge or frequently changing knowledge.

Teams still get this wrong. If the source material changes every week, or the system needs grounded answers from internal docs, policy files, tickets, contracts, or codebases, retraining the model is the wrong operating model. You want a retrieval pipeline, sane chunking, decent embeddings, and a reliable way to keep the source of truth current.

The sample stack IBM references, with FAISS, Hugging Face embeddings, and a generator model, is still a reasonable mental model for a basic RAG app. But a demo stack isn’t a reliable one.
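The shape of that stack is worth seeing in miniature. This sketch swaps in a toy bag-of-words embedder and a NumPy dot product where a real pipeline would use Hugging Face embeddings and FAISS; the corpus and query are invented examples:

```python
from collections import Counter
import numpy as np

docs = ["refund policy for enterprise accounts",
        "onboarding checklist for new hires",
        "refund timelines and exceptions"]
vocab = {tok: i for i, tok in
         enumerate(sorted({w for d in docs for w in d.lower().split()}))}

def embed(texts):
    """Stand-in embedding: normalized bag-of-words over the corpus vocabulary.
    A real stack would call a Hugging Face sentence-embedding model here."""
    out = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for tok, n in Counter(t.lower().split()).items():
            if tok in vocab:
                out[i, vocab[tok]] = n
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

doc_vecs = embed(docs)

def retrieve(query, k=2):
    """Cosine top-k over the document index; FAISS plays this role at scale."""
    sims = doc_vecs @ embed([query])[0]
    top = np.argsort(-sims)[:k]
    return [(docs[i], round(float(sims[i]), 3)) for i in top]

hits = retrieve("refund timelines for enterprise")
# The retrieved chunks, plus citations, then get packed into the generator's prompt.
```

Everything that makes this reliable in production (chunking, reranking, metadata filters, evaluation) lives around this core, not inside it.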

Retrieval quality depends on boring details:

  • chunk size and overlap
  • embedding model choice
  • metadata filtering
  • reranking
  • citation handling
  • prompt construction
  • evaluation against real queries, not demo prompts

Get those wrong and RAG turns into a system that feeds irrelevant text to a model that confidently assembles nonsense.
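Chunk size and overlap, the first two items on that list, come down to one small function that quietly shapes everything downstream. A word-based sketch (real pipelines usually split on tokens or sentence boundaries, and the parameters here are illustrative):

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into fixed-size word chunks with overlap, so a fact that
    straddles a chunk boundary still appears whole in at least one chunk."""
    assert 0 <= overlap < size
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words("word " * 500, size=200, overlap=40)
print(len(chunks), len(chunks[0].split()))  # 3 200
```

Too small and chunks lose the context the embedder needs; too large and retrieval returns mostly irrelevant text. The overlap is what keeps boundary-straddling facts retrievable at all.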

Quantization and distillation are standard now

If you’re still serving everything in high precision because quality matters, there’s a good chance you’re burning money.

Quantization is normal production work now. FP16, INT8, and even 4-bit formats can cut memory use and inference cost sharply, especially on commodity GPUs or edge hardware. The right trade-off depends on how much quality loss you can tolerate and what kind of outputs you need, but the direction is clear. For many inference workloads, lower precision is good enough.
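The mechanics of an INT8 pass are small enough to show directly. A symmetric per-tensor sketch in NumPy; real runtimes use per-channel scales, calibration data, and dedicated 4-bit formats such as GPTQ or AWQ:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)  # 65536 262144 -- a 4x memory cut
```

The rounding error is bounded by half the scale, which is why the quality loss is usually tolerable for inference even though it would compound badly during training.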

Distillation matters for the same reason. A smaller student model trained to mimic a larger teacher is easier to serve, easier to scale, and easier to defend in a budget review. In plenty of enterprise deployments, slightly worse at a fraction of the cost wins.
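At its core, distillation is just a loss term: the student matches the teacher's softened output distribution instead of hard labels. A sketch of that KL objective, with the usual Hinton-style temperature and T-squared scaling:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard distillation recipe."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return (T ** 2) * kl.mean()

teacher = np.array([[4.0, 1.0, -2.0]])
perfect = distill_loss(teacher, teacher)  # student matches the teacher exactly
poor = distill_loss(-teacher, teacher)    # student disagrees with the teacher
```

The softened distribution is the point: the teacher's near-miss probabilities carry more signal per example than a one-hot label, which is part of why small students close the gap faster than training from scratch.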

This is where LLM decisions stop being academic. If an app handles thousands of short classification or summarization tasks per minute, shaving latency and GPU memory is worth more than squeezing out a tiny benchmark gain.

MLOps for LLMs is finally starting to look like engineering

The first wave of LLM apps treated monitoring as an afterthought. Response time, maybe a thumbs up or thumbs down, and move on. That was never enough.

LLM systems need observability that matches how they fail. Token usage matters. Prompt drift matters. Retrieval quality matters. Hallucination rate matters too, though measuring it well is harder than many vendors suggest. Log probabilities, guardrail triggers, fallback rates, human review loops, citation coverage, and task-level eval sets all belong in the picture.

IBM’s mention of token-level drift and human-in-the-loop workflows is a solid baseline. In practice, teams also need to separate model failures from system failures. Was the answer wrong because the model guessed? Because retrieval missed the right document? Because the prompt boxed the model into bad output? Because the source data was stale?

If you can’t answer that, you don’t have an LLM platform. You have a support problem.

Cloud APIs are still the fastest way to ship. Self-hosting still buys control.

That trade-off is the same. The edges are sharper now.

Closed APIs from OpenAI, Anthropic, and similar vendors are still the quickest route to production. You get top-tier models, decent tooling, and fewer infrastructure headaches. For prototypes and a lot of production apps, that’s still the right choice.

The downsides never went away. Data governance, residency, cost predictability, rate limits, model changes outside your control, and vendor lock-in all start to bite once the application matters.

That’s why open-weight models and on-prem deployments keep gaining ground. Granite, Llama-family models, and other self-hosted options appeal to teams that need tighter security, offline deployment, or more control over runtime behavior. The trade-off is operational complexity. You own the serving stack, scaling, hardware planning, and incident response.

These options are not interchangeable.

For regulated sectors, internal code assistants, or systems that touch sensitive documents, self-hosting is often worth the hassle. For customer-facing general assistants where model quality matters most and data risk is manageable, APIs still make sense.

Code generation, chatbots, and semantic search are not equal bets

Some use cases are proving sturdier than others.

Code generation remains one of the strongest because the feedback loop is tight. Generated code can be compiled, tested, linted, and reviewed. Errors are easier to catch. That makes the output far easier to govern than open-ended text generation.

Semantic search is also holding up well. Good embeddings and a decent index still solve real problems, especially in internal knowledge systems.

Chatbots are still overused and often underdesigned. Teams keep shipping broad “ask me anything” interfaces when they should be building narrower task agents with constrained tools, bounded context, and clear escalation paths. General chat feels flexible. In practice, it hides failure until a user runs into it.

The strongest LLM applications now look like composed systems: retrieval, ranking, tool use, structured output, policy checks, and a model in the middle.

Multimodal models are moving fast, but most text-only teams have bigger problems

IBM points to multimodal systems in the GPT-4o mold, and that’s fair. The convergence of text, vision, and audio is real. It matters for support systems that read screenshots, workflow tools that process documents and speech, and developer tools that reason across UI and code.

Still, many teams haven’t exhausted what they can do with text plus retrieval. Multimodal capability is promising, but it doesn’t fix weak execution. If a chatbot still can’t cite documents correctly or stay inside policy boundaries, image input won’t save it.

The same goes for adaptive inference and mixture-of-experts routing. Smart routing can cut average compute and latency. It matters. It also comes after building a product that answers the right question with the right evidence.

What deserves attention now

If you’re making LLM decisions in 2026, the priorities are fairly clear:

  • pick the smallest model that meets the task
  • use retrieval before fine-tuning
  • treat quantization as standard practice
  • build evaluation and observability into the stack from day one
  • choose API versus self-hosting based on governance and operations, not ideology
  • stop treating broad chat interfaces as product design

The industry is past the stage where calling a model feels impressive by itself. Strong teams are judged on the system around the model: cost, latency, traceability, safety, and whether the thing actually helps people do useful work.

That’s where the engineering is.
