LLM · April 22, 2025

What Sam Altman's ChatGPT joke gets right about LLM token economics

Sam Altman says “please” and “thank you” cost OpenAI millions. The joke lands because the math is real.

Sam Altman said this week that “please” and “thank you” have cost OpenAI “tens of millions of dollars” in compute. It was a joke, mostly. It was also a blunt summary of how LLM economics work.

Extra tokens cost money. At ChatGPT scale, even a little politeness adds up.

Nobody needs to start barking at chatbots. But teams building AI products should stop acting like prompt verbosity is free. Token discipline belongs in system design now, alongside latency budgets, cache hit rates, and GPU utilization.

Why the comment matters

The lazy takeaway is that people are wasting money by being polite to a bot. That misses the point. Natural language interfaces invite filler, repetition, and social ritual, and transformer models chew through all of it.

A web form doesn’t cost more because someone types “Hi there, could you help me with…” before the actual request. An LLM-backed product does. The full prompt gets processed, and the tone often pulls a longer response with it.

That matters for consumer chat. It matters even more for enterprise and developer tools, where usage is high, margins are tighter, and nobody benefits from anthropomorphic UX if inference spend doubles.

At a million prompts a day, trimming a few tokens from both request and response can move real money. Add routing, retries, tool calls, system prompts, conversation history, and verbose outputs, and the “please” problem stops sounding trivial.

Why politeness costs money

LLMs don’t treat courtesy as some cheap metadata bit. They process words as tokens, and tokens drive the work.

“Please summarize this report” adds a few input tokens compared with “summarize this report.” “Thank you” adds a few more. On a single prompt, who cares. At scale, two things matter:

  1. Inference cost rises with tokens processed. More input tokens mean more attention computation and more memory bandwidth pressure, especially with long contexts.

  2. Prompt style changes output length. Polite, conversational phrasing often nudges the model toward softer, fuller answers. That’s often where the bigger cost shows up.

That second point gets overlooked. In many production workloads, finance cares at least as much about output tokens as input tokens. Sometimes more. A friendly prompt can easily produce a friendlier, longer answer. You’re not just paying for “please.” You’re paying for the conversational mode it triggers.

There’s hidden overhead too. Users rarely send one clean sentence. Systems prepend instructions, safety rules, tool schemas, retrieval context, and chat history. A little fluff in the user message lands inside an already expensive prompt envelope.

Small token counts become real bills

You can see the basic effect with a tokenizer:

```python
from tiktoken import get_encoding

# GPT-2 BPE encoding; exact counts vary by tokenizer, but the pattern holds
enc = get_encoding("gpt2")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

base = "Summarize the quarterly report."
polite = "Please summarize the quarterly report, thank you."

print(count_tokens(base))    # fewer tokens
print(count_tokens(polite))  # more tokens
```

The exact count depends on the tokenizer. The pattern doesn’t. Polite phrasing adds tokens, and across millions of requests those tokens turn into line items.

The rough math is simple. Trim 4 input tokens and 12 output tokens from a high-volume workflow. That’s 16 tokens saved per interaction. At 10 million interactions per month, that’s 160 million tokens gone. Depending on model pricing, routing policy, and whether you run your own GPUs or pay an API provider, that ranges from noticeable to ugly.
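As a sketch, the arithmetic above can be wired into a small calculator. The per-million-token prices here are illustrative, not any provider's actual rates:

```python
def monthly_savings(
    input_tokens_saved: int,
    output_tokens_saved: int,
    interactions_per_month: int,
    input_price_per_m: float,   # assumed $ per 1M input tokens
    output_price_per_m: float,  # assumed $ per 1M output tokens
) -> float:
    """Dollars saved per month from trimming tokens on every interaction."""
    saved_input = input_tokens_saved * interactions_per_month
    saved_output = output_tokens_saved * interactions_per_month
    return (saved_input / 1e6) * input_price_per_m + (saved_output / 1e6) * output_price_per_m

# 4 input + 12 output tokens trimmed, 10 million interactions per month,
# at illustrative prices of $1.25/M input and $10/M output:
print(monthly_savings(4, 12, 10_000_000, 1.25, 10.0))  # → 1250.0
```

Output tokens dominate the total here, which matches the earlier point: the conversational mode a polite prompt triggers usually costs more than the polite words themselves.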

Raw token billing is only part of it. Longer requests and responses also hit:

  • latency
  • concurrency limits
  • cache efficiency
  • GPU memory pressure
  • power draw
  • carbon footprint

This is one of the rare cases where small optimizations really do compound.

Tone changes model behavior

There’s an awkward truth here: polite prompts often produce better answers.

Models are sensitive to phrasing. That’s not etiquette and it’s not magic. It’s training data. These systems learned from huge amounts of human text where tone correlates with intent, context, and expected response style. Ask politely and the model often infers that you want care, completeness, and structure. Be terse or abrasive and it may mirror that with a shorter answer.

Microsoft and others have said as much in guidance for assistants. Prompt engineers have known it for years. Tone is a control surface.

That leaves product teams with a real trade-off. Strip prompts down to bare commands and you may save tokens, but you can also make responses feel brittle or too compressed. For internal tooling, that may be fine. For customer support, education, coaching, or onboarding, often it isn’t.

So don’t ban politeness. Stop paying for it in the dumbest way possible.

The fix is architectural

If users want to type naturally, let them. The practical move is to separate the user experience from the model payload.

Put a preprocessing layer between the UI and the model. Keep the user’s tone in the interface, then normalize the instruction sent to the backend.

For example:

  • User types: “Hi, could you please summarize this PDF for me? Thanks.”
  • Backend sends: summarize attached_pdf in 5 bullet points

That’s the pattern to aim for.

It works especially well in products with predictable intents: summarization, extraction, classification, code review, ticket triage, SQL generation. Once you know the task, free-form politeness is usually noise from the model’s perspective.

A few practical approaches help:

Intent normalization

Map messy user phrasing to compact internal commands or structured prompts. Easy win for apps with a known task set.
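A minimal sketch of that mapping, assuming a filler list you would tune for your own traffic (the patterns and function names here are illustrative):

```python
import re

# Social filler to strip before the prompt reaches the model.
# The list is a hypothetical starting point, not exhaustive.
FILLER = re.compile(
    r"\b(hi there|hi|hello|please|thanks?( you| a lot)?|could you|can you|for me)\b[,.!]?",
    flags=re.IGNORECASE,
)

def normalize_intent(user_message: str) -> str:
    """Strip social filler, collapse whitespace, keep the task content."""
    stripped = FILLER.sub("", user_message)
    return re.sub(r"\s+", " ", stripped).strip(" ,.?!")

print(normalize_intent("Hi, could you please summarize this PDF for me? Thanks."))
# → summarize this PDF
```

The normalized string then becomes the payload, or the key into a structured prompt template, while the UI keeps the user's original phrasing on screen.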

Tone as metadata

If you want friendly output, don’t spend tokens over and over saying “be friendly and helpful.” Store tone preference as a flag and inject the smallest instruction you need, or handle some of that tone in the presentation layer.
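One way to sketch that, with hypothetical flag names and instruction strings:

```python
# Tone preferences stored once as metadata, mapped to the smallest
# instruction that produces the desired register. Flags and strings
# here are illustrative.
TONE_INSTRUCTIONS = {
    "neutral": "",
    "friendly": "Use a warm, friendly tone.",
    "formal": "Use a formal tone.",
}

def build_prompt(task: str, tone: str = "neutral") -> str:
    """Inject the tone instruction once instead of repeating it per message."""
    instruction = TONE_INSTRUCTIONS.get(tone, "")
    return f"{instruction}\n{task}".strip()

print(build_prompt("summarize attached_pdf in 5 bullet points", tone="friendly"))
```

The flag lives in user settings; the prompt only carries the few tokens the model actually needs.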

Output constraints

Verbose prompts often produce verbose answers. Set a response format: 3 bullets, JSON only, under 80 words. This often saves more than shaving prompt text.
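A minimal sketch of pairing a format instruction with a hard cap. The field names are modeled on common chat APIs but are illustrative, not tied to a specific provider:

```python
# Pair a format instruction with a hard ceiling on output tokens, so a
# verbose prompt cannot produce a verbose answer. Field names are
# illustrative, not a specific provider's API.
def constrained_request(task: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": "Answer in at most 3 bullet points. No preamble."},
            {"role": "user", "content": task},
        ],
        "max_tokens": 120,  # hard cap on output spend
    }

req = constrained_request("summarize the quarterly report")
```

The instruction shapes the answer; the cap guarantees the bill.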

Conversation compaction

Most real waste doesn’t come from “please.” It comes from dragging full chat histories forward. Summarize prior turns, drop stale context, and stop resending tool results the model no longer needs.
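A compaction pass can be sketched like this; the summary step is a crude stand-in for what would in practice be a cheap summarization call or rolling summary:

```python
# Keep the most recent turns verbatim, collapse older history into one
# summary message. The truncation below is a placeholder for a real
# summarizer.
def compact_history(history: list[dict], keep_last: int = 4) -> list[dict]:
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = "Summary of earlier turns: " + "; ".join(
        m["content"][:40] for m in older  # stand-in for a summarization call
    )
    return [{"role": "system", "content": summary}] + recent
```

Every turn you avoid resending is saved on every subsequent request, so the payoff grows with conversation length.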

Caching and deduplication

If lots of users ask the same thing in slightly different language, normalize and cache aggressively. Natural language variation kills cache hit rates unless you clean it up first.
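A sketch of normalize-before-hash caching; the `normalize` step here is deliberately simplistic and illustrative:

```python
import hashlib

_cache: dict[str, str] = {}

def normalize(text: str) -> str:
    """Crude normalization so phrasing variants map to one cache key."""
    return " ".join(text.lower().replace("please", "").split())

def cached_answer(prompt: str, compute) -> str:
    """Only pay for inference on a cache miss."""
    key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = compute(prompt)
    return _cache[key]
```

With this shape, "Summarize the report" and "Please summarize the report" hash to the same key and trigger one model call instead of two.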

Where this matters most

For a one-off consumer chat session, this isn’t worth obsessing over. OpenAI will survive your manners.

For production systems, it depends on the workload.

Internal copilots and developer tools

These are good candidates for terse prompts and strict output schemas. Engineers care about speed and correctness. A code review bot doesn’t need to flatter anyone.

Customer-facing assistants

Warmer language often matters here. But a pleasant UI and a bloated backend prompt are different things. Keep the payload compact and let the frontend carry some of the social tone.

Agentic systems

This is where token waste gets nasty. Multi-step agents call models repeatedly, carry state, invoke tools, and generate planning text. Prompt sprawl multiplies across the execution graph. If you’re building agents and not auditing token flow step by step, you’re guessing.

On-prem and self-hosted inference

The economics shift, but the performance problem doesn’t. You may not see per-token API charges, but longer prompts still cut throughput and raise infrastructure needs. That means more GPUs, more queueing, or worse latency.

The sustainability point is real

AI carbon accounting gets fuzzy fast, and companies are happy to keep it vague. Still, the basic point holds. Wasted tokens mean wasted compute, and wasted compute means wasted energy.

No single “thank you” is cooking the grid. But at internet scale, habitual prompt bloat is exactly the kind of hidden inefficiency that accumulates across large systems. Engineers already care about image sizes, SQL queries, bundle weight, and cache headers for the same reason. LLM traffic deserves the same treatment.

What teams should do

Start with measurement.

Track average input and output tokens by task, not just by model. Find the workflows where language overhead is pushing up cost or latency. Then decide where natural conversation is worth paying for and where it isn’t.

A few questions usually surface the problem fast:

  • Which user flows have the highest token-to-value ratio?
  • How much of the prompt is task content versus wrapper text?
  • Are long outputs helping users, or just sounding helpful?
  • Can the UI stay conversational while the backend prompt gets compressed?
  • Are you paying to resend context the model no longer needs?

Altman’s comment landed because it exposed something the AI industry often tries to blur. These systems feel soft and human on the surface. Underneath, the economics are mechanical.

Every word gets billed. Some words earn their keep. Plenty don’t.
