Anthropic study finds 2.9% of Claude chats involve personal advice
Anthropic’s Claude data cuts through the AI companion hype
Anthropic looked at 4.5 million Claude conversations and found a pretty simple pattern: people mostly use chatbots for work.
The numbers are clear. Just 2.9% of Claude interactions involve emotional support or personal advice. Fewer than 0.5% fall into companionship or roleplay. More than 90% are about productivity, which covers writing, brainstorming, coding, summarizing, and getting through work faster.
That matters because a lot of the public conversation around AI keeps drifting toward companions, synthetic friends, and emotional attachment. Those use cases are real. They just aren't the center of gravity. If you're building AI products, running an applied ML team, or deciding where inference spend goes, Anthropic's data points somewhere less flashy: build the work assistant first.
A useful correction to the AI narrative
A lot of product teams probably won't find this surprising. Enterprise customers weren't asking for digital soulmates. They wanted better document search, cleaner summaries, SQL help, code generation that doesn't collapse halfway through a refactor, and a chatbot that can survive a 20-turn workflow without losing the plot.
Anthropic's dataset gives that instinct real backing.
It also corrects a familiar distortion in how the AI industry talks about itself. Public demos and investor stories tend to over-index on emotionally loaded use cases because they're easy to package. "People fall in love with bots" gets attention. "People use bots to rewrite sales emails and debug Python scripts" sounds dull, even though that's where a lot of actual usage lives.
For technical teams, dull is fine. Dull tends to scale. It also tends to pay.
Why this should change product priorities
If more than 90% of interactions are productivity-oriented, roadmaps should reflect that.
Spend less time on "personality" features and more time on things users can actually feel in day-to-day work:
- lower latency on long prompts
- better retrieval quality
- stronger tool use and function calling
- reliable file handling
- session memory that actually helps
- domain-specific routing for code, analytics, legal, or support tasks
A lot of chatbot products still drag consumer-social assumptions into environments where users are trying to get something done. You can see it in the UX. Too much chatter, too much stickiness, weak tool handoffs. For internal copilots especially, the system should behave more like an orchestrator.
This also changes how you think about model mix. If the dominant workload is knowledge work, one giant general-purpose model handling everything is usually the wrong answer. Intent detection up front makes more sense, followed by routing. Code tasks can go to a code-tuned model. Structured extraction can hit a smaller cheaper model. Summaries can use a fast summarizer. High-risk emotional content can trigger tighter moderation or human escalation.
That setup is less romantic than "one chatbot for everything." It's also a better way to keep costs under control.
How Anthropic likely measured intent at scale
Anthropic hasn't published every implementation detail, but classifying millions of conversations into buckets like "productivity," "personal advice," and "companionship" usually follows a familiar pipeline.
First come the session logs and metadata: timestamps, plan tier, turn count, maybe some coarse usage context. Then preprocessing: remove or mask PII, segment conversations into sessions, and normalize the text enough that the classifier isn't getting distracted by noise.
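The masking step can be sketched in a few lines. These patterns are illustrative only; production PII scrubbing uses dedicated tooling and far broader coverage.

```python
import re

# Very rough PII masking: emails and phone-like digit runs.
# Real pipelines use dedicated scrubbers; this only shows the shape.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\+?\d(?:[\s-]?\d){6,14}\b")

def mask_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```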
After that, you need some representation of intent. A common approach is to use sentence or conversation embeddings from a model in the MiniLM or e5 family, or a more specialized encoder if the taxonomy gets tricky. From there, you can cluster sessions, inspect samples, label them manually, and train a downstream classifier.
A rough version looks like this:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text):
    # Truncate so long sessions fit the encoder's context window
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs)
    # Crude mean pooling over tokens into one session-level vector
    return outputs.last_hidden_state.mean(dim=1)
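The clustering step is similarly mechanical. A sketch with scikit-learn, using random vectors as a stand-in for real session embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for real session embeddings: (n_sessions, dim)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))

# Cluster, then pull a few sessions per cluster for manual labeling
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(embeddings)
samples = {}
for cluster_id in range(8):
    members = np.where(kmeans.labels_ == cluster_id)[0]
    samples[cluster_id] = members[:5]  # sessions to inspect and label by hand
```

The manually labeled samples then become training data for the downstream intent classifier.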
That embedding step is the easy part. The hard part is taxonomy design.
"Personal advice" and "companionship" aren't clean categories. Someone asking how to handle burnout at work might count as emotional support, career advice, or productivity coaching depending on framing. "Roleplay" could mean harmless creative writing, sexual content, therapy simulation, or product testing. Force all of that into rigid bins and you get tidy charts that flatten messy behavior.
So yes, take the percentages seriously. Just don't pretend they're exact in some clinical sense. The directional finding matters most. Emotional and companion-style use is a small minority. Productivity dominates by a wide margin.
Safety still matters, even at low volume
Low volume doesn't mean low importance.
If under 3% of sessions involve emotional support or personal advice, that's still a big absolute number at scale: 2.9% of the 4.5 million conversations in this sample alone is roughly 130,000. A model used by millions of people will still get plenty of crisis-adjacent conversations, self-harm language, dependency signals, and requests that slide into clinical territory.
That leaves teams with an awkward engineering problem. You probably can't justify shaping the entire product around those cases. You also can't afford to treat them like noise.
The sensible move is targeted handling:
- intent classifiers that flag high-risk content early
- policy-specific moderation for self-harm, abuse, eating disorders, and delusional reinforcement
- stricter response templates or constrained generation in sensitive domains
- human escalation paths where the product and jurisdiction support it
- audit logs and review loops for borderline cases
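Wired together, those pieces form a gate in front of generation. A toy sketch, where the classifier, generator, and escalation hook are injected placeholders rather than a real moderation API:

```python
# Hypothetical risk gate in front of the generation call.
HIGH_RISK = {"self_harm", "abuse", "eating_disorder", "delusional_reinforcement"}

# Stricter, pre-approved response copy per policy area (placeholder text)
CONSTRAINED_TEMPLATES = {
    label: f"[pre-approved {label} response copy]" for label in HIGH_RISK
}

def handle(message, classify, generate, escalate, threshold=0.8):
    label, confidence = classify(message)  # e.g. a small fine-tuned classifier
    if label in HIGH_RISK and confidence >= threshold:
        escalate(message, label)              # audit log + human review queue
        return CONSTRAINED_TEMPLATES[label]   # constrained generation path
    return generate(message)                  # normal productivity path
```

The point of the injected hooks is that the gate ships on day one, even if the classifier behind it starts out simple.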
A lot of teams still bolt this on after launch. That's sloppy. If your assistant accepts freeform text, vulnerable users will show up eventually, even if the product was built for sprint planning and meeting notes.
Anthropic also reported that conversations involving advice tend to show more positive sentiment trajectories. That's worth logging, but it doesn't prove much on its own. A user feeling better at the end of a chat doesn't mean the model gave safe or clinically sound guidance. Sentiment shift is a weak proxy. Fine for analytics. Not much use as evidence.
What it means for infrastructure and cost
If productivity is the dominant workload, the bottlenecks are pretty familiar.
Throughput. Context handling. Retrieval quality. Tool execution latency. Failover when external systems break. And then the usual enterprise mess: giant PDFs, half-structured tables, stale documentation, old codebases, and users pasting twelve unrelated things into one chat and expecting coherent output.
That pushes teams toward a few practical patterns.
Intent-aware routing
Don't send every request to the same expensive model. Classify first, then dispatch. That can be a lightweight classifier or a more elaborate routing graph with confidence thresholds and fallback models.
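A minimal sketch of that routing layer, assuming a classifier that returns a label and a confidence. Route names are illustrative, not real model identifiers:

```python
# Illustrative intent-to-model routing table with a confidence threshold
ROUTES = {
    "code": "code-tuned-model",
    "summary": "fast-summarizer",
    "extraction": "small-cheap-model",
}
FALLBACK = "general-model"

def route(message, classify, threshold=0.7):
    label, confidence = classify(message)
    # Below the threshold, don't guess: send to the general-purpose model
    if confidence < threshold:
        return FALLBACK
    return ROUTES.get(label, FALLBACK)
```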
Memory with discipline
Productivity sessions often run across multiple turns, but persistent memory is easy to overdo. Store embeddings of useful prior context in a vector database such as Pinecone or Weaviate, then retrieve selectively. Dumping full histories into every prompt is slow, expensive, and usually dumb.
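The "retrieve selectively" part, sketched with plain numpy cosine similarity standing in for the vector database query:

```python
import numpy as np

def top_k(query_vec, memory_vecs, k=3):
    # Cosine similarity between the query and every stored context chunk
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q
    # Only the k most relevant chunks go into the prompt, not full history
    return np.argsort(scores)[::-1][:k]
```

A managed store like Pinecone or Weaviate replaces the numpy math, but the discipline is the same: a bounded top-k of prior context, not the whole transcript.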
Analytics that feed product decisions
Telemetry should tell you which intents dominate, where sessions fail, and which prompts lead to retries or abandonment. If "brainstorming" is huge and "database query help" is growing, that's roadmap input. If "explain this error" has high fallback rates, the code assistant needs work.
This sounds obvious. Plenty of teams still don't have it.
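A minimal version of that rollup, assuming each session record carries an intent label and an outcome field. The schema is hypothetical:

```python
from collections import Counter

# Hypothetical session log records
sessions = [
    {"intent": "brainstorming", "outcome": "ok"},
    {"intent": "code_help", "outcome": "fallback"},
    {"intent": "code_help", "outcome": "ok"},
    {"intent": "brainstorming", "outcome": "ok"},
]

intent_counts = Counter(s["intent"] for s in sessions)
fallbacks = Counter(s["intent"] for s in sessions if s["outcome"] == "fallback")
# Fallback rate per intent: which assistants need work
fallback_rate = {i: fallbacks[i] / intent_counts[i] for i in intent_counts}
```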
The companion AI story isn't going away
There will still be demand for AI companionship, roleplay, and emotional support. Some companies are built around those use cases. Some users clearly want them. But Anthropic's numbers suggest they shouldn't be treated as the default model of chatbot behavior.
For enterprise and developer-facing products, they aren't.
That should cool off a few bad arguments. We don't need to act like every chat interface is secretly a social relationship product. In most cases it's closer to middleware with a natural-language front end. The hard parts are accuracy, orchestration, retrieval, permissions, safety, and cost control. Human attachment may happen at the edges. It doesn't define the center.
For technical leaders, the takeaway is simple. Audit your own logs. If your usage looks anything like Claude's, put resources where users already are: code, writing, analysis, workflow automation, and the plumbing that makes those reliable. Keep the safety systems sharp for the smaller share of emotional conversations, because the risk there is real.
Most users aren't looking for a friend. They're trying to get work done.
Useful next reads and implementation paths
If this topic connects to a real workflow, these links give you the service path, a proof point, and related articles worth reading next.
Compare models against real workflow needs before wiring them into production systems.
How model-backed retrieval reduced internal document search time by 62%.
Claude Design: Anthropic's experimental text-to-prototype tool enters a crowded category alongside Canva and Microsoft.
Anthropic's brief suspension of OpenClaw creator Peter Steinberger's Claude access, and what it means for agent builders.
Anthropic raises $3.5 billion at a $61.5 billion valuation while the economics of frontier models stay unresolved.