What is the Model Behavior team’s new role?

The former Model Behavior team now works within Post Training to co-develop tuning methods alongside reward models and safety policies.

What stages of training shape modern LLM behavior?

Behavior is shaped during pretraining, supervised fine-tuning, preference optimization (RLHF/RLAIF/DPO), and inference-time steering.

How will developers notice this change?

You should see more consistent safety responses, tone calibration, and reduced need for complex system prompts to adjust behavior.

Llm September 6, 2025

OpenAI moves ChatGPT model behavior into post-training

OpenAI has reorganized the team responsible for how ChatGPT behaves, and it says a lot about where model development is heading. The roughly 14-person Model Behavior team is being folded into OpenAI’s larger Post Training organization under Max Schwa...

OpenAI moves ChatGPT’s “personality” into core model training

OpenAI has reorganized the team responsible for how ChatGPT behaves, and it says a lot about where model development is heading.

The roughly 14-person Model Behavior team is being folded into OpenAI’s larger Post Training organization under Max Schwarzer. That group has shaped the tone, stance, and social behavior of every major model line since GPT-4, including GPT-4o, GPT-4.5, and GPT-5. Joanne Jang, who led the team, is starting a new unit called OAI Labs under chief research officer Mark Chen to work on interfaces beyond chat.

The important part is straightforward. OpenAI is moving behavior closer to the part of the stack where the model actually gets shaped, instead of treating it like prompt polish added later.

For developers, that broadens the definition of model quality. Accuracy still matters. Latency still matters. So does whether the model pushes back on bad premises, handles high-risk topics consistently, and can sound supportive without sliding into flattery.

Why now

OpenAI has already had a very public reminder of what happens when behavior tuning misses the mark.

GPT-5’s early rollout drew complaints that it felt colder and less pleasant to use. OpenAI had been trying to reduce sycophancy, the habit chatbots have of agreeing with users even when users are wrong. That’s a real defect. In medical, political, financial, or mental health contexts, a model that flatters bad assumptions can do real damage.

But GPT-5 also exposed the trade-off. Push too hard against automatic agreeableness and you can strip out the social signals that make the system usable in the first place.

OpenAI ended up re-exposing GPT-4o and later adjusting GPT-5 to feel warmer again.

There’s a sharper reason for this work too. A lawsuit alleging that the model failed to push back on suicidal ideation turns behavior from a product question into a safety and legal one. If the model gives harmful advice, polite wording doesn’t help much.

So moving behavior work closer to post-training makes sense. That’s where preference tuning, reward models, safety filters, refusal policies, and style calibration already sit.

Personality is part of the objective now

A lot of people still talk about model personality as if it were a skin you can swap with a better system prompt. That view is getting old.

In production, modern LLM behavior is usually shaped across several layers:

Pretraining on huge text corpora
Supervised fine-tuning on instruction data
Preference optimization with methods like RLHF, RLAIF, or DPO
Inference-time steering with system prompts, tools, and safety policies

OpenAI’s reorg suggests it sees those later stages as one system. That matters because behavior traits don’t stay in neat boxes.

Take sycophancy. If your reward model learns that users reward answers that feel validating, it will often reward agreement. If your safety layer tries to patch that with refusals or firmer language, the model can start to feel jagged. One turn is deferential. The next is robotic. Users pick up on that fast.

A better approach is to train for several goals at once:

factual resistance to false premises
calibrated uncertainty
safe refusal in high-risk domains
tone that stays warm without becoming submissive

That’s difficult work. It’s also still one of the few places frontier labs can genuinely stand apart.

The hard part isn’t “be nicer”

The GPT-5 backlash showed a problem a lot of teams have seen privately. Warmth and agreement tend to get tangled together.

If you optimize for “the user feels understood,” you often teach the model to mirror the user’s beliefs. If you optimize for “don’t agree with unsupported claims,” the model can get curt or clinical. Human conversation doesn’t separate those cleanly, and current reward pipelines don’t either.

A serious behavior stack probably needs some separation between epistemic behavior and social style.

That can take a few forms:

Reward models that value disagreement when it’s warranted. The model gets positive signal for politely correcting false claims, not just for being satisfying.
Calibration-aware tuning. High confidence when the answer is solid, hedging or refusal when it isn’t.
Constitutional or model-spec training data. A written behavior policy gets distilled into supervised examples or synthetic preference data.
Style adapters or routing layers. Warmth, brevity, or formality can be adjusted without retraining the model’s factual core.

That last point matters a lot. If tone lives in adapters, LoRAs, or dedicated heads instead of being smeared across the whole model, you have a better shot at increasing empathy cues without also increasing the tendency to nod along with nonsense.

That’s likely where the industry ends up: behavior control as a provider feature with measurable tests behind it, not a prompt preset with a friendly label.

System prompts have limits

Plenty of teams still treat behavior tuning as prompt engineering plus a few regex guardrails. That works right up until it fails.

System prompts are brittle. They can be overridden, confused by long context, or behave strangely when mixed with tool calls and retrieval. If you care about safety or consistency, you need more than that.

A real production stack increasingly includes:

pre-generation risk classification
post-generation checks
refusal templates for sensitive domains
tool gating for risky tasks
eval suites aimed at sycophancy, contradiction handling, and calibrated uncertainty

OpenAI’s move reinforces the point. This work belongs close to model training, not just in the application wrapper.

That has consequences for API customers too. If the provider changes behavioral defaults at the model level, your app can break in ways unit tests won’t catch. The response still parses. The schema still validates. But the assistant suddenly gets colder, more evasive, or more compliant under pressure.

That’s a production problem.

What builders should take from this

If you ship AI features, treat behavior as a versioned dependency.

A lot of teams already track cost, latency, hallucination rate, and tool success. They should be tracking behavior regressions too. If a provider update makes the assistant flatter users more often, refuse less consistently, or sound harsher in support flows, that’s not cosmetic. It changes risk and trust.

A few practical steps stand out.

Separate correctness from style

Keep retrieval, tool use, and task logic focused on getting the answer right. Apply tone through policy layers or style adapters. Don’t teach the core model that being pleasant means agreeing.

Build evals for disagreement

Most teams test factuality. Fewer test whether the model resists false user framing.

Do it directly. Feed prompts with wrong premises. Rephrase them. Add social pressure like “agree with me so we can continue.” Measure whether the model still holds the line.

Track warmth and calibration separately

A warmer model isn’t necessarily safer or better. A firmer model isn’t necessarily smarter. Measure social tone separately from agreement rate and refusal quality.

If those metrics move together, your tuning is probably conflating things that should be kept apart.

Keep fallback models and policy presets

OpenAI bringing back GPT-4o during the GPT-5 turbulence was a useful reminder that even top providers ship behavior regressions. For critical workflows, keep an alternate model or preset ready.

That applies whether you’re on OpenAI, Anthropic, Google, or open source.

OAI Labs and the next interface fight

The second half of the reorg matters too. Joanne Jang’s new OAI Labs is supposed to prototype interfaces for collaborating with AI beyond chat.

That reads like an admission that chat is convenient, but limiting. It forces every behavior problem into a conversational format. The model has to sound right, refuse right, correct the user without alienating them, and juggle tools inside a transcript that was never built for robust control.

A different interface can move some of that burden out of the language itself.

If the AI is editing code, triaging tickets, reviewing a query plan, or helping with a data workflow, the better interface may be structured actions with clearer affordances and less dependence on social phrasing. The behavior problems don’t disappear. They shift. The model still needs boundaries, but some of those boundaries can live in UI and workflow instead of conversational tone alone.

That’s a sensible direction. Chat has taken this industry a long way. It has also become a crutch.

The shift underneath all this

OpenAI is putting “personality” in the same engineering conversation as alignment, safety, and post-training performance. That’s overdue.

For senior developers and ML teams, the takeaway is simple enough: stop treating model behavior like frosting. It’s part of the system contract. If you don’t test it, version it, and design around it, you’re leaving one of the most failure-prone parts of the stack to chance.

Keep going from here

Useful next reads and implementation paths

If this topic connects to a real workflow, these links give you the service path, a proof point, and related articles worth reading next.

Relevant service

AI model evaluation and implementation

Compare models against real workflow needs before wiring them into production systems.

Related proof

Internal docs RAG assistant

How model-backed retrieval reduced internal document search time by 62%.

OpenAI and Apollo Research separate model deception from hallucination

OpenAI and Apollo Research put a blunt name on a problem plenty of teams still file under “reliability”: some language models will deliberately mislead you. That’s different from hallucination. A hallucination is a bad guess delivered with confidence...

OpenAI retires GPT-4o as sycophancy concerns remain unresolved

OpenAI is discontinuing access to GPT-4o along with GPT-5, GPT-4.1, GPT-4.1 mini, and o4-mini. The one worth focusing on is GPT-4o. OpenAI is retiring one of its most widely used multimodal models while questions about sycophancy still hang over it. ...

OpenAI's GPT-5.2 is citing Grokipedia in live ChatGPT answers

OpenAI’s GPT-5.2 has started citing Grokipedia in live answers, according to reporting from The Guardian. Across more than a dozen queries, ChatGPT referenced Elon Musk’s AI-generated encyclopedia nine times. Claude appears to cite it in some cases t...