Anthropic keeps rewriting its coding interview as Claude learns to solve it
Anthropic has a hiring problem that won’t stay confined to Anthropic: its take-home technical screen got good enough for Claude to blow through it.
TechCrunch reports that Anthropic engineer Tristan Hume said the company’s performance optimization team has had to keep revising a bespoke interview test it started using in 2024, because each new Claude release gets better at solving it. Claude Opus 4 reportedly beat most human applicants. Claude Opus 4.5 matched the best ones.
That tells you something pretty simple. A static coding test, done remotely on a fixed clock, stops being very useful once a frontier model can turn out top-tier work on demand.
Teams still handing out take-homes and grading the final submission as the main artifact have a problem.
Why this kind of test gets eaten first
Anthropic’s original screen focused on systems performance, which sounds specialized enough to hold up. It didn’t.
That tracks. Performance work often has a clear target, a narrow code surface, and fast feedback loops. Cut latency. Reduce allocations. Improve throughput. Stay inside some constraint. Current code models are good at exactly that kind of task because they can generate plausible optimizations quickly and iterate against a benchmark.
They also have absurd pattern coverage. A strong model has effectively absorbed years of public discussion around microbenchmarks, memory layout tricks, concurrency trade-offs, cache-aware code, Python and C++ tuning folklore, pytest and Makefile workflows, benchmark harnesses, and the usual optimization moves a senior engineer would try early.
It doesn’t need to memorize one exact answer. Broad familiarity is often enough.
If your test fits a known problem class, the model starts with a big advantage.
Output matters less than process now
A lot of engineering managers still want the neat artifact. The polished answer. The code they can grade later.
That artifact is worth less than it used to be.
A candidate who turns in an excellent take-home may still be excellent. They may also be decent and very good at steering Claude. For some roles, that’s acceptable. For systems work, infra, incident-heavy backend jobs, or anything that involves debugging under pressure, it’s a weak signal.
The useful signal has shifted toward process:
- how someone forms hypotheses
- how they measure before and after
- how they investigate strange behavior
- how they explain trade-offs
- how they recover when an optimization breaks correctness
- how they work once the task stops looking familiar
That’s harder to fake because it plays out over time. You get traces of actual thinking: profiling runs, instrumentation, dead ends, commit history, questions asked before they’re too late to matter.
A final answer without that context is thin evidence.
What an AI-resistant test looks like
Hume’s team reportedly moved away from hardware-centric optimization and toward a newer task where current models do worse. That makes sense, with one big warning attached.
Novelty only helps if it forces the candidate to build a mental model. If "novel" just means obscure or poorly documented, you’ve built a worse interview.
The stronger versions tend to share a few traits.
The task changes midstream
Static prompts are easy for models. Dynamic tasks are harder.
A candidate starts optimizing for throughput, then halfway through gets a new requirement around p99.9 latency under noisy conditions. Or a dependency starts acting strangely. Or the data distribution changes. Now you’re testing reasoning under changing constraints, not just code generation.
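To make that conflict concrete, here is a toy Python simulation. Everything in it is invented for illustration: the batching model, the overhead numbers, and the noise distribution. The point is only that the change that wins on throughput is exactly the one that blows a p99.9 budget once noise shows up.

```python
import random
import statistics

def simulate(batch_size: int, n_requests: int = 50_000) -> tuple[float, float]:
    """Toy service: batching amortizes a fixed per-call overhead (good for throughput),
    but every request in a batch waits for the whole batch (bad for tail latency)."""
    overhead = 0.005      # fixed cost per backend call, seconds (made up)
    per_item = 0.001      # work per request, seconds (made up)
    latencies, clock = [], 0.0
    for _ in range(n_requests // batch_size):
        # Rare slow calls stand in for the "noisy conditions" in the new requirement.
        noise = random.expovariate(50) if random.random() < 0.01 else 0.0
        batch_time = overhead + per_item * batch_size + noise
        clock += batch_time
        latencies.extend([batch_time] * batch_size)
    throughput = len(latencies) / clock
    p999 = statistics.quantiles(latencies, n=1000)[-1]
    return throughput, p999

for batch in (1, 8, 64):
    thr, tail = simulate(batch)
    print(f"batch={batch:3d}  throughput={thr:7.0f} req/s  p99.9={tail * 1000:6.1f} ms")
```

Run it and the bigger batches win on requests per second while the tail gets steadily worse, which is the kind of trade-off the midstream requirement forces the candidate to reason about out loud.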
The environment does some work
Give people a sandbox with unfamiliar behavior, odd bottlenecks, or injected faults. Log calls, timing, retries, and resource use. Ask them to explain what the system is doing and why.
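A minimal sketch of what "the environment does some work" can look like: every call into a deliberately flaky dependency gets its attempts, timing, and failures recorded, so the review conversation can start from a real trace instead of the final diff. The names here (flaky_lookup, TRACE) are hypothetical, not any particular sandbox.

```python
import functools
import json
import random
import time

TRACE: list[dict] = []   # in a real sandbox this would stream to a log the reviewers keep

def traced(max_retries: int = 2):
    """Record every attempt, its timing, and its outcome for a sandbox call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                start = time.perf_counter()
                try:
                    result = fn(*args, **kwargs)
                    TRACE.append({"call": fn.__name__, "attempt": attempt, "ok": True,
                                  "ms": round((time.perf_counter() - start) * 1000, 2)})
                    return result
                except Exception as exc:
                    TRACE.append({"call": fn.__name__, "attempt": attempt, "ok": False,
                                  "ms": round((time.perf_counter() - start) * 1000, 2),
                                  "error": repr(exc)})
                    if attempt == max_retries:
                        raise
        return wrapper
    return decorator

@traced()
def flaky_lookup(key: str) -> str:
    """Stand-in for the sandbox's intentionally odd dependency."""
    if random.random() < 0.4:
        raise TimeoutError("upstream stalled")
    return f"value-for-{key}"

try:
    flaky_lookup("user:42")
except TimeoutError:
    pass
print(json.dumps(TRACE, indent=2))
```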
That’s closer to actual senior engineering work anyway. A lot of the job is figuring out why a system is lying to you at 3 p.m. and falling over by 5.
Measurement has to be defended
If the task uses a benchmark, candidates should have to justify the benchmark. If they change the measurement setup, that should be visible and reviewable. A model can suggest loop optimizations all day. It’s weaker when the assignment is to prove that the claimed improvement is real.
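As a sketch of the difference, the snippet below (baseline and candidate are throwaway placeholders) shows why a single timing run is easy to wave around and hard to defend, while a correctness check plus repeated runs with a median and spread is something a candidate can actually argue from.

```python
import statistics
import timeit

DATA = list(range(10_000))

def baseline(data):
    out = []
    for x in data:
        out.append(x * x)
    return out

def candidate(data):
    return [x * x for x in data]

# Correctness first: a faster wrong answer isn't an improvement.
assert candidate(DATA) == baseline(DATA)

# A single run can land anywhere inside the noise band and "show" a win either way.
print("single run:",
      timeit.timeit(lambda: baseline(DATA), number=50),
      timeit.timeit(lambda: candidate(DATA), number=50))

# Median plus spread over many repeats is a measurement someone can defend.
for name, fn in (("baseline", baseline), ("candidate", candidate)):
    runs = timeit.repeat(lambda: fn(DATA), number=50, repeat=30)
    print(f"{name:9s} median={statistics.median(runs):.4f}s  stdev={statistics.stdev(runs):.4f}s")
```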
There can’t be a canned answer online
Once a challenge starts looking like a common benchmark or a public interview problem, its shelf life gets short. Public examples turn into gists, blog posts, GitHub repos, and eventually training data. After that, the test degrades fast.
If your interview problem looks like something that would end up in a Medium post, assume a frontier model already knows the genre.
The ugly trade-off: fair hiring versus reliable evaluation
The obvious fix is proctoring. Live screens. Locked-down environments. No external tools. Maybe recorded sessions.
That helps, but it comes with real costs.
Proctored interviews are harder on candidates with caregiving responsibilities, shaky connectivity, anxiety, or language-processing differences. Long live sessions can also reward comfort under observation instead of engineering judgment. Plenty of strong engineers are bad at thinking out loud while someone watches them fumble a semicolon.
So most companies are going to land on mixed models.
One sensible approach is to stop pretending AI use can be banned everywhere and test both modes on purpose:
- a segment where AI tools are allowed and the candidate has to use them well
- a segment where they’re restricted and the candidate has to reason directly
- a review discussion where the interviewer digs into choices, trade-offs, and mistakes
That gives you a better sense of how someone works in 2026, not how they worked in 2019.
Using AI well is already part of the job in plenty of software roles. So is catching when the model is confidently wrong, shallow, or optimizing the wrong thing.
Why performance tests went first
There’s a deeper technical point here. Performance tasks are unusually vulnerable because they compress neatly into machine-friendly loops.
You can tell a model: reduce memory allocations in this function, preserve behavior, and hit this benchmark target. That’s a bounded search problem with fast feedback. Models are very good at that shape of work.
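Here is a hedged illustration of that bounded loop, with made-up functions: confirm behavior is preserved, then check the allocation claim directly with tracemalloc rather than trusting wall time alone. This is the shape of task a model can grind through quickly.

```python
import timeit
import tracemalloc

def normalize(rows):
    """Original: builds several throwaway lists per call."""
    trimmed = [r.strip() for r in rows]
    lowered = [r.lower() for r in trimmed]
    return [r for r in lowered if r]

def normalize_fast(rows):
    """Candidate optimization: one pass, no intermediate lists."""
    return [s for r in rows if (s := r.strip().lower())]

ROWS = ["  Foo ", "BAR", "  ", "Baz  "] * 5_000

# 1. Preserve behavior.
assert normalize(ROWS) == normalize_fast(ROWS)

# 2. Check the allocation claim itself, not just wall time.
for fn in (normalize, normalize_fast):
    tracemalloc.start()
    fn(ROWS)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    secs = timeit.timeit(lambda: fn(ROWS), number=20)
    print(f"{fn.__name__:15s} peak={peak / 1024:7.1f} KiB  20 runs={secs:.3f}s")
```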
Tasks with vague requirements, hidden coupling, bad observability, or user-facing ambiguity are messier. They depend on asking good questions, noticing that the benchmark is flawed, or seeing that a local optimization hurts the system elsewhere.
That’s why interview design is drifting toward interaction-heavy systems exercises. Those tasks still preserve some signal.
A remote take-home where the whole challenge is "produce good code" has become easy to outsource to a model. A task where the challenge is "understand a weird system and defend your decisions with evidence" still has some bite.
What teams should change now
If you’re hiring senior engineers, a few moves make sense right away.
First, retire any public or long-lived take-home that depends mostly on final output quality. If candidates can paste it into Claude and get a high-end answer, the test is finished.
Second, build dynamic variants. Change data distributions, constraints, or failure modes per candidate. Treat interview tasks like software that needs maintenance, rotation, and access control.
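One way to do that, sketched here with invented parameter names: derive each candidate's variant from a stable seed, so the task stays comparable across applicants but is never identical and never matches whatever version leaks online.

```python
import hashlib
import random

def variant_for(candidate_id: str) -> dict:
    """Per-candidate task variant from a stable seed. The knobs below are
    illustrative, not a real interview spec."""
    seed = int(hashlib.sha256(candidate_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return {
        "data_distribution": rng.choice(["uniform", "zipfian", "bursty"]),
        "latency_budget_ms": rng.choice([50, 100, 250]),
        "injected_fault": rng.choice(["slow_dns", "lock_contention", "disk_stalls", None]),
        "dataset_seed": rng.randrange(10_000),
    }

print(variant_for("candidate-0017"))
```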
Third, instrument the environment. Git history, test runs, profiler usage, and notes can be useful if you review them with restraint and a clear policy. You’re trying to see engineering behavior, not build a surveillance machine.
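A rough sketch of that kind of instrumentation, assuming the candidate works in a git repo with a pytest suite; the paths, what gets captured, and how long it is retained are policy decisions for the hiring team, not anything this snippet decides.

```python
import json
import subprocess
from datetime import datetime, timezone

def snapshot(workdir: str) -> dict:
    """Capture a small, reviewable trace of how the candidate worked:
    commit history plus the latest test run."""
    commits = subprocess.run(
        ["git", "-C", workdir, "log", "--oneline", "--reverse"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", workdir],
        capture_output=True, text=True,
    )
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "commits": commits,
        "tests_passed": tests.returncode == 0,
        "test_output_tail": tests.stdout.splitlines()[-5:],
    }

print(json.dumps(snapshot("."), indent=2))
```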
Fourth, ask for short technical narratives. Why did latency drop? What changed in allocation behavior? What trade-off did you accept? Strong engineers can usually explain results without hiding behind jargon.
Fifth, pick an AI policy and state it plainly. If AI is allowed, say so and grade accordingly. If it isn’t, control the environment enough to make the rule real.
Anthropic’s problem is everybody’s problem
There’s some irony in Anthropic having to redesign its interview because Claude keeps wrecking it. Still, this is useful honesty. A lot of companies are dealing with the same issue and just aren’t saying it out loud.
The important part isn’t that Claude is good at coding tests. That was predictable.
The important part is that a company building frontier models is showing, in practice, that the standard remote coding screen no longer measures what many employers think it measures.
A lot of hiring loops still assume a world where assistance is limited and answers are scarce. That world is gone. If your interview process hasn’t changed, it’s grading the wrong thing.