LLM · November 28, 2025

Anthropic's Claude Opus 4.5 adds Chrome and Excel, clears 80% on SWE-Bench

Anthropic’s Opus 4.5 pushes Claude deeper into real desktop work

Anthropic has released Opus 4.5, its new top-end Claude model, with two additions that matter more than the usual benchmark dump: Chrome integration and Excel integration. It’s also the first model to clear 80% on SWE-Bench Verified, which is a real coding milestone because that test measures end-to-end repository work, not polished prompt demos.

Put those together and Anthropic’s direction is pretty clear. Claude is being pushed toward actual task execution: read the issue, patch the code, call tools, check results, then move into a browser or spreadsheet when the work leaves the IDE.

That’s the part teams using coding agents will care about.

The benchmark worth paying attention to

Most model launch charts don’t tell you much about real use. SWE-Bench Verified does.

Crossing 80% matters because the benchmark asks the model to work like an engineer inside a real repo. It has to understand the issue, inspect the code, produce a patch, and pass the project’s tests. That’s far closer to useful automation than one-shot code generation.
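The pass criterion can be sketched as a toy harness: a candidate patch counts only if it applies cleanly and the project's checks still pass afterwards. This is an illustrative stand-in, not the actual SWE-Bench scaffolding (real runs apply git diffs to a checked-out repo and execute its test suite):

```python
def apply_patch(source: str, old: str, new: str):
    """Apply a minimal search-and-replace 'patch'. Returns None when the
    target snippet isn't found -- a failed apply, which the benchmark
    counts as a failure outright."""
    if old not in source:
        return None
    return source.replace(old, new, 1)

def run_checks(source: str, checks) -> bool:
    """Stand-in for the project's test suite: predicates over the patched source."""
    return all(check(source) for check in checks)

def evaluate(source: str, old: str, new: str, checks) -> bool:
    """A task is resolved only if the patch applies AND the checks pass."""
    patched = apply_patch(source, old, new)
    return patched is not None and run_checks(patched, checks)
```

The point of the sketch is the two-stage gate: a model can't get credit for a plausible-looking diff that doesn't apply, or for one that applies but breaks the tests.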

Anthropic says Opus 4.5 also scores well on:

  • Terminal-bench for shell work
  • tau2-bench and MCP Atlas for tool use
  • ARC-AGI 2 and GPQA Diamond for reasoning

Some of that is standard launch packaging. Some of it matters. Terminal-bench, for one, is a decent signal for whether a model can survive actual engineering work, where commands fail, paths are wrong, and outputs have to be interpreted instead of hand-waved away. If those results hold up, Opus 4.5 should be better at recovering from failure instead of bluffing past it.

That’s the useful question. Can it debug?

Chrome and Excel are the practical part

The more interesting move here is product integration.

Anthropic is rolling out:

  • Claude for Chrome for Max users
  • Claude for Excel for Max, Team, and Enterprise users

That matters because a lot of work still happens outside the coding environment.

Chrome is the obvious example. Internal tools, vendor dashboards, docs, support consoles, analytics UIs, random admin panels. Even engineering work ends up there. If Claude can reliably read pages, pull structured data, summarize docs, validate links, and manage multi-tab workflows, it gets useful in a much wider slice of day-to-day work.

The risk is familiar. Browser agents are brittle. Pages change. Selectors break. Permissions get messy. A model that can click around your browser can also do dumb things very quickly if the controls are loose. Anthropic seems to know that. The hard part isn’t shipping Chrome support. It’s whether admins can fence it in with domain allowlists, sandboxed downloads, and audit logs.
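A domain allowlist is the simplest of those fences. A minimal sketch, assuming the agent routes every navigation request through a policy check before acting (the domain names are hypothetical, and a production policy would also handle ports and redirects):

```python
from urllib.parse import urlparse

# Hypothetical allowlist an admin would maintain.
ALLOWED_DOMAINS = {"docs.internal.example.com", "dashboard.example.com"}

def is_navigation_allowed(url: str) -> bool:
    """Gate a browser-agent navigation request against the allowlist.

    Hostname matching is exact here: subdomains not explicitly listed
    are refused, and only HTTPS is permitted.
    """
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    return parsed.hostname in ALLOWED_DOMAINS
```

The design choice worth copying is that the check sits outside the model: the agent proposes a URL, the policy layer decides, and a refusal is just another tool result the model has to work around.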

Excel may be even more grounded. Finance, ops, rev teams, supply chain, BI, and a lot of engineering-adjacent work still run through spreadsheets. If Claude can propose Power Query transformations, generate VBA, explain formulas, and check spreadsheet logic against sample sheets, that’s useful immediately.

It also needs boring safeguards. Spreadsheet work looks safe right up until it isn’t. Silent errors are common, and models are very good at sounding confident while being wrong in exactly the ways that make spreadsheets dangerous. If Anthropic wants this to land in production, it needs change logs, sensitivity tagging, and macro execution blocked by default.
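A change log is the least glamorous of those safeguards and the easiest to sketch. This is an illustrative in-memory version; a real integration would hook Excel's own edit events rather than model the sheet itself:

```python
from datetime import datetime, timezone

class AuditedSheet:
    """Tiny in-memory sheet that records every cell change before applying it.

    A stand-in for the change-log discipline spreadsheet agents need:
    old value, new value, actor, and timestamp for each edit.
    """
    def __init__(self):
        self.cells = {}
        self.log = []

    def set_cell(self, ref: str, value, actor: str = "model"):
        old = self.cells.get(ref)
        # Append the audit record first, so even a crash mid-edit leaves a trail.
        self.log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "cell": ref,
            "old": old,
            "new": value,
        })
        self.cells[ref] = value
```

The useful property is reversibility: because each record keeps the old value, any run of model edits can be reviewed or rolled back cell by cell.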

Memory is the harder problem

Another notable part of Opus 4.5 is memory.

Anthropic says the model includes memory upgrades for long-context work and an endless chat feature for paid users. When a thread gets close to the context limit, Claude compresses earlier conversation so the session can keep going.

That sounds straightforward. It isn’t.

Anyone who has used LLMs on a large codebase has seen the failure mode. The first stretch goes well, then the model starts dropping constraints, editing the wrong module, or reintroducing a bug it already fixed. A larger context window helps, but the real problem is deciding what survives when a conversation gets long.

Anthropic’s head of product management for research, Dianne Na Penn, put it plainly:

“Knowing the right details to remember is really important in complement to just having a longer context window.”

That’s the right emphasis. Selective retention matters more than big token-count claims. If the compression layer keeps decisions, identifiers, constraints, and file paths while discarding noise, long-running sessions get more stable. If it misses those details, “endless chat” becomes elegant amnesia.

Developers should treat this compression as lossy and plan accordingly. Keep important facts in a scratchpad, explicit memory note, or tool output the model can revisit. Endless chat is convenient. It’s not persistence.
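One way to do that is a small scratchpad the agent re-reads every turn, stored outside the conversation entirely. A minimal sketch (the file name and keys are illustrative):

```python
import json
from pathlib import Path

class Scratchpad:
    """Durable facts an agent re-reads each turn, independent of any
    chat-history compression. Survives across sessions via a JSON file."""
    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key: str, value):
        # Write through to disk on every update so nothing depends on the session.
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def as_prompt_block(self) -> str:
        """Render pinned facts for injection at the top of each request."""
        lines = [f"- {k}: {v}" for k, v in sorted(self.facts.items())]
        return "Pinned facts:\n" + "\n".join(lines)
```

Injecting the rendered block into every request costs a few tokens per turn, but it means constraints like the target module or blocked actions never depend on the compression layer guessing right.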

MCP still matters

Anthropic is also still leaning hard on Model Context Protocol, or MCP, as the connective layer for tool use.

That matters because agent systems are still full of throwaway glue code. Every team ends up building some variation of the same wrappers for file access, test runs, browser actions, data fetches, and approval flows. MCP is Anthropic’s attempt to standardize that contract so models can discover tools, understand schemas, and call them across environments in a reusable way.

For engineers, the appeal is obvious: fewer custom integrations, more portable agent setups, cleaner tool definitions. A tool like this is straightforward for a model to reason about:

{
  "tool_name": "run_tests",
  "description": "Execute repo test suite and return pass/fail summary",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": { "type": "string" },
      "cmd": { "type": "string", "default": "pytest -q" }
    },
    "required": ["path"]
  }
}
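Part of what makes a contract like that useful is that the executor can check a model's arguments against it before anything runs. A hand-rolled sketch covering just required keys and basic types, not a full JSON Schema validator:

```python
def validate_args(schema: dict, args: dict) -> list:
    """Check a tool call's arguments against an input_schema like the one
    above. Deliberately a subset of JSON Schema: required keys, known
    keys, and primitive types only."""
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    type_map = {"string": str, "object": dict, "number": (int, float)}
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, type_map.get(spec.get("type"), object)):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Rejecting a malformed call with a readable error list also gives the model something concrete to correct on its next attempt, which is where schema-driven tool use earns its keep.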

That alone doesn’t fix reliability. Models still need planning, retries, backtracking, and guardrails. But standardizing tool contracts is one of the few parts of the current agent stack that looks like actual engineering progress instead of benchmark theater.

There’s a strategic angle too. If MCP sticks, Anthropic gets influence over part of the interface layer between models and external systems. That’s a good position to be in.

Where Opus 4.5 fits

Opus 4.5 lands in a crowded top tier, up against OpenAI’s GPT-5.1 and Google’s Gemini 3. Capability is close enough now that most buying decisions won’t turn on one benchmark.

They’ll come down to four things:

  • Tool reliability
  • Latency under multi-step workloads
  • Governance and auditability
  • How well the model fits the tools people already use

That last point may be where Anthropic made the smartest call. Browser and spreadsheet work aren’t flashy, but they’re where companies actually lose time. A model embedded in Chrome and Excel has a much clearer route to daily use than another abstract reasoning gain.

There’s also less room for sloppiness. A bad answer in chat is annoying. A bad browser action or spreadsheet transformation can do real damage.

Before this goes into production

If you’re evaluating Opus 4.5 for coding agents or workflow automation, the basics still apply.

Keep the planner separate from the executor. Make the model produce a short plan before it touches files or tools. Use strict tool schemas. Whitelist paths. Restrict shell commands. Log every action, argument, result, and timestamp.
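That logging requirement is easy to sketch: wrap every dispatch so a structured record is written whether the call succeeds or fails (the `dispatch` mapping and log path here are illustrative):

```python
import json
import time

def run_tool(tool: str, args: dict, dispatch: dict, log_path: str = "agent_actions.jsonl"):
    """Execute one tool call and append an audit record to a JSONL log.

    `dispatch` maps tool names to callables; a real executor would add
    allowlists and argument validation before reaching this point.
    """
    record = {"ts": time.time(), "tool": tool, "args": args}
    try:
        record["result"] = dispatch[tool](**args)
        record["ok"] = True
    except Exception as exc:
        # Unknown tools and tool failures are logged, never swallowed.
        record["result"] = repr(exc)
        record["ok"] = False
    with open(log_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Append-only JSONL is a deliberate choice: each line is an independent record, so the log stays parseable even if a run dies mid-write, and it greps cleanly during an incident review.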

For long sessions, persist key facts outside chat history. Project root, target module, acceptance tests, blocked actions, and known bad approaches should live in explicit memory slots or scratchpad state. Don’t assume the compression layer will preserve them.

For Chrome use, add domain allowlists, disable form autofill, and sandbox downloads. For Excel, tag sensitive sheets, require review for formula-wide changes, and keep macro execution off unless there’s a clear approval path.

Treat the model like a distributed system component with weird failure modes. Because that’s what it is.

The short version

Opus 4.5 looks like a serious release.

The SWE-Bench Verified result is strong. The memory work targets a real weakness in agentic coding. Chrome and Excel support show Anthropic understands where automation has to go if Claude is going to become part of everyday workflows instead of living as a side-panel assistant.

The limits are obvious too. Browser actions are fragile. Spreadsheet mistakes are expensive. “Endless chat” still depends on lossy compression. And no benchmark score removes the need for guardrails.

Still, this is a clear step away from chat-only AI and toward systems that actually do work. Anthropic’s version of that shift looks practical, not flashy. That’s probably the right choice.
