August 14, 2025

Anthropic pushes Claude Sonnet 4 to 1 million tokens, and that changes the shape of coding agents

Anthropic has expanded Claude Sonnet 4 to a 1,000,000-token context window for API users, available through Amazon Bedrock and Google Cloud Vertex AI. That’s up from 200,000 tokens. On paper, it puts Sonnet 4 ahead of OpenAI GPT-5’s 400,000-token window, though still behind Google Gemini 2.5 Pro’s 2 million and far short of Meta’s 10 million-token Llama 4 Scout headline figure.

The raw number only tells you so much. But a million tokens is enough to change how teams can use these systems for code. You can drop in a monorepo, architecture docs, CI logs, and a long history of prior edits without constantly swapping pieces in and out. For agent-driven coding workflows, that’s a meaningful shift.

Anthropic is charging for it. Past 200,000 tokens, pricing rises to $6 per million input tokens and $22.50 per million output tokens, up from $3 input and $15 output at lower context sizes. You can hand Claude a giant codebase. You probably don’t want to do it casually.
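
To put numbers on that, here's a back-of-the-envelope cost helper in Python using the rates above. One assumption worth flagging: it bills the whole request at the premium rate once input crosses 200,000 tokens, which is how tiered long-context pricing is usually described; verify against Anthropic's billing docs before trusting the exact figure.

```python
# Back-of-the-envelope cost for a Sonnet 4 call, using the rates above.
# ASSUMPTION: the premium rate applies to the whole request once input
# exceeds 200k tokens -- check Anthropic's billing docs.

LONG_CONTEXT_THRESHOLD = 200_000  # input tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single API call."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = 6.00, 22.50   # $ per million tokens
    else:
        input_rate, output_rate = 3.00, 15.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

print(f"${request_cost(800_000, 8_000):.2f}")  # $4.98, before any retries
```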

Why developers should care

For chat, long context is mostly convenience. For code agents, it starts to look like infrastructure.

Current coding tools keep hitting the same limit. They can work through a few files well enough, but large repo-wide changes still fall apart. The model loses track of assumptions, fixes one subsystem and breaks another, forgets a design constraint buried in a spec, or misses a migration script sitting a few directories over.

A 1M-token window raises the ceiling on repo-scale coherence. Instead of constantly juggling retrieval chunks and summaries, the model can keep more of the real source material in play in one session:

  • repo structure
  • core interfaces
  • design docs
  • recent diffs
  • failing tests
  • stack traces
  • deployment notes
  • previous tool outputs

That cuts down on context thrash. In practice, it should mean fewer brittle fixes and fewer edits built on stale assumptions.

If you’re building internal coding agents, that’s the part worth watching. Bigger context helps with breadth and continuity at the same time.
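
To make that concrete, here's a minimal working-set builder: labeled sections packed in priority order under a hard token budget. The file paths and the four-characters-per-token heuristic are placeholders, not any particular tool's API; a real agent would use the provider's tokenizer.

```python
# Minimal working-set builder: pack priority-ordered sections under a
# hard token budget. Paths and the token heuristic are illustrative.

TOKEN_BUDGET = 900_000  # leave headroom under 1M for output and tool results

def count_tokens(text: str) -> int:
    return len(text) // 4  # rough ~4 chars/token heuristic; swap in a real tokenizer

def build_context(sections: list[tuple[str, str]], budget: int = TOKEN_BUDGET) -> str:
    """Concatenate labeled sections in priority order until the budget is spent."""
    parts, used = [], 0
    for label, text in sections:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # everything below this priority gets fetched later via tools
        parts.append(f"## {label}\n{text}")
        used += cost
    return "\n\n".join(parts)

sections = [  # highest priority first; paths are hypothetical
    ("Repo structure", open("docs/repo_map.md").read()),
    ("Core interfaces", open("src/interfaces.py").read()),
    ("Recent diffs", open("reports/recent_diffs.patch").read()),
    ("Failing tests", open("reports/failing_tests.txt").read()),
]
prompt_context = build_context(sections)
```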

The real question is effective context

Anthropic is framing this around "effective context," which is the right term.

Anybody can publish a giant token limit. The harder part is getting the model to actually use information spread across that window, especially material in the middle. Long-context models still have a habit of dropping details once prompts get huge. “Lost in the middle” remains a real issue.

So the million-token number matters only if the model can keep the right details salient, pull the right signals from the noise, and avoid getting distracted by junk. That comes down to model design and inference tricks Anthropic hasn’t fully disclosed, but the usual set of techniques includes:

  • sparse or windowed attention
  • memory compression
  • extended positional encoding such as RoPE scaling or ALiBi-style methods
  • grouped-query or multi-query attention to reduce KV overhead
  • chunked processing with carried state
  • aggressive KV cache management

You do not get to million-token prompts in production with vanilla transformer attention and hope for the best. The compute and memory bill would be rough.
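
Some rough arithmetic shows why. KV-cache size grows linearly with sequence length, and at a million tokens it dominates memory. The model dimensions below are hypothetical; Anthropic doesn't publish Sonnet 4's architecture.

```python
# Rough KV-cache arithmetic for a 1M-token prompt. All dimensions are
# hypothetical -- Anthropic does not publish Sonnet 4's architecture.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each hold n_kv_heads * head_dim values per layer per token (fp16).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GiB = 1024 ** 3
seq = 1_000_000

# Vanilla multi-head attention: every attention head keeps its own K/V.
mha = kv_cache_bytes(seq, n_layers=64, n_kv_heads=64, head_dim=128)
# Grouped-query attention: many query heads share 8 K/V heads.
gqa = kv_cache_bytes(seq, n_layers=64, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha / GiB:,.0f} GiB")  # ~1,953 GiB -- per request
print(f"GQA: {gqa / GiB:,.0f} GiB")  # ~244 GiB -- an 8x saving
```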

That’s also why raw context rankings are getting less useful. Gemini 2.5 Pro may advertise 2 million tokens. Llama 4 Scout may advertise 10 million. If task completion drops, latency spikes, or the model starts missing central details, the number loses a lot of its shine.

For enterprise teams, the useful metric is simpler: Can the model finish a large coding task correctly without constant babysitting? That’s harder to benchmark, but it’s the question that matters.
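
In the meantime, a cheap proxy is easy to run against any API: a needle-in-a-haystack-style probe that plants a known fact at varying depths in a long synthetic prompt and checks recall. In the sketch below, `call_model` stands in for your API client; the filler and the planted fact are arbitrary.

```python
# Needle-in-a-haystack probe: bury a fact at varying depths, check recall.
# call_model is a stand-in for your API client.

FILLER = "def helper_{0}(x):\n    return x + {0}\n"
NEEDLE = "# NOTE: the deploy password is 'quartz-941'\n"

def needle_test(call_model, total_tokens=500_000, depths=(0.1, 0.5, 0.9)):
    n_blocks = total_tokens // 12  # ~12 tokens per filler block, roughly
    results = {}
    for depth in depths:
        blocks = [FILLER.format(i) for i in range(n_blocks)]
        blocks.insert(int(n_blocks * depth), NEEDLE)  # bury the fact
        prompt = "".join(blocks) + "\nWhat is the deploy password?"
        results[depth] = "quartz-941" in call_model(prompt)
    return results  # depth -> did the model recall the needle?
```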

Anthropic is pushing deeper into coding

This didn’t arrive on its own. Anthropic recently updated Claude Opus 4.1 with a coding focus, and now Sonnet 4 gets a major context jump. The direction is pretty obvious.

Coding is where usage gets sticky, expensive, and tied to real operational work. It’s also where OpenAI, Cursor, GitHub Copilot, Windsurf, and others are competing hard. OpenAI has benchmark weight and distribution. Cursor has become a default environment for plenty of engineers. Anthropic seems to be pressing on an area where Claude already has a decent reputation: code reasoning and long-running tasks.

A million-token Sonnet fits that strategy. It gives Anthropic a concrete pitch for API buyers and platform teams building internal developer tools. And the rollout through Bedrock and Vertex AI matters. Big companies want governance, private networking, auditability, and fewer procurement headaches.

That part isn’t glamorous. It’s how enterprise adoption usually works.

The trade-off is cost and latency

Long context looks great until you see the bill.

At the new rate, an 800,000-token input runs close to five dollars before any output shows up. If the model produces long responses, tool plans, or patch explanations, output costs climb quickly. That may be acceptable for high-value engineering work. For routine autocomplete-style tasks, it’s wasteful.

Latency is the other issue. Even with optimized attention schemes, million-token prompts are heavy. If your workflow depends on quick interactive cycles, giant context can make the product feel sluggish. That’s especially true for coding agents that repeatedly read files, run tests, inspect logs, and try again.

The better pattern is probably progressive loading, not dumping the whole repo in every time.

Start with:

  • a repo map
  • architecture notes
  • key interfaces
  • top failing tests
  • the handful of files most likely involved

Then pull in more material through tools when you need it. That keeps cost down and trims noise. It also gives the model a better chance of staying focused.
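
In code, that pattern is a small loop: seed the context, let the model request more files through a tool, and stop at a fixed budget. The tool schema and the `model_call` interface below are stand-ins, not any specific SDK's API.

```python
# Progressive loading as a small agent loop. Schema and interfaces are
# illustrative, not a specific SDK's API.

SEED = ["docs/repo_map.md", "docs/architecture.md",
        "src/interfaces.py", "reports/failing_tests.txt"]

READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Return the contents of a repository file.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

def run_task(task: str, model_call, read_file) -> str:
    context = {path: read_file(path) for path in SEED}
    for _ in range(20):  # hard cap on tool round-trips
        reply = model_call(task=task, files=context, tools=[READ_FILE_TOOL])
        if reply.get("tool") != "read_file":
            return reply["text"]          # model answered without more input
        path = reply["args"]["path"]
        context[path] = read_file(path)   # grow the working set on demand
    return "tool budget exhausted"
```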

Teams that treat a 1M-token window as permission to stop curating inputs are going to spend more money for worse results.

RAG still matters

This update doesn’t make retrieval-augmented generation obsolete. It changes its job.

With smaller windows, RAG often acts as a strict gatekeeper. You retrieve a few chunks and hope they were the right ones. With a million tokens available, retrieval can assemble a broader working set instead of forcing a narrow bet.

That matters for codebases, where relevant context is often scattered across entry points, interfaces, test fixtures, config files, and migration scripts. A vector search index or static analysis pass can still identify candidate files. The difference is that you can now include more of them directly, along with the surrounding context that makes them intelligible.

So RAG still matters. It becomes more of a front-end filter for a much larger prompt budget.
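
A minimal sketch of that shift, assuming a generic vector index with a `search` method: over-retrieve candidates, then include whole files until a large budget is spent. `index.search`, `read_file`, and `tokens` are placeholders for whatever retrieval stack you already run.

```python
# RAG as a front-end filter for a big prompt budget: over-retrieve, then
# include whole files until the budget runs out.

def assemble_working_set(query: str, index, read_file, tokens,
                         budget: int = 700_000) -> list[str]:
    included, used = [], 0
    for hit in index.search(query, top_k=200):  # deep list, not a top-5 bet
        text = read_file(hit.path)              # whole file, context intact
        cost = tokens(text)
        if used + cost > budget:
            continue  # skip files that don't fit; smaller ones may still
        included.append(text)
        used += cost
    return included
```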

That fits software engineering better anyway. Code tasks usually fail for one of two reasons: the model didn’t have enough context, or it didn’t have the right context. Bigger windows help with the first. Good retrieval still matters for the second.

Bigger prompts widen the security problem

There’s an obvious downside to whole-repo prompts. You’re sending a lot more sensitive material through the model path.

That can include secrets in config files, internal architecture docs, customer-specific business logic, compliance notes, and stray PII buried in logs. In agent workflows, the risk gets worse because tool outputs are fed back into context. One bad artifact can contaminate the next step.

Security teams should assume that larger prompts increase both data leakage risk and the prompt injection surface.

A few guardrails matter; a minimal sketch of the first two follows the list:

  • run DLP checks before content enters the prompt
  • strip secrets and credentials from source, logs, and environment dumps
  • treat third-party docs, READMEs, and comments as untrusted input
  • use stable references to objects where cloud tooling supports it instead of re-sending raw content
  • keep access controls aligned with repo boundaries, not just model endpoints
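
Here's a minimal regex-based scrub covering those first two items. It's illustrative only: real DLP adds entropy checks, a dedicated scanner, and allow-lists, and these patterns are nowhere near exhaustive.

```python
# Minimal secret scrub before content enters the prompt. Illustrative
# patterns only -- pair with a real DLP scanner in production.

import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                   # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"), # PEM private keys
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]

def scrub(text: str) -> str:
    """Replace likely credentials with a placeholder before prompting."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```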

This is another reason the Bedrock and Vertex rollout matters. Those platforms give enterprises a more familiar control plane for audit logs, policy enforcement, and network isolation. That won’t fix prompt injection. It does make the deployment story easier to defend internally.

What teams should do with it

If you run an AI platform team or you’re building internal coding agents, the next step is straightforward: retest your architecture assumptions.

A lot of current systems are built around 100k to 200k token ceilings. They summarize aggressively, trim history hard, and lean heavily on retrieval to stay under budget. Some of that scaffolding may now be too conservative, especially for long-running refactors and debugging sessions.

Don’t overcorrect.

Use the extra context where it actually pays off:

  • multi-file refactors
  • repo onboarding
  • debugging with lots of logs and traces
  • migration planning
  • keeping agent state coherent across longer runs

Keep prompts structured. Leave headroom for tool output. Pin non-negotiable instructions in more than one place. Ask the model to cite file paths and identifiers so you can check whether it’s grounding correctly.
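
One shape that follows those rules, with the wording as a placeholder rather than a recommended prompt:

```python
# A prompt shape with pinned rules at both ends and a grounding
# requirement. Wording is illustrative.

PINNED = ("Rules: do not modify public interfaces. Cite the file path "
          "and symbol name for every change you propose.")

def build_prompt(task: str, context: str) -> str:
    return "\n\n".join([
        PINNED,                   # pin non-negotiables up front...
        "## Context\n" + context,
        "## Task\n" + task,
        PINNED,                   # ...and repeat them at the end
    ])
```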

And watch the economics. A giant context window is useful when it cuts failed iterations and human cleanup. If it just makes prompts fatter, you’re paying for theater.

Anthropic has made Claude Sonnet 4 more capable in a way that should matter to real engineering teams. The headline is the million-token window. The harder question is whether teams use that headroom to build better coding systems, or just pricier ones.
