Bartz v. Anthropic: Court Backs Fair Use for LLM Training on Books
A federal court just gave AI training a fair use win, with a big catch
A U.S. federal judge has handed AI companies their strongest court win yet on training data. In Bartz v. Anthropic, Judge William Alsup ruled that Anthropic’s use of published books to train large language models can qualify as fair use under U.S. copyright law.
That matters right away. Pretraining depends on huge text corpora, and a lot of that text has been sitting in a legal gray zone. This ruling says the act of training on copyrighted books, at least on these facts, can be lawful because it’s transformative.
The decision is narrower than the headline. Alsup also split out a separate issue for trial: Anthropic’s reported “central library” of pirated books. If that library was assembled or stored unlawfully, Anthropic could still face statutory damages there even if model training itself gets fair use protection.
For anyone building models, data pipelines, or internal AI products, that distinction matters.
Why this ruling matters
This is the first clear U.S. district court endorsement of a position the AI industry has been pushing for two years: training a model on copyrighted text is different from copying that text for people to read.
The court accepted the basic technical logic. A pretrained LLM doesn’t keep books on a shelf and hand them back on request. It digests tokens, updates weights, and learns statistical relationships between words, syntax, style, and semantics. That abstraction step sits at the center of the fair use argument.
That’s not some edge-case theory. It’s how modern foundation models are built.
A typical pretraining pipeline takes raw text, tokenizes it, chunks it, filters obvious junk, deduplicates it, and feeds it through a huge optimization loop. The end product is a parameterized model, not a searchable archive of the source corpus. In plain English, the system learns from the books. It doesn’t republish the books.
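A toy version of that pipeline can be sketched in a few lines. This is illustrative only, not any lab's actual code: naive whitespace splitting stands in for a real subword tokenizer, and the filter and dedup rules are deliberately simplistic.

```python
import hashlib


def preprocess(docs, chunk_size=512, min_len=20):
    """Toy pretraining text pipeline: filter junk, deduplicate, chunk.

    Real pipelines use subword tokenizers (e.g. BPE) and fuzzy dedup;
    this sketch uses whitespace tokens and exact content hashes.
    """
    seen = set()
    chunks = []
    for doc in docs:
        # Filter obvious junk: documents too short to be useful.
        if len(doc.split()) < min_len:
            continue
        # Deduplicate on a content hash of the normalized text.
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # "Tokenize" naively and split into fixed-size training chunks.
        tokens = doc.split()
        for i in range(0, len(tokens), chunk_size):
            chunks.append(tokens[i:i + chunk_size])
    return chunks
```

The output is a list of token chunks ready for a training loop; notably, nothing in it is organized for retrieval by title or author, which is the structural point the fair use argument leans on.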
Courts have seen versions of that logic before. Search indexing cases leaned on it. So did parts of the Google Books litigation. What’s new is a judge applying it directly to LLM training at industrial scale.
The technical argument the court accepted
Anthropic’s defense rests on the idea that model training converts source text into internal representations.
That sounds abstract, but the mechanics are simple enough:
- Raw text gets turned into tokens
- Tokens are used to predict neighboring tokens
- The model updates weights to reduce prediction error
- Across billions of examples, it learns patterns rather than storing pages intact
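The loop above can be shown at microscopic scale. The sketch below is a hypothetical bigram next-token model trained with SGD on cross-entropy, which has the same objective shape as LLM pretraining, just without neural layers or scale. Every name here is illustrative.

```python
import math


def train_bigram(tokens, vocab, epochs=50, lr=0.5):
    """Toy next-token model: logits[a][b] scores token b following a.

    Each step computes a softmax over possible next tokens and nudges
    the weights to reduce prediction error. What is stored afterward
    is a table of learned statistics, not the training text itself.
    """
    logits = {a: {b: 0.0 for b in vocab} for a in vocab}
    for _ in range(epochs):
        for a, b in zip(tokens, tokens[1:]):
            # Softmax over candidate next tokens given context `a`.
            exps = {t: math.exp(logits[a][t]) for t in vocab}
            z = sum(exps.values())
            # Cross-entropy gradient: predicted prob minus one-hot target.
            for t in vocab:
                grad = exps[t] / z - (1.0 if t == b else 0.0)
                logits[a][t] -= lr * grad
    return logits
```

After training, the weights encode "b tends to follow a" as a number, which is the abstraction step the court's reasoning turns on.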
The distinction has limits, and critics are right to press on them. LLMs can memorize. They can regurgitate passages, especially from repeated or distinctive training examples. That’s been shown often enough that nobody serious disputes it.
Still, memorization risk and the legality of pretraining are separate questions. Alsup appears to have accepted that the primary function of training is analytical and generative, not archival.
For engineers, the practical point is pretty obvious. The fair use case gets stronger when your data handling matches the story you’re telling in court.
If your pipeline keeps complete copyrighted works in durable internal stores, loses provenance, and ships models that can spit back long passages verbatim, you’re giving plaintiffs better facts. If your pipeline deduplicates aggressively, strips boilerplate, tracks source provenance, and tests for memorization, you’re in a better position.
That’s not compliance theater. It’s basic operational discipline.
Don’t gloss over the pirated library issue
The ruling does not bless sloppy data collection.
The separate trial over Anthropic’s alleged stash of pirated books may end up being just as important as the fair use finding. Courts can treat the use of content and the acquisition or retention of content as different acts. A company could win on training and still lose on how the corpus was assembled.
That creates an awkward but very real compliance model for AI firms:
- The model training step may be defensible
- The ingestion path may still be risky
- The internal archive may be riskier than either
A lot of teams are still weak here. They know how to build scalable ETL for text. They’re much worse at proving where each document came from, under what license, when it was fetched, and whether it was later removed.
Data provenance is boring until it becomes the whole case.
A serious training stack now needs metadata at the document level, not just object storage and vague assurances. At minimum: source_url, acquisition method, fetch date, claimed rights status, jurisdiction, and whether the text survives in raw form after preprocessing.
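One way to pin those fields down is a per-document record like the following. The schema is a hypothetical sketch, not a standard; field names and the enumerated example values are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DocumentRecord:
    """Document-level provenance record (illustrative schema only)."""
    doc_id: str
    source_url: str
    acquisition_method: str        # e.g. "crawl", "license", "upload"
    fetch_date: date
    rights_status: str             # e.g. "public_domain", "licensed", "disputed"
    jurisdiction: str              # e.g. "US"
    raw_text_retained: bool        # does a raw copy survive preprocessing?
    takedown_date: Optional[date] = None  # set when removal was requested
```

Freezing the dataclass is a small design choice with a point: provenance records should be append-only facts, corrected by writing a new record rather than mutating history.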
Tools like DVC, LakeFS, Git LFS, or custom lineage systems can help, but versioning alone doesn’t solve the legal problem. The hard part is policy enforcement. Can you prove that takedown requests flow through the pipeline? Can you rebuild a training run with the exact corpus snapshot used? Can you show that a disputed book was excluded from later checkpoints?
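Those three questions reduce to operations over a training-run manifest. Assuming your lineage system can produce the set of content hashes actually fed to a run (an assumption about your tooling, not a given), the checks are short:

```python
import hashlib


def corpus_fingerprint(doc_hashes):
    """Deterministic fingerprint of a corpus snapshot, so a training
    run can be tied to the exact document set it consumed."""
    h = hashlib.sha256()
    for d in sorted(doc_hashes):   # order-independent
        h.update(d.encode())
    return h.hexdigest()


def verify_exclusion(manifest, disputed_hashes):
    """Check that a run's manifest excludes every disputed document.

    Returns (ok, leaked): `leaked` names any disputed hashes that
    made it into the run, which is exactly what you'd need to show
    a court, or your own counsel.
    """
    leaked = manifest & disputed_hashes
    return len(leaked) == 0, leaked
```

The hard part is not this code; it is guaranteeing the manifest is complete and immutable, which is where versioning tools and access controls earn their keep.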
Most orgs can’t. Not cleanly.
What changes for model builders
For frontier labs, the ruling gives useful cover. For smaller AI companies, it may matter even more.
Big labs can absorb litigation costs and cut licensing deals when it suits them. Startups usually can’t. A district court ruling that says book-based pretraining can be fair use lowers one barrier to entry, at least in the U.S. It doesn’t erase risk, but it makes the legal posture less one-sided.
Still, nobody should read this as permission to ingest everything they can find.
A better approach looks like this:
Keep the corpus mixed
Public domain text, permissive licenses, web text with documented provenance, customer-owned data, and selectively licensed material still make sense together. Fair use helps. Redundancy helps more. If part of your dataset gets challenged later, you don’t want the whole run hanging on it.
Audit for memorization
If your model can emit long copyrighted passages on benchmark prompts, you’ve got a product problem and probably a litigation problem. Memorization testing should sit next to evals for toxicity, hallucinations, and jailbreak resistance.
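A crude but useful primitive for that audit is measuring how much of a model completion is copied verbatim from a source text. The sketch below uses a character-level longest-common-substring check; the 50-character threshold is an arbitrary assumption, and real audits work at the token level with tuned thresholds.

```python
def verbatim_overlap(completion, source, min_run=50):
    """Return the longest run of characters the completion copies
    verbatim from the source, and whether it crosses `min_run`.

    Simple O(n*m) dynamic programming; fine for spot checks on
    individual prompts, too slow for corpus-scale sweeps.
    """
    n, m = len(completion), len(source)
    best = 0
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if completion[i - 1] == source[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best, best >= min_run
```

Wired into a release gate, a check like this turns "does the model regurgitate?" from an argument into a number you track per checkpoint.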
Separate raw storage from training artifacts
This case puts a spotlight on the “library,” not just the model. Teams should revisit retention policies for raw text. Do you need to keep every source document indefinitely? Often, no. If you do, write down why.
Treat fine-tuning differently
Pretraining has the strongest fair use argument because it’s broad, statistical, and non-expressive in intent. Fine-tuning on narrow corpora can look different, especially if it pushes a model toward emulating a living author’s style or reproducing specific source material. Same infrastructure, different legal posture.
The weak spots in the ruling
This is a district court ruling, not a final national standard. Other judges can disagree. Appeals can reshape the reasoning. Fair use is famously fact-specific, which is lawyer language for: don’t generalize too far.
There’s also a practical issue the AI sector keeps trying to sidestep. Even if training is fair use, trust with creators is still broken.
Publishers and authors aren’t going to stop pushing for licensing markets because one judge accepted the transformative-use argument. If anything, this may speed up direct licensing deals for premium corpora, especially in domains where quality and freshness matter more than sheer scale. Legal permission and business incentive aren’t the same thing.
Some of the technical coping ideas floating around are weak, too. Metadata scrubbing doesn’t solve copyright. Sentence shuffling or light paraphrase before training doesn’t sanitize a source corpus. Differential privacy may reduce memorization, but it comes with utility costs and isn’t standard practice at frontier scale. Engineers should be wary of compliance theater dressed up as ML technique.
What technical leads should do now
If you own a training pipeline or sign off on AI procurement, this ruling should trigger a review in three places.
First, ingestion. Know what enters the corpus, how it got there, and what rights posture you’re assuming.
Second, retention. Decide what raw material needs to be stored, for how long, and under what access controls.
Third, output risk. Test whether the model reproduces source text, and build mitigation into evals and release gates.
If your team is training on third-party corpora assembled by vendors, press them on provenance. “Commercially usable” is not documentation. Ask for lineage, exclusions, takedown handling, and whether disputed material remains in archive storage.
The court gave AI companies a stronger fair use argument. It did not give anyone a pass for bad data governance.
That’s the useful read on this ruling. Training got a legal win. Data acquisition, retention, and model behavior are still very much live issues.