LLM · December 19, 2025


Adobe faces proposed class action over SlimLM training on pirated books

Adobe’s SlimLM lawsuit puts the dirty secret of “open” training data back on the table

Adobe is facing a proposed class-action lawsuit over how it trained SlimLM, its compact language model for on-device document assistance. The complaint, filed on behalf of Oregon author Elizabeth Lyon, says Adobe used pirated copies of books during pretraining, with that material coming through SlimPajama-627B, a large open dataset released by Cerebras in 2023.

The legal issue is obvious. The more interesting point is where SlimLM sits. This is the part of the market sold as safer: small models running locally, helping with PDFs, forms, and files on phones and laptops. Better privacy. Lower serving costs. Easier enterprise pitch.

That sales story gets shaky if the pretraining corpus is shaky too.

Why this case stands out

Most AI copyright fights have focused on frontier models. Huge clusters, huge crawls, huge exposure. Adobe’s case points at something the industry has blurred for years: compact models still run on the same messy data pipeline.

SlimLM is described as a small model for document assistance on mobile devices. In practice, that usually means a decoder-only Transformer tuned for jobs like:

  • summarizing PDFs and scans
  • extracting fields from invoices and forms
  • answering questions about local files
  • drafting or editing short text

To make a model like that useful, you still need broad, high-quality text in pretraining. Books are attractive because they’re clean, structured, and long enough to teach coherence across multiple paragraphs. Small models tend to benefit more from that than larger ones do. They have less capacity to brute-force their way through noisy web data.

So the pressure is obvious. The data that helps quality most often carries the most legal risk.

Deduplication doesn’t tell you where the data came from

Adobe’s public description says SlimLM was pretrained on SlimPajama-627B, a deduplicated multi-corpora dataset. Deduplication is good practice. It cuts repeated text, reduces wasted training tokens, and keeps frequency distortions in check.

It says nothing about licensing.

If a pirated book appears 500 times across scraped sources, deduplication may leave one cleaner copy behind. That helps training quality. It does nothing for rights clearance.
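
To make that concrete, here's a minimal sketch of exact-match deduplication, a toy version of what large pipelines do at scale with near-duplicate methods like MinHash. Everything it tracks is about text identity. Nothing it tracks is about rights.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Cheap normalization so trivially re-wrapped copies hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(docs: list[dict]) -> list[dict]:
    """Keep one document per content hash. Note what is never checked here:
    license status, source legitimacy, or whether the surviving copy was
    scraped from a shadow library."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        h = hashlib.sha256(normalize(doc["text"]).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept
```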

That was never a minor technical detail. It was the problem. Dataset maintainers cleaned the corpus. Model builders used it. Product teams shipped. Each layer assumed the one above it had cleared the rights, so the risk got pushed upstream until upstream became the whole issue.

“We used a cleaned open corpus” is not much of a data story. It’s a trust assumption.

That’s what this lawsuit hits. Adobe’s choices, yes, but also the broader habit of treating open pretraining corpora as provenance-checked because they were deduplicated, documented, and widely used.

What SlimLM probably looks like under the hood

Adobe’s exact implementation details should come from its own paper, but the category is familiar by now. A mobile-oriented LM in this class probably uses a compact decoder-only architecture with the usual efficiency stack (a rough sizing sketch follows the list):

  • relatively small parameter counts, likely sub-1B to low billions depending on target hardware
  • RMSNorm and rotary positional embeddings
  • some form of grouped, sliding, or otherwise constrained attention to keep inference practical
  • int8 or int4 quantization for deployment
  • distillation from a larger teacher model
  • optional retrieval over device files or enterprise content
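
For a sense of scale, here's a rough, hypothetical configuration in that style, with a back-of-envelope parameter count and quantized memory footprint. None of these numbers are Adobe's; they're only meant to show why this class of model fits on a phone.

```python
from dataclasses import dataclass

@dataclass
class CompactLMConfig:
    # Hypothetical sizing for a sub-1B, mobile-oriented decoder-only model.
    vocab_size: int = 32_000
    d_model: int = 1024
    n_layers: int = 24
    n_heads: int = 16
    n_kv_heads: int = 4      # grouped-query attention: fewer K/V heads than Q heads
    d_ff: int = 2816         # SwiGLU feed-forward width
    norm: str = "rmsnorm"
    positional: str = "rope"

def approx_params(cfg: CompactLMConfig) -> int:
    """Rough parameter count: tied embeddings + attention + SwiGLU MLP + norms."""
    head_dim = cfg.d_model // cfg.n_heads
    d_kv = cfg.n_kv_heads * head_dim
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * d_kv  # Q, O, K, V
    mlp = 3 * cfg.d_model * cfg.d_ff                               # gate, up, down
    norms = 2 * cfg.d_model
    return cfg.vocab_size * cfg.d_model + cfg.n_layers * (attn + mlp + norms)

cfg = CompactLMConfig()
p = approx_params(cfg)
print(f"~{p / 1e6:.0f}M parameters")
print(f"int8 weights ~{p / 2**20:.0f} MiB, int4 ~{p / 2 / 2**20:.0f} MiB")
```

At those sizes the weights fit comfortably in phone memory, which is the whole pitch.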

The last item on that list, retrieval, matters most. A document assistant shouldn’t rely on pretraining alone. If the system is summarizing a contract or answering questions about a PDF, the product should retrieve from the user’s actual content. Pretraining gives you language fluency and task priors. Retrieval gives you grounding.

That’s one practical way to reduce exposure here. If you narrow the model’s job and ground it in user-owned or licensed corpora, you reduce how much general long-form knowledge needs to be baked into the weights. That doesn’t remove pretraining risk, but it does lower the pressure to ingest every book-like text source you can find.
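
As a toy illustration of that split, the sketch below retrieves context from the user's own document with a deliberately crude keyword-overlap scorer. A real assistant would use embeddings and proper PDF extraction, but the grounding idea is the same: the answer comes from the file, not from whatever books were in pretraining.

```python
import re
from pathlib import Path

def chunk(text: str, size: int = 800) -> list[str]:
    """Split extracted document text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def overlap(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words present in the passage."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / (len(q) or 1)

def build_prompt(query: str, doc_path: Path, top_k: int = 3) -> str:
    """Ground the model in the user's file instead of pretrained recall."""
    text = doc_path.read_text(errors="ignore")  # assumes text was already extracted
    ranked = sorted(chunk(text), key=lambda c: overlap(query, c), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {query}"
```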

Why books matter so much for small models

A lot of engineers know this from evals already: removing long-form books from pretraining can hurt small models more than many teams want to admit.

You see it in:

  • paragraph-to-paragraph coherence
  • long summary quality
  • stylistic consistency
  • instruction following over extended context
  • structured drafting tasks

Larger models can sometimes absorb that loss better because they’ve seen enough other high-quality data and have enough capacity to generalize. Small models have less room for bad data choices. Quality shows up in the outputs fast.

That’s why vendors keep coming back to books, licensed or not. They’re efficient training material. If those sources become legally radioactive, expect a few predictable responses:

  • more public-domain book corpora
  • more publisher licensing deals
  • heavier distillation from larger, better-curated teacher models
  • more synthetic data, though synthetic text still tends to flatten style and repeat model errors
  • tighter product scope, with retrieval doing more of the work

None of that fully replaces broad, high-quality long-form text collected at web scale. That’s the awkward constraint sitting under a lot of current model roadmaps.
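
Of those fallbacks, distillation is the easiest one to pin down in code. The sketch below is the standard knowledge-distillation objective, assuming you have a larger teacher's logits for the same batch. Nothing here is specific to SlimLM or Adobe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term (match the teacher's tempered distribution)
    with the usual cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```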

This is an engineering problem now

Senior engineers and ML leads should stop treating training data provenance like a policy appendix. It’s an operational requirement.

The old workflow was simple: get a dataset, run filtering, dedup it, train, write a model card, move on. That’s too thin for 2026. If you can’t explain where your corpus came from, what was excluded, and how you know, the risk stops being theoretical. It can block releases, sales, procurement, and partnerships.

What teams need now looks a lot closer to software supply chain discipline.

A sane baseline for data governance

If you’re building or buying compact LMs, ask for:

  • a training data summary with source categories
  • dataset versions and immutable hashes
  • exclusion rules for high-risk sources
  • some kind of data SBOM, even if the format is still rough
  • documented license status, or at least a confidence tier for license provenance
  • takedown and remediation workflows

At the pipeline level, that means boring but necessary controls:

  • SHA-256 manifests for dataset shards
  • per-sample or per-source provenance tags
  • deny lists for known shadow-library domains
  • near-duplicate detection with MinHash, SimHash, or LSH-based systems
  • classifiers for book-like or archive-like content
  • snapshots you can actually reproduce months later
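
A minimal version of the manifest piece might look like the sketch below. The field names are illustrative, not a standard; the point is that every shard carries an immutable hash plus provenance and license-confidence metadata you can actually show to a buyer or a court.

```python
import hashlib
import json
from pathlib import Path

def shard_record(path: Path, source_category: str, license_tier: str) -> dict:
    """One manifest entry per dataset shard: immutable hash plus provenance.
    'license_tier' is a hypothetical label ("licensed", "permissive", "unknown");
    the goal is to record a confidence level, not to claim certainty."""
    return {
        "shard": path.name,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "source_category": source_category,   # e.g. "web-crawl", "public-domain-books"
        "license_tier": license_tier,
        "deny_lists_applied": ["shadow-library-domains"],  # documented exclusions
    }

def write_manifest(shards: list[tuple[Path, str, str]], out: Path) -> None:
    """Write a reproducible, versionable manifest next to the training snapshot."""
    records = [shard_record(p, src, tier) for p, src, tier in shards]
    out.write_text(json.dumps(records, indent=2))
```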

This work is tedious. It’s also the difference between “we think it’s fine” and “we can show what went into training.”

Procurement is about to get sharper

Enterprises already ask hard questions about hosting, retention, and data residency. Training provenance is moving into the same bucket.

Expect buyer checklists to expand quickly, especially in regulated sectors. If a vendor ships a local document assistant trained on a vaguely described open corpus, legal and security teams are going to push back. They’ll want documentation that looks a lot more like a software bill of materials, except for data.

That pressure will hit open dataset projects too. Community corpora have mattered because they lowered the cost of entry for research and product teams. They also inherited the web’s licensing chaos. “Open” won’t carry the same weight going forward. Teams are going to ask for source inventories, exclusion criteria, and proof that known high-risk content was screened out.

The EU’s transparency rules around copyrighted material used in training add another layer. US policy is still less settled, but corporate counsel doesn’t need a final federal standard to treat this as material risk.

What developers should do now

If you own a model pipeline, or you’re choosing a vendor, the practical moves are straightforward.

First, tighten scope. A document assistant should lean heavily on retrieval over user or enterprise content. Don’t ask the base model to do extra work it doesn’t need to do.

Second, treat compact models as supply chain artifacts. Keep a provenance manifest alongside the model card, evals, and deployment metadata.

Third, pressure-test your quantized deployment targets against realistic hardware. int8 is still the safer default for broad mobile support. int4 buys memory headroom, but quality can wobble, especially on OCR-heavy or extraction-heavy tasks. If you need stable behavior, calibrate carefully during PTQ or use QAT where the budget allows it.
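
For the int8 baseline, dynamic post-training quantization of the linear layers is the usual low-effort starting point. The sketch below uses PyTorch's built-in dynamic quantization on a stand-in model; it says nothing about how SlimLM is actually deployed, and int4 generally needs a dedicated mobile inference runtime rather than this API.

```python
import torch

# Stand-in model; in practice this would be the exported compact LM.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Quantize Linear weights to int8 at load time; activations stay in float,
# which keeps quality reasonably stable for a first deployment pass.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```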

Fourth, assume that “cleaned and deduplicated” is not a sufficient answer from a vendor.

Ask where the text came from. Ask what was excluded. Ask how they know.

Adobe’s lawsuit won’t settle the copyright mess on its own. It does underline a point engineers should already have internalized: model quality, product fit, and legal exposure now run through the same dependency. The corpus is part of the architecture. If you can’t account for it, you don’t fully control the system you’re shipping.
