Getty drops key copyright claims against Stability AI, but the hard part for AI teams starts now
Getty Images has pulled back from the central claim in its UK copyright case against Stability AI. In the London High Court, Getty dropped the allegation that Stable Diffusion was trained on millions of infringing Getty images.
That helps Stability AI, at least on that part of the case. It does not give AI companies cover to stay sloppy about training data. Getty's UK case still includes trademark infringement, passing off, and a stranger but important theory: that the model itself could qualify as an "infringing article" when imported into or used in the UK. Getty is also still pursuing its US case, where it has sought up to $1.7 billion in damages tied to roughly 11,000 works.
The risk didn't disappear. It moved.
If your team still can't trace where training data came from, what license covered it, and which assets made it into a checkpoint, you have a real problem.
Why Getty narrowed the UK case
Getty's original UK claim went to the biggest unresolved question in generative AI: can you train a model on copyrighted material without permission? Dropping that claim doesn't answer it. It suggests Getty ran into evidentiary or jurisdiction problems serious enough to narrow the case.
That's a meaningful signal. Cross-border AI litigation is ugly. Training may happen in one country, model hosting in another, inference somewhere else, and users everywhere. If a plaintiff can't tie the allegedly infringing acts to the court's jurisdiction, the legal theory starts to wobble.
The claims Getty kept are easier to point to in court:
- Trademark infringement if outputs reproduce Getty branding or watermark-like artifacts
- Passing off if generated images create confusion about source or affiliation
- Model-as-article theory if a trained model can itself be treated as an infringing object under UK law
That last one deserves attention. If courts start treating model weights as regulated artifacts with provenance obligations, compliance starts to look a lot more like software supply chain control and a lot less like old-fashioned copyright record keeping.
Think SBOMs for datasets and checkpoints.
The watermark problem matters
Part of what made Getty's case sticky was simple: early Stable Diffusion outputs sometimes produced warped Getty-style watermarks. Rights holders didn't need to explain diffusion models from scratch. They had a visible artifact that looked like residue from the training set.
From an ML standpoint, watermark leakage is messy evidence. It doesn't prove a specific image was memorized or reproduced verbatim. It does suggest the model retained detectable traces of source material strongly enough for them to show up in outputs. Courts and juries may find that persuasive even without a perfect technical model.
That leaves image-model builders needing filters in two places.
Before training
You need to keep obvious problem assets out of the corpus.
That usually means:
- license metadata attached at ingest
- perceptual hashing to catch known copyrighted or watermarked images
- OCR and image classifiers to detect visible watermark text and logos
- deduplication across mirrored web scrapes, because bad data spreads fast
A SHA-256 hash log is useful for exact file tracking. It won't catch resized, cropped, recompressed, or lightly edited copies. For image work, perceptual hashing or embedding-based similarity search is usually the practical option.
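To make the perceptual-hashing point concrete, here is a minimal difference-hash (dHash) sketch, one common perceptual-hash variant. It assumes the image has already been converted to grayscale and resized to a 9x8 grid of pixel values, which in practice you'd do with a library like Pillow; the image arrays below are synthetic illustrations.

```python
def dhash(pixels):
    """Difference hash: one bit per pixel, set when a pixel is
    brighter than its right-hand neighbor. `pixels` is a grayscale
    image already resized to 9 columns x 8 rows."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

original = [list(range(10, 100, 10))] * 8             # ascending rows
near_dup = [[v + 1 for v in row] for row in original] # re-encoded copy
reversed_img = [row[::-1] for row in original]        # unrelated image

assert hamming(dhash(original), dhash(near_dup)) == 0   # survives re-encoding
assert hamming(dhash(original), dhash(reversed_img)) == 64
```

Unlike SHA-256, the hash depends on relative brightness gradients, so resizing or recompression that preserves the image's structure leaves the hash nearly unchanged, while unrelated images land far apart in Hamming distance.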
After inference
If the model can emit watermark-like patterns, you need output-side controls.
That could include:
- a moderation pass that flags likely watermark artifacts
- a CV model trained to detect known brand marks or stock-photo overlays
- policy rules that block export or publication of flagged images
This adds latency and cost. It also creates false positives, which product teams hate. Still, shipping images with ghost watermarks is a legal and reputational mess.
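The output-side policy layer can stay simple even if the detector behind it is not. A sketch, assuming a hypothetical watermark detector that returns a confidence score; the threshold values and decision names are illustrative, not a standard:

```python
# Thresholds are illustrative; real values come from tuning the
# detector against labeled outputs.
BLOCK_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.50

def gate_output(watermark_score: float) -> str:
    """Map a detector confidence score to a publication decision."""
    if watermark_score >= BLOCK_THRESHOLD:
        return "block"    # refuse export, keep an audit record
    if watermark_score >= REVIEW_THRESHOLD:
        return "review"   # queue for human moderation
    return "allow"

assert gate_output(0.92) == "block"
assert gate_output(0.60) == "review"
assert gate_output(0.10) == "allow"
```

The three-way split matters in practice: a hard block at a single threshold either over-blocks (angering product teams) or under-blocks (shipping ghost watermarks), while a review band absorbs the detector's uncertain middle range.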
Provenance is baseline engineering now
A lot of generative AI teams still treat dataset provenance as documentation debt. That's a bad habit.
If you're collecting images from partners, public repositories, internal archives, and web scrapes, you need to know what entered the pipeline and when. Broad summaries won't do. Asset by asset where possible, collection by collection at minimum.
A workable provenance stack usually includes:
- source URL or provider ID
- collection date
- asserted license
- file hash for exact identity
- transforms applied during preprocessing
- inclusion or exclusion decision
- dataset version tied to each training run
- checkpoint metadata linked back to dataset snapshots
This doesn't require blockchain nonsense. It does require immutability and auditability. Append-only event logs, WORM storage, signed manifests, dataset versioning, and a model registry with compliance fields cover most of it.
The tooling exists for parts of this. MLflow, DVC, Weights & Biases, and Pachyderm can track runs, artifacts, and lineage. What's usually missing is discipline. Teams log hyperparameters obsessively and keep vague spreadsheets for data rights. That's backwards.
A trivial provenance logger might look like this:
import hashlib, json, time

def log_asset(path, license_info):
    # Hash the exact bytes so the asset can be re-identified later.
    with open(path, "rb") as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": time.time(),
        "file": path,
        "sha256": sha256,
        "license": license_info,
    }
    # Append-only JSON Lines log: one record per ingested asset.
    with open("provenance.log", "a") as log:
        log.write(json.dumps(entry) + "\n")
That's fine as a toy. In production, you'd want signed records, immutable storage, and collection-level manifests so you can answer the question lawyers eventually ask: which exact dataset build trained this model version?
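Signing doesn't require exotic tooling either. A minimal sketch of tamper-evident records using Python's standard-library hmac module; the key handling here is deliberately naive (in production the key lives in a KMS or secrets manager, not in source):

```python
import hashlib, hmac, json

SIGNING_KEY = b"replace-with-a-managed-secret"  # illustrative only

def sign_manifest(manifest: dict) -> dict:
    """Attach an HMAC so later tampering is detectable."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": sig}

def verify_manifest(signed: dict) -> bool:
    payload = json.dumps(signed["manifest"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

record = sign_manifest({"dataset": "crawl-2024-03", "assets": 120000})
assert verify_manifest(record)
record["manifest"]["dataset"] = "something-else"
assert not verify_manifest(record)   # edit after signing is caught
```

Sorting the JSON keys before signing matters: without a canonical serialization, two logically identical manifests can produce different signatures.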
If you can't answer that cleanly, you're exposed.
The "infringing article" theory could affect model distribution
Getty's argument that the model itself may be an infringing article sounds odd. It also sounds plausible enough to matter.
If a court accepts even part of that framing, model files stop looking like abstract software artifacts and start looking like goods with import and distribution risk. That carries real consequences:
- checkpoint movement across borders may need tighter controls
- model registries may need legal status fields, not just technical metadata
- distributors of open-weight models could face more scrutiny than API-only vendors
- enterprises may ask for provenance attestations before deploying third-party models internally
Open-weight ecosystems look more exposed here than closed API platforms. If you ship the weights, you ship the disputed artifact. API vendors have their own legal exposure, but the distribution mechanics are different.
What engineering teams should change now
The practical response is not to freeze model work. It's to stop treating data governance as something you can bolt on later.
A few changes are worth making now.
Treat datasets like build dependencies
If your app team tracks every package version in package-lock.json or poetry.lock, your ML team should track dataset snapshots with the same discipline. A model should point to an immutable dataset manifest, not "latest filtered crawl."
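The lockfile analogy translates directly into code. A sketch of pinning a dataset snapshot the way a lockfile pins package versions; the manifest fields are illustrative:

```python
import hashlib, json

def manifest_digest(manifest: dict) -> str:
    """Deterministic hash of a dataset manifest (canonical JSON)."""
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def check_pin(manifest: dict, pinned_digest: str) -> None:
    """Fail the training run if the dataset drifted from its pin."""
    actual = manifest_digest(manifest)
    if actual != pinned_digest:
        raise RuntimeError(f"dataset drift: {actual} != {pinned_digest}")

manifest = {"snapshot": "filtered-crawl-2024-03",
            "assets": ["a.jpg", "b.jpg"]}
pin = manifest_digest(manifest)  # stored with the model config
check_pin(manifest, pin)         # passes; raises if the dataset changed
```

The point is that "which data trained this model" becomes a hash comparison rather than an archaeology project.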
Add compliance checks to CI/CD
Deployment gates shouldn't stop at eval metrics. They should also verify that:
- dataset license fields are present
- checkpoint provenance is attached
- watermark and logo scans ran
- policy tags exist for downstream use restrictions
If those checks fail, the deploy should fail too.
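The gate itself can be a few lines. A hypothetical deploy check, assuming the model registry exposes its entry as a dict; the metadata keys are made up for illustration, not a standard schema:

```python
# Keys a registry entry must carry before deploy; illustrative names.
REQUIRED_CHECKS = ("license", "provenance", "watermark_scan", "policy_tags")

def deploy_gate(model_meta: dict) -> list:
    """Return the failed compliance checks; empty means deployable."""
    return [check for check in REQUIRED_CHECKS if not model_meta.get(check)]

meta = {"license": "commercial-ok",
        "provenance": "ds-v14",
        "watermark_scan": True}          # policy_tags missing
failures = deploy_gate(meta)
assert failures == ["policy_tags"]       # one missing field blocks deploy
```

In a real pipeline this runs as a CI step next to the eval gates, and a non-empty result exits nonzero.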
Separate training rights from output rights
This is where plenty of teams get sloppy. A dataset may be usable for internal research but not commercial deployment. Or usable for training but not for generating customer-facing assets that mimic the source style. Legal terms vary a lot. Your metadata model needs to reflect that.
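One way to keep those distinctions from collapsing into a single "licensed: yes" flag is to model each right explicitly. A sketch with made-up flag names; real license terms need legal review before they become booleans:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageRights:
    train_internal: bool     # may train internal research models
    train_commercial: bool   # may train commercially deployed models
    generate_outputs: bool   # outputs may be shown to customers

research_only = UsageRights(train_internal=True,
                            train_commercial=False,
                            generate_outputs=False)

def allowed_for(rights: UsageRights, purpose: str) -> bool:
    """Check a specific right; unknown purposes default to denied."""
    return getattr(rights, purpose, False)

assert allowed_for(research_only, "train_internal")
assert not allowed_for(research_only, "generate_outputs")
```

Defaulting unknown purposes to denied is the important design choice: new use cases must be explicitly granted, never inherited by accident.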
Budget for slower pipelines
Compliance work costs time, storage, and compute. Deduping huge image corpora, running watermark detectors, and preserving audit trails aren't free. They're still cheaper than retraining after a takedown or discovering months later that a production model depends on data nobody can defend.
The signal for the industry
Getty's narrowed UK case doesn't settle the training-copyright question. It does show where rights holders think they can get traction right now.
Outputs that visibly echo protected source material. Brand confusion. Distribution of model artifacts with weak provenance. Thin internal records.
Those are engineering problems before they become courtroom problems.
The companies in the safest position won't be the ones issuing polished statements about responsible AI. They'll be the ones that can produce a clean chain of custody for data, explain how suspect material was filtered, and show which controls ran before a model shipped.
It's boring work. It's also the work.
If you're running an ML platform team, this is a good time to inspect your model registry schema and ask a simple question: for your last production checkpoint, can you name the dataset version, the license basis, and the filters that ran before training?
If the answer is fuzzy, start there.