Getty drops key copyright claims against Stability AI, but the hard part for AI teams starts now
Getty Images has pulled back from the central claim in its UK copyright case against Stability AI. In the London High Court, Getty dropped the allegation that Stable Diffusion was trained on millions of infringing Getty images.
That helps Stability AI, at least on that part of the case. It does not give AI companies cover to stay sloppy about training data. Getty's UK case still includes trademark infringement, passing off, and a stranger but important theory: that the model itself could qualify as an "infringing article" when imported into or used in the UK. Getty is also still pursuing its US case, where it has sought up to $1.7 billion in damages tied to roughly 11,000 works.
The risk didn't disappear. It moved.
If your team still can't trace where training data came from, what license covered it, and which assets made it into a checkpoint, you have a real problem.
Why Getty narrowed the UK case
Getty's original UK claim went to the biggest unresolved question in generative AI: can you train a model on copyrighted material without permission? Dropping that claim doesn't answer it. It suggests Getty ran into evidentiary or jurisdiction problems serious enough to narrow the case.
That's a meaningful signal. Cross-border AI litigation is ugly. Training may happen in one country, model hosting in another, inference somewhere else, and users everywhere. If a plaintiff can't tie the allegedly infringing acts to the court's jurisdiction, the legal theory starts to wobble.
The claims Getty kept are easier to point to in court:
- Trademark infringement if outputs reproduce Getty branding or watermark-like artifacts
- Passing off if generated images create confusion about source or affiliation
- Model-as-article theory if a trained model can itself be treated as an infringing object under UK law
That last one deserves attention. If courts start treating model weights as regulated artifacts with provenance obligations, compliance starts to look a lot more like software supply chain control and a lot less like old-fashioned copyright record keeping.
Think SBOMs for datasets and checkpoints.
The watermark problem matters
Part of what made Getty's case sticky was simple: early Stable Diffusion outputs sometimes produced warped Getty-style watermarks. Rights holders didn't need to explain diffusion models from scratch. They had a visible artifact that looked like residue from the training set.
From an ML standpoint, watermark leakage is messy evidence. It doesn't prove a specific image was memorized or reproduced verbatim. It does suggest the model retained detectable traces of source material strongly enough for them to show up in outputs. Courts and juries may find that persuasive even without a perfect technical model.
That leaves image-model builders needing filters in two places.
Before training
You need to keep obvious problem assets out of the corpus.
That usually means:
- license metadata attached at ingest
- perceptual hashing to catch known copyrighted or watermarked images
- OCR and image classifiers to detect visible watermark text and logos
- deduplication across mirrored web scrapes, because bad data spreads fast
A SHA-256 hash log is useful for exact file tracking. It won't catch resized, cropped, recompressed, or lightly edited copies. For image work, perceptual hashing or embedding-based similarity search is usually the practical option.
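To make the perceptual-hashing point concrete, here is a minimal difference-hash (dHash) sketch, one common perceptual-hash variant. It assumes the image has already been converted to grayscale and resized to a 9x8 grid of pixel values, which in practice you'd do with a library like Pillow; the image arrays below are synthetic illustrations.

```python
def dhash(pixels):
    """Difference hash: one bit per pixel, set when a pixel is
    brighter than its right-hand neighbor. `pixels` is a grayscale
    image already resized to 9 columns x 8 rows."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

original = [list(range(10, 100, 10))] * 8             # ascending rows
near_dup = [[v + 1 for v in row] for row in original] # re-encoded copy
reversed_img = [row[::-1] for row in original]        # unrelated image

assert hamming(dhash(original), dhash(near_dup)) == 0   # survives re-encoding
assert hamming(dhash(original), dhash(reversed_img)) == 64
```

Unlike SHA-256, the hash depends on relative brightness gradients, so resizing or recompression that preserves the image's structure leaves the hash nearly unchanged, while unrelated images land far apart in Hamming distance.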
After inference
If the model can emit watermark-like patterns, you need output-side controls.
That could include:
- a moderation pass that flags likely watermark artifacts
- a CV model trained to detect known brand marks or stock-photo overlays
- policy rules that block export or publication of flagged images
This adds latency and cost. It also creates false positives, which product teams hate. Still, shipping images with ghost watermarks is a legal and reputational mess.
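The output-side policy layer can stay simple even if the detector behind it is not. A sketch, assuming a hypothetical watermark detector that returns a confidence score; the threshold values and decision names are illustrative, not a standard:

```python
# Thresholds are illustrative; real values come from tuning the
# detector against labeled outputs.
BLOCK_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.50

def gate_output(watermark_score: float) -> str:
    """Map a detector confidence score to a publication decision."""
    if watermark_score >= BLOCK_THRESHOLD:
        return "block"    # refuse export, keep an audit record
    if watermark_score >= REVIEW_THRESHOLD:
        return "review"   # queue for human moderation
    return "allow"

assert gate_output(0.92) == "block"
assert gate_output(0.60) == "review"
assert gate_output(0.10) == "allow"
```

The three-way split matters in practice: a hard block at a single threshold either over-blocks (angering product teams) or under-blocks (shipping ghost watermarks), while a review band absorbs the detector's uncertain middle range.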
Provenance is baseline engineering now
A lot of generative AI teams still treat dataset provenance as documentation debt. That's a bad habit.
If you're collecting images from partners, public repositories, internal archives, and web scrapes, you need to know what entered the pipeline and when. Broad summaries won't do. Asset by asset where possible, collection by collection at minimum.
A workable provenance stack usually includes:
- source URL or provider ID
- collection date
- asserted license
- file hash for exact identity
- transforms applied during preprocessing
- inclusion or exclusion decision
- dataset version tied to each training run
- checkpoint metadata linked back to dataset snapshots
This doesn't require blockchain nonsense. It does require immutability and auditability. Append-only event logs, WORM storage, signed manifests, dataset versioning, and a model registry with compliance fields cover most of it.
The tooling exists for parts of this. MLflow, DVC, Weights & Biases, and Pachyderm can track runs, artifacts, and lineage. What's usually missing is discipline. Teams log hyperparameters obsessively and keep vague spreadsheets for data rights. That's backwards.
A trivial provenance logger might look like this:
import hashlib, json, time

def log_asset(path, license_info):
    # Hash the exact bytes so the asset can be re-identified later.
    with open(path, "rb") as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": time.time(),
        "file": path,
        "sha256": sha256,
        "license": license_info,
    }
    # Append-only JSON Lines log: one record per ingested asset.
    with open("provenance.log", "a") as log:
        log.write(json.dumps(entry) + "\n")
That's fine as a toy. In production, you'd want signed records, immutable storage, and collection-level manifests so you can answer the question lawyers eventually ask: which exact dataset build trained this model version?
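Signing doesn't require exotic tooling either. A minimal sketch of tamper-evident records using Python's standard-library hmac module; the key handling here is deliberately naive (in production the key lives in a KMS or secrets manager, not in source):

```python
import hashlib, hmac, json

SIGNING_KEY = b"replace-with-a-managed-secret"  # illustrative only

def sign_manifest(manifest: dict) -> dict:
    """Attach an HMAC so later tampering is detectable."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": sig}

def verify_manifest(signed: dict) -> bool:
    payload = json.dumps(signed["manifest"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

record = sign_manifest({"dataset": "crawl-2024-03", "assets": 120000})
assert verify_manifest(record)
record["manifest"]["dataset"] = "something-else"
assert not verify_manifest(record)   # edit after signing is caught
```

Sorting the JSON keys before signing matters: without a canonical serialization, two logically identical manifests can produce different signatures.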
If you can't answer that cleanly, you're exposed.
The "infringing article" theory could affect model distribution
Getty's argument that the model itself may be an infringing article sounds odd. It also sounds plausible enough to matter.
If a court accepts even part of that framing, model files stop looking like abstract software artifacts and start looking like goods with import and distribution risk. That carries real consequences:
- checkpoint movement across borders may need tighter controls
- model registries may need legal status fields, not just technical metadata
- distributors of open-weight models could face more scrutiny than API-only vendors
- enterprises may ask for provenance attestations before deploying third-party models internally
Open-weight ecosystems look more exposed here than closed API platforms. If you ship the weights, you ship the disputed artifact. API vendors have their own legal exposure, but the distribution mechanics are different.
What engineering teams should change now
The practical response is not to freeze model work. It's to stop treating data governance as something you can bolt on later.
A few changes are worth making now.
Treat datasets like build dependencies
If your app team tracks every package version in package-lock.json or poetry.lock, your ML team should track dataset snapshots with the same discipline. A model should point to an immutable dataset manifest, not "latest filtered crawl."
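The lockfile analogy translates directly into code. A sketch of pinning a dataset snapshot the way a lockfile pins package versions; the manifest fields are illustrative:

```python
import hashlib, json

def manifest_digest(manifest: dict) -> str:
    """Deterministic hash of a dataset manifest (canonical JSON)."""
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def check_pin(manifest: dict, pinned_digest: str) -> None:
    """Fail the training run if the dataset drifted from its pin."""
    actual = manifest_digest(manifest)
    if actual != pinned_digest:
        raise RuntimeError(f"dataset drift: {actual} != {pinned_digest}")

manifest = {"snapshot": "filtered-crawl-2024-03",
            "assets": ["a.jpg", "b.jpg"]}
pin = manifest_digest(manifest)  # stored with the model config
check_pin(manifest, pin)         # passes; raises if the dataset changed
```

The point is that "which data trained this model" becomes a hash comparison rather than an archaeology project.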
Add compliance checks to CI/CD
Deployment gates shouldn't stop at eval metrics. They should also verify that:
- dataset license fields are present
- checkpoint provenance is attached
- watermark and logo scans ran
- policy tags exist for downstream use restrictions
If those checks fail, the deploy should fail too.
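The gate itself can be a few lines. A hypothetical deploy check, assuming the model registry exposes its entry as a dict; the metadata keys are made up for illustration, not a standard schema:

```python
# Keys a registry entry must carry before deploy; illustrative names.
REQUIRED_CHECKS = ("license", "provenance", "watermark_scan", "policy_tags")

def deploy_gate(model_meta: dict) -> list:
    """Return the failed compliance checks; empty means deployable."""
    return [check for check in REQUIRED_CHECKS if not model_meta.get(check)]

meta = {"license": "commercial-ok",
        "provenance": "ds-v14",
        "watermark_scan": True}          # policy_tags missing
failures = deploy_gate(meta)
assert failures == ["policy_tags"]       # one missing field blocks deploy
```

In a real pipeline this runs as a CI step next to the eval gates, and a non-empty result exits nonzero.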
Separate training rights from output rights
This is where plenty of teams get sloppy. A dataset may be usable for internal research but not commercial deployment. Or usable for training but not for generating customer-facing assets that mimic the source style. Legal terms vary a lot. Your metadata model needs to reflect that.
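One way to keep those distinctions from collapsing into a single "licensed: yes" flag is to model each right explicitly. A sketch with made-up flag names; real license terms need legal review before they become booleans:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageRights:
    train_internal: bool     # may train internal research models
    train_commercial: bool   # may train commercially deployed models
    generate_outputs: bool   # outputs may be shown to customers

research_only = UsageRights(train_internal=True,
                            train_commercial=False,
                            generate_outputs=False)

def allowed_for(rights: UsageRights, purpose: str) -> bool:
    """Check a specific right; unknown purposes default to denied."""
    return getattr(rights, purpose, False)

assert allowed_for(research_only, "train_internal")
assert not allowed_for(research_only, "generate_outputs")
```

Defaulting unknown purposes to denied is the important design choice: new use cases must be explicitly granted, never inherited by accident.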
Budget for slower pipelines
Compliance work costs time, storage, and compute. Deduping huge image corpora, running watermark detectors, and preserving audit trails aren't free. They're still cheaper than retraining after a takedown or discovering months later that a production model depends on data nobody can defend.
The signal for the industry
Getty's narrowed UK case doesn't settle the training-copyright question. It does show where rights holders think they can get traction right now.
Outputs that visibly echo protected source material. Brand confusion. Distribution of model artifacts with weak provenance. Thin internal records.
Those are engineering problems before they become courtroom problems.
The companies in the safest position won't be the ones issuing polished statements about responsible AI. They'll be the ones that can produce a clean chain of custody for data, explain how suspect material was filtered, and show which controls ran before a model shipped.
It's boring work. It's also the work.
If you're running an ML platform team, this is a good time to inspect your model registry schema and ask a simple question: for your last production checkpoint, can you name the dataset version, the license basis, and the filters that ran before training?
If the answer is fuzzy, start there.