Court Orders OpenAI to Reveal Why It Deleted Two Huge Datasets of Allegedly Pirated Books | Culture | LIVING LIFE FEARLESS
ishmael n. daro/OpenAI

TL;DR

  • A U.S. federal judge has ordered OpenAI to turn over internal communications — including lawyer-client correspondence — about why it deleted two massive datasets of allegedly pirated books (known as “Books1” and “Books2”).
  • The court found that OpenAI’s shifting explanations for the deletions waived attorney-client privilege, meaning the hidden communications are now discoverable.
  • The decision marks a major win for the authors and publishers suing OpenAI — potentially exposing internal deliberations that could prove whether the deletion was legitimate or an attempt to hide evidence.
  • This ruling could reshape how copyright lawsuits against AI companies proceed, especially around data provenance, evidence destruction, and the transparency of AI-training practices.

What the court decided — and why it matters

A federal judge, Ona T. Wang, ruled that OpenAI must disclose all internal communications about the deletion of two huge datasets — “Books1” and “Books2” — that allegedly contained pirated books. These datasets were reportedly removed in 2022.

The fight centered on whether OpenAI’s internal messages with its lawyers are protected by attorney-client privilege. The judge rejected OpenAI’s position, finding that by offering “non-use” as a reason for deletion — then shifting its stance — the company effectively waived privilege. In other words: once you claim a reason, you can’t shield all related communications.

In her ruling, Judge Wang wrote that because OpenAI placed its “state of mind” at issue (claiming good faith deletion), a jury is entitled to see the basis for that claim, including any legal advice surrounding the deletion decision.

What this means for the authors, publishers, and the broader lawsuit

For the plaintiffs (authors and publishers), this is a major breakthrough. The newly available communications — Slack logs, emails, legal memos — could reveal whether the deletion was a genuine act of housekeeping or a strategic cleanup to avoid legal liability. That could influence not just this case’s outcome, but potentially future statutory damages tied to willful copyright infringement.

If the internal documents show that deletion was prompted by legal risk — rather than “non-use” — it could strengthen claims that OpenAI knowingly used copyrighted material and then attempted to suppress evidence. That, in turn, could lead to steep statutory penalties: up to $150,000 per work under U.S. copyright law.

Beyond damages, this ruling pushes AI-training practices into uncharted legal territory: metadata, internal memos, even dataset-management logs — previously considered private corporate data — may now be exposed in AI-copyright lawsuits.

This isn’t just a win for authors — it’s potentially a precedent-setter. As one legal-policy expert put it, the decision defines “the discovery framework that will govern every major AI copyright lawsuit for the next decade.”

Key takeaways:

  • AI companies can’t assume internal legal communications — especially about controversial training data — remain shielded once they publicly make claims about data deletion or usage.
  • Claiming “non-use” or “data hygiene” doesn’t automatically protect you from discovery if legal risk or privilege is at issue.
  • If courts repeatedly force disclosure, future lawsuits will likely dig deep — into legal memos, strategy docs, internal correspondence.
  • Companies may need to rethink how they handle potentially infringing data, not just at the dataset-curation level, but at the communication and documentation level as well.

What to watch next

  • Court’s review of the disclosed communications: what they reveal about the reasoning behind deleting Books1/Books2, and whether there was intent to conceal.
  • Whether plaintiffs invoke “spoliation” — i.e., destruction of evidence — which could trigger adverse-inference instructions or other sanctions if deletion was done with litigation in mind.
  • Whether this ruling emboldens authors and publishers filing similar suits against AI firms, pushing for deeper transparency into dataset sourcing and internal decision-making.
  • Whether it spurs stronger regulatory or legislative pressure around AI training data provenance, record-keeping, and rights compliance.