Context
Last week’s Columbia Journalism Review piece details two diverging responses to generative-AI platforms that have hoovered up news articles and other copyrighted works: negotiated licensing deals (e.g., OpenAI–Axel Springer, OpenAI–AP, OpenAI–Financial Times) and high-stakes litigation (most prominently, The New York Times v. OpenAI/Microsoft). Both strategies aim to answer a deceptively simple question with billion-dollar consequences: Is it fair use to train AI systems on copyrighted content without permission or payment?
Why the Deals Matter
- Market signals – When multiple rightsholders accept licensing checks instead of suing, they help build a factual record that a viable licensing market for training data exists. Under U.S. fair-use factor 4 ("effect upon the potential market"), the existence of such a market weighs against fair use.
- Precedent by contract – Contracts don’t bind non-signatories, but they can influence courts. Judges assessing whether a use is “customary” often look to industry practice. A critical mass of licenses could make “unpaid scraping” look increasingly outside the norm.
- Valuation benchmarks – Confidential deal terms leak. Plaintiffs can point to dollar figures—"OpenAI paid X for Y articles"—as concrete evidence of economic harm when their own works are used for free.
Why the Lawsuits Matter
- Factual excavation – Discovery can force platforms to disclose exactly how copyrighted works are collected and used in training, filling the information vacuum that has plagued fair-use analysis so far.
- Doctrinal testing – Prior fair-use precedents (Google Books, Warhol v. Goldsmith) provide analogies but not direct answers. Litigation lets courts test whether AI training is more like "non-expressive indexing" (favors fair use) or "commercial substitution" (weighs against).
- Exposing statutory gaps – If courts split on the question, the resulting uncertainty could prompt Congress to create sui generis AI-training rights, akin to the DMCA’s anti-circumvention provisions enacted in 1998.
Tensions Exposed
- Transformative purpose vs. transformative output – Platforms argue that using text as raw material to learn statistical relationships is transformative. Publishers counter that when chatbots regurgitate near-verbatim excerpts or divert search traffic away from their sites, the use becomes exploitative.
- Public benefit vs. private capture – Courts traditionally favor fair use that yields broad public knowledge (think search indexes). Critics note that generative AI locks insights behind proprietary APIs, blunting the “public benefit” claim.
- Scale as destiny – Fair-use jurisprudence has rarely grappled with trillion-token datasets. Massive scale magnifies both the utility and the potential market harm, pushing judges into uncharted territory.
Possible Outcomes
- Coexistence via licensing – If enough big outlets sign deals, platforms may shift to a paid-by-default model, quietly conceding that unlicensed training is too risky.
- Split the baby – Courts might deem training itself fair use while holding outputs that reproduce protected expression infringing, forcing platforms to adopt technical guardrails and indemnity regimes.
- Legislative reset – Prolonged uncertainty could spur Congress to craft a compulsory license or a text-and-data-mining exception, as the EU did with its DSM Directive.
Takeaway
Licenses and lawsuits are not mutually exclusive skirmishes—they are complementary fronts in the same war. Each new deal weakens the fair-use defense by proving a market exists; each new lawsuit pressures platforms to settle on publisher-friendly terms. Until an appellate court—or Congress—draws a bright line, the AI industry will navigate a patchwork of private contracts and legal risk, with the definition of fair use hanging in the balance.