The Many Lawsuits of OpenAI, Part II: Copyright Infringement

The rise of ChatGPT and OpenAI has raised a number of novel questions around copyright law. Those questions will be fiercely debated in court.
On June 28, two novelists filed a lawsuit against OpenAI, the company that created ChatGPT. The lawsuit alleges that OpenAI’s use of the novelists’ books to train GPT, the series of language models that power ChatGPT, amounts to copyright infringement. They have sought to have their lawsuit certified as a class action on behalf of all authors in the United States whose copyrighted works were used by OpenAI.

This case raises novel questions in the realm of copyright law. These questions go to the heart of the tension between encouraging innovation in AI and protecting the intellectual property rights of creators.

What’s in an LLM, anyway?

Generative AI technology is powered by complex algorithmic systems known as large language models (LLMs). An LLM includes a set of algorithms that analyze data in order to build and optimize a vast network of nodes. A node can be analogized to a neuron in the human brain. The relationships and connections between the nodes are defined by numerical values known as weights. The algorithms, the network, and the weights are what collectively comprise an LLM.

The data analyzed by an LLM is referred to as training data. An LLM doesn’t work by simply storing its training data or “filing away” the information in the training data. In fact, none of the training data itself is necessarily retained within the LLM. Rather, the LLM works by recognizing patterns in the data and by interpolating meanings and implications based on those patterns as well as the LLM’s previous analysis of other data. The weights are the means by which the LLM retains this information. As training data is fed to an LLM, the values of the LLM’s weights are continuously and automatically adjusted to optimize and fine-tune the LLM.
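The weight-adjustment process described above can be illustrated with a minimal sketch. This is not OpenAI's actual training code, and real LLMs have billions of weights rather than one; the point is simply that the model stores only adjusted numerical weights, not the training data itself. Here a single hypothetical "node" with one weight learns the pattern "output is twice the input" from toy training pairs using plain gradient descent:

```python
def train(pairs, weight=0.0, learning_rate=0.01, epochs=200):
    """Adjust `weight` to minimize squared error over the training pairs."""
    for _ in range(epochs):
        for x, target in pairs:
            prediction = weight * x
            error = prediction - target
            # The weight is nudged in the direction that reduces the error.
            # Note that the training pair itself is never stored anywhere;
            # only the value of the weight changes.
            weight -= learning_rate * error * x
    return weight

# Hypothetical training data embodying the pattern y = 2x
training_data = [(1, 2), (2, 4), (3, 6)]
learned = train(training_data)
print(round(learned, 2))  # converges near 2.0
```

After training, the model can generalize the pattern to inputs it never saw (e.g., predicting 10 for an input of 5), even though no copy of the training pairs survives inside it. That is the sense in which an LLM "retains information" through its weights rather than by filing away the data.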

The volume of training data needed to effectively train an LLM is truly massive. The more training data fed to an LLM, the “smarter” and more sophisticated it becomes (hence the term “training”). And the “smarter” and more sophisticated it becomes, the more questions it can answer and the better the quality of content generated using the LLM.

OpenAI acquires training data for GPT from a variety of sources. Among these sources are various digital libraries and repositories that are publicly available on the Internet. These sources include copies of copyrighted books.

The plaintiffs take issue with OpenAI’s copying of digital libraries that include their books and its ingestion of those digital libraries as part of the training data for GPT. They also take issue with the summaries of their works that ChatGPT generates using its LLMs in response to user prompts.

Is all fair in love and AI?

The most common defense to claims of copyright infringement is fair use. Fair use is a doctrine under US copyright law that permits the use of copyrighted works without consent or compensation in certain circumstances. Courts consider four factors in determining whether the use of a copyrighted work qualifies as fair use:

  1. The purpose and character of the use
  2. The nature of the copyrighted work
  3. The amount and substantiality of the portion used relative to the work as a whole
  4. The effect of the use upon the potential market for the copyrighted work

No single factor is dispositive. A court may find a particular use of a copyrighted work to be fair use even if it’s for commercial purposes. Conversely, a court may find a particular use of a copyrighted work to be infringing even if it’s for noncommercial purposes. Courts analyze these factors holistically on a case-by-case basis.

The crux of the matter, therefore, is whether OpenAI’s use of the plaintiffs’ works in training GPT and its use of the LLMs to generate summaries of the works in response to user prompts qualifies as fair use.

This is a story of transformation… maybe.

As noted above, LLMs don’t necessarily retain copies of the data used to train them, nor does the training process necessarily result in the creation of permanent copies. Any copies that are created are usually temporary and incidental to the training process itself. The plaintiffs don’t allege that OpenAI is retaining verbatim copies or reproductions of their works.

Rather, the plaintiffs argue that the GPT LLMs themselves are infringing works by virtue of having been trained on their copyrighted books. According to the plaintiffs, “[b]ecause the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works.”

This raises the question of whether the training process is a transformative use of the plaintiffs’ books. A transformative use is one that has “a further purpose or different character, and do[es] not substitute for the original use of the work.” Courts are more likely to find fair use in cases where they conclude that a defendant’s use of a copyrighted work is transformative.

Is there a story within the story?

Training an LLM on a particular work doesn’t result in the creation of any distinct node or connection that’s specific to that work. Rather, the effect of the work is on the LLM as a whole. Accordingly, an LLM is the product of all data used to train the LLM. An LLM can’t be sliced and diced into discrete components where each component only includes information from an individual work found in the training data.

Moreover, LLMs are the products of numerous works, no single one of which is necessarily retained verbatim within the LLM either in whole or in part. The plaintiffs themselves acknowledge that an LLM “mixes together expressive material derived from many sources.” Based on this, OpenAI would argue that its use of the plaintiffs’ books is transformative because the extent to which its LLMs incorporate information from any single one of the books (or any other individual work, for that matter) is infinitesimal.

In this vein, OpenAI might analogize an LLM to human-generated content. In creating new content, a person may draw upon other content they consumed at various points throughout their life without expressly replicating that content verbatim. For example, a creative writing instructor may author a textbook on creative writing that cites and draws upon writing styles used by famous authors. Such a textbook would be transformative and would clearly not infringe upon the authors’ works merely by virtue of citing and drawing upon their writing styles. Neither would any works authored by readers of the textbook be infringing merely by virtue of drawing inspiration from the writing styles described in the textbook.

Has the truth of the story been greatly exaggerated?

On the other hand, the LLMs used by generative AI are not human intellect, nor are they the same as content generated by human intellect. Unlike human intellect or human-generated content, an LLM is shaped by its training data in very direct and quantifiable ways. And unlike human-generated content, content generated using an LLM necessarily, in one way or another, draws upon all of the data used to train the LLM.

In an exhibit to the complaint, the plaintiffs reproduce ChatGPT’s summaries of their novels along with the prompts used to generate those summaries. Some of the prompts ask for summaries of entire novels, while others ask for summaries of certain portions of the novels. According to the plaintiffs, all of ChatGPT’s summaries are descriptive, detailed, and mostly accurate. On this basis, the plaintiffs may argue that ChatGPT’s LLMs are not transformative because they incorporate significant amounts of very detailed information from the plaintiffs’ novels. The plaintiffs would argue that the GPT LLMs and the summaries generated from them amount to nothing more than a repackaging of the information in the plaintiffs’ novels.

In this vein, the plaintiffs might analogize the use of their novels as training data for an LLM to the adaptation of a novel into a movie or a play. That type of use, though it may add original creative elements, is generally not seen as transformative. The court in Authors Guild v. Google, Inc., a seminal case on transformative use and the fair use doctrine, noted that adaptations of novels into movies or plays are not transformative because they are mere “changes of form” that “do not involve the kind of transformative purpose that favors a fair use finding.”

So, is the incorporation of “expressive information” from copyrighted works into an LLM a transformative use of the works? Or does it amount to a mere “change of form” that lacks “the kind of transformative purpose that favors a fair use finding”? These are among the questions that courts will need to grapple with in resolving the matter of whether training LLMs on copyrighted works qualifies as fair use or not.

How do you go by the book when the book hasn’t been written yet?

Irrespective of the merits of this particular case, there will likely be a flood of copyright infringement lawsuits against OpenAI and others who make unlicensed use of copyrighted works to train LLMs. The novelty and sudden success of ChatGPT coupled with the uncharted regulatory waters of generative AI invite copious amounts of litigation.

Perhaps in an effort to preempt such a trend, Japan recently declared that it would not enforce copyright on the use of content to train AI models. Japan’s policy allows for AI language models to be trained on any copyrighted content “regardless of whether it is for non-profit or commercial purposes, whether it is an act other than reproduction, or whether it is content obtained from illegal sites or otherwise.”

Japan’s decision essentially amounts to a declaration that the use of copyrighted works to train LLMs qualifies as fair use. In so declaring, Japan is making a policy judgment that the societal benefits of innovation in generative AI technology outweigh the importance of enforcing intellectual property rights. This may embolden generative AI companies like OpenAI to lobby for the same policy to be adopted in the United States.

In the United States, however, administrative agencies like the US Copyright Office do not have the authority to decide what constitutes fair use. Only Congress and the federal courts can do that. Given how slowly the gears of the legislative and judicial branches tend to move, it could be a while before this chapter in the story of generative AI is fully written.

Stay tuned for Part III!

Who is Dev Legal?

Sabir Ibrahim

Managing Attorney

During his 18-year career as an attorney and technology entrepreneur, Sabir has advised clients ranging from pre-seed startups to Fortune 50 companies on a variety of issues within the intersection of law and technology. He is a former associate at the law firm of Greenberg Traurig, a former corporate counsel at Amazon, and a former senior counsel at Roku. He also founded and managed an IT managed services provider that served professional services firms in California, Oregon, and Texas.

Sabir is also co-founder of Chinstrap Community, a free resource center on commercial open source software (COSS) for entrepreneurs, investors, developers, attorneys, and others interested in open source software entrepreneurship.

Sabir received his BSE in Computer Science from the University of Michigan College of Engineering. He received his JD from the University of Michigan Law School, where he was an article editor of the Michigan Telecommunications & Technology Law Review.

Sabir is licensed to practice in California and before the United States Patent & Trademark Office (USPTO). He is a former Certified Information Privacy Professional (CIPP/US).
