The rapid adoption of artificial intelligence (AI) has sparked a pressing legal debate over whether copyrighted materials may be used, without permission from the copyright owners, to train generative AI systems, particularly large language models (LLMs). Currently, more than 25 suits alleging copyright infringement by LLMs are pending. LLMs are designed to assimilate vast amounts of text, enabling them to generate new content such as books, artwork, music, or computer code. Recently, a federal court in California issued two separate decisions that established new boundaries for what constitutes “fair use” of copyrighted material to train LLMs. (Fair use is a defense to copyright infringement.) In both cases, the court held that using copyrighted works to train LLMs is a fair use and thus not copyright infringement, at least in some circumstances.
In Bartz v. Anthropic, a group of authors sued Anthropic, asserting that Anthropic had infringed their copyrights by using both pirated digital books and scans of lawfully purchased texts to train an LLM. The court was asked to determine whether the use of these materials to train LLMs and the storage of these materials in digital libraries each qualified as fair use. In Kadrey v. Meta Platforms, a group of authors alleged that Meta used unauthorized copies of their books, obtained from freely downloadable “shadow libraries,” to train its LLM known as Llama. The authors claimed that Meta’s unauthorized use constituted copyright infringement. Meta responded that the materials were used solely for training purposes and were neither redistributed nor used to facilitate unlawful activity by others. Meta also presented evidence that its use caused no measurable harm to the market for the copyrighted books.
In both cases, the courts agreed that feeding copyrighted texts into an AI model is transformative and does not directly compete with the original books. Consequently, using copyrighted materials to train LLMs qualified as fair use and thus was not copyright infringement. In Kadrey, the court granted summary judgment in favor of Meta after the authors failed to prove any substantial harm to the market for the copyrighted materials. The court, however, has yet to opine on whether Meta’s use of allegedly pirated texts infringed the copyrights. In Bartz, the court similarly held that training Anthropic’s LLM was transformative and therefore fair use, relying on the fact that the alleged infringement was limited to the training of the LLM and was unrelated to the LLM’s output. The court ruled, however, that the retention of pirated texts presented a separate issue and accordingly refused to dismiss the infringement claim based on that retention. In Kadrey, the court likewise emphasized that storing pirated copies indefinitely is a distinct legal question and not necessarily protected by the fair use doctrine.
These initial rulings indicate that, at least in some circumstances, copyrighted materials may be used to train AI. But the judges in Kadrey and Bartz expressed different views on the underlying issues and on AI’s impact on the market. As a result, the outcomes could differ depending on the facts and on how much weight a court gives to market dilution in the fair use analysis. Further, both cases involve copyrighted books. Will the theories advanced in these opinions lead to different outcomes when the work at issue is another kind of copyrighted material, such as an artistic work or music? Only time will tell. While there may be protection for the input side of the equation, what remains open to debate is whether AI outputs, though generated from transformed data, can nonetheless infringe copyrights when the character and style of the AI-generated work are sufficiently similar to the copyrighted input. Consequently, for audio and visual works, which depend to a greater degree on visceral expression, the market impression and dilutive impact of the AI-generated work may weigh more heavily toward infringement.
In view of these decisions, what should you do? If you are a publisher or copyright creator, have you taken steps to prevent your work from being used in ways you do not want or intend? If you are a high-tech company creating the next generation of AI tools, do you have the governance controls in place to determine whether your developers’ use of shadow library data has increased your risk?
While these opinions are a positive step toward creating some predictability in the market for both copyright creators and high-tech companies, the issues are far from settled. To manage risk to your business and maximize the protection of your investments, please contact the authors or any member of the Intellectual Property Group or Artificial Intelligence Team at McCarter & English.
*James Vartanian, a law clerk at McCarter & English not yet admitted to the bar, contributed to this alert.