In the internet age, copyright infringement lawsuits regularly generate headlines and concentrate public attention on the speed at which technology races ahead of longstanding legal precedent. In 2005, at the crest of Web 1.0, the Association of American Publishers and the U.S.-based Authors Guild sued search giant Google for scanning entire library collections of books and other publications without permission. The plaintiffs eventually lost the so-called Google Books case, with a federal judge ruling in 2013 that the scanning was a “transformative use” and not infringement.

The web in 2024 is nothing like it was two decades ago, and although the plaintiffs haven’t changed since the Google Books case, today’s defendants are fresh entrants on the tech scene: artificial intelligence companies.

In November 2022, OpenAI launched ChatGPT, a free-to-the-public “generative AI” tool that answers questions in human language. In response, Google, Apple, Meta (Facebook), and Amazon have all accelerated development of competing AI programs. “Talking to the computer is going to be a big deal,” OpenAI CEO Sam Altman predicted on Twitter.

Also predictable was that copyright holders of books, artwork, and music would charge OpenAI with copyright infringement. Indeed, aggrieved creators have lined up to lawyer up. The Authors Guild filed a class action lawsuit. Best-known among the current cases is the New York Times suit brought against OpenAI in December 2023.

Generative AI systems like ChatGPT are built on large language models (LLMs), which “learn” to create text, and related models that generate images, after being “trained” on enormous amounts of existing works converted to data. The suits allege that chatbots, robots, and learning machines got their educations from commercially produced works without permission or compensation.

With large language models, the emphasis truly is on “large.” The BBC has reported that the training sources for ChatGPT amounted to 570 GB of data, or some 300 billion words. Beyond books, journals, and newspapers, AI training datasets draw on a vast online trove of digitized library and museum collections, public documents like court proceedings and legislation, and Wikipedia entries and social media posts.

But long-lived media conglomerates with roots deep in analog soil are a notoriously dwindling number. And the dominant media actors today are, literally, everyone else: you and me, the creators of so-called user-generated content (UGC). The LLMs’ consumption of individual users’ data means that, in effect, all of us have become stakeholders in them. In the new media ecosystem, AI and UGC interact in a symbiotic cycle of information and transformation. (Misinformation and disinformation unfortunately also figure in this synthesis.)

Copyright, meanwhile, automatically attaches to each “fixed” work, whether a photograph taken on a smartphone or a dance posted on TikTok. Billions of such works are now created every day, “intellectual property” technically owned either by an individual or an organization.

With the advent of LLMs, it is time to consider copyright additionally as a public good. As such, compensation for the use of works to train AI models should go not only to profit-minded businesses but also to the public.

Like historic monuments, TikTok dances and Instagram posts constitute the cultural capital that a community or nation in 2024 draws on for inspiration. Like natural resources, social media output derives value in the aggregate.

In New Zealand, government, museums, and creative industry representatives have acknowledged a responsibility to respect Māori taonga (treasures) when used in any AI system, and to consider any effects of such use on Māori tikanga (culture). In Italy, the Cultural Heritage Code requires authorization and payment for digital uses of national treasures, including art by Leonardo da Vinci. More governments, including the U.S. Congress, should likewise move with urgency to assert a national interest in communal creative expression. OpenAI, along with Meta, Apple, Google, Amazon, and others who stand to benefit, must recognize the debt owed to the people for the data that fuels their AI products.

Fees collected need not be burdensome to developers. And they could certainly be useful. Perhaps a portion can be used to underwrite essential public education in digital literacy and civil discourse online.

Christopher Kenneally is a publishing industry analyst and marketing consultant. A version of this story previously ran in the 2024 PW Frankfurt Digital Supplement.