A week doesn’t go by these days without AI finding new ways to amaze us and alarm us, sometimes both at once. AI has actually been around a long time, and it does any number of things so useful that we’ve come to take them for granted. We all gripe about autocorrect and predictive text, for example, but we all use them.
But when generative AI burst onto the scene a few months ago, the shock was on a different level. Generative AI can now produce coherent text and convincing images from a simple one-sentence prompt. It’s even creating realistic videos. What next?
There’s no denying that AI raises a host of issues. But for this column, I’ll focus on just one that represents a serious threat to authors and publishers: the creation of fake books. For the publishing business, concerns about a sudden tsunami of fake books clogging the marketplace are very real. AI today is capable of credibly emulating authors. But even awful fake books can compete with real books from real publishers.
The culprits are the large language models, or LLMs, that have been trained via breathtakingly large-scale crawling of online content, including such gems as Common Crawl (six billion web pages) and Books3 (190,000 e-books, many of them allegedly pirated). These LLMs are not giant databases that simply regurgitate actual content; instead, they are prediction engines that use all the content they’ve been trained on to generate new content.
Despite origins that some consider sordid, LLMs are proving to be very useful in many ways. Writers, students, scholars, researchers, and businesspeople use them every day to streamline their work. But how do we rein in irresponsible behavior without losing all that undeniable utility?
Two measures top the list. First, we must develop a way to document authentic content, including whether that content has been generated partially or entirely by AI. Second, we must develop a way to identify content whose creator wants to prohibit its use for training an LLM. Most importantly, these mechanisms must be tamper-proof, or at least tamper-evident.
I believe the combination of three standards offers a potential solution. I’ve written before about C2PA, the Coalition for Content Provenance and Authenticity. Its technical standard embeds provenance information into media assets, including text, images, and video. It’s being widely adopted, not just by media organizations but recently by OpenAI, Meta, and Google as well.
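To make that concrete, here is a rough sketch of the kind of information a C2PA manifest carries, shown as a Python dict. This is illustrative only: the layout is simplified, the AI-related assertion is hypothetical, and a real manifest follows the normative C2PA schema, is serialized as CBOR, and is cryptographically signed.

```python
# Illustrative only: a simplified view of the kind of provenance a
# C2PA manifest records. A real manifest follows the C2PA schema,
# is serialized as CBOR, and is signed with X.509 credentials.
manifest = {
    "claim_generator": "ExamplePublisherApp/1.0",  # software making the claim
    "title": "My Novel, Chapter 1",
    "assertions": [
        # How the content came to be ("c2pa.actions" is a real assertion label)
        {"label": "c2pa.actions",
         "data": {"actions": [{"action": "c2pa.created"}]}},
        # Whether AI produced any of it (field names here are hypothetical)
        {"label": "example.ai_generated", "data": {"portion": "none"}},
    ],
    "signature": "<signature over the claim; tampering invalidates it>",
}
```

The key design point is that the provenance travels with the asset, and because the claim is signed, tampering with it is detectable.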
Less well-known but extremely important is a W3C standard called Verifiable Credentials. It allows creators and rightsholders to properly attribute content, either by embedding identity information in the media file or by binding the credentials externally, outside of the asset itself.
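As a sketch, here is roughly what a minimal Verifiable Credential asserting authorship could look like, written as a Python dict mirroring the W3C data model’s JSON. The identifiers and the work-related fields inside credentialSubject are hypothetical, and the cryptographic proof is elided.

```python
# A minimal credential in the shape of the W3C Verifiable Credentials
# data model (normally expressed as JSON-LD). The DIDs and the fields
# describing the work are hypothetical placeholders.
credential = {
    "@context": ["https://www.w3.org/2018/credentials/v1"],
    "type": ["VerifiableCredential"],
    "issuer": "did:example:publisher-123",     # who vouches for the claim
    "issuanceDate": "2024-01-15T00:00:00Z",
    "credentialSubject": {
        "id": "did:example:author-456",        # who the claim is about
        "role": "author",                      # hypothetical attribution field
        "work": "urn:isbn:0000000000000",      # placeholder identifier
    },
    "proof": {"type": "DataIntegrityProof"},   # signature details elided
}
```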
The glue that promises to hold it all together is the International Standard Content Code (ISCC), which makes it possible to associate product metadata, provenance information such as C2PA manifests, and Verifiable Credentials with the ISCC itself, even when that metadata has been removed from the file.
The ISCC is not metadata in the conventional sense, and it is not embedded in an asset. Rather, it’s generated from the asset, including, when present, its C2PA information, certificates, and credentials. The ISCC is composed of four components that describe the content at different layers, making it possible to assess metadata similarity, content similarity, data similarity, and data integrity between any two ISCCs.
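Here is a loose sketch of that layered idea, with toy hashes standing in for the ISCC’s actual similarity-preserving algorithms. The four units are named Meta-, Content-, Data-, and Instance-Code in the standard (ISO 24138); everything else below is simplified for illustration.

```python
import hashlib

def toy_code(data: bytes, n_bits: int = 64) -> int:
    # Toy stand-in: first n_bits of a SHA-256 digest. The real ISCC uses
    # similarity-preserving hashes, so that similar inputs yield codes
    # with a small bit distance; SHA-256 does NOT have that property.
    return int.from_bytes(hashlib.sha256(data).digest()[: n_bits // 8], "big")

def toy_iscc(metadata: str, text: str, raw: bytes) -> dict:
    # Four layers mirroring the ISCC's Meta-, Content-, Data-, and
    # Instance-Code units; the algorithms here are simplified.
    return {
        "meta": toy_code(metadata.encode()),          # descriptive metadata
        "content": toy_code(text.encode()),           # extracted text content
        "data": toy_code(raw),                        # encoded bitstream
        "instance": hashlib.sha256(raw).hexdigest(),  # exact-copy checksum
    }

def bit_distance(a: int, b: int) -> int:
    # Hamming distance between two codes: 0 means identical; with real
    # similarity-preserving codes, small distances mean similar content.
    return bin(a ^ b).count("1")
```

Roughly speaking, with real ISCC codes a small bit distance on the content component flags near-duplicate content even when two files differ byte for byte, while the instance component matches only exact copies.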
Thus, a public declaration of an ISCC can enable the persistent binding of metadata, rights, attribution, and other information (such as “do not use to train AI”) to the actual digital asset. And by generating the ISCC on their end, AI providers can look up the associated declaration, including proper attribution, and respect the requirements set out by the legitimate rightsholders; a rough sketch of that check follows below. All of this helps responsible players to play responsibly, and irresponsible players to be found out. For authors and publishers concerned about the rise of AI fakes, this means there’s a workable solution on the horizon. And the pieces are coming together rapidly.
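As promised, here is a hypothetical sketch of the check an AI provider might run before ingesting an asset. The registry, its field names, and the ai_training flag are all assumptions for illustration, not an actual registry API, and only the exact-copy layer from the sketch above is used here.

```python
import hashlib

def instance_code(asset: bytes) -> str:
    # Exact-copy layer only (cf. the toy_iscc sketch above); a real
    # check would also consult the similarity-preserving components.
    return hashlib.sha256(asset).hexdigest()

def may_train_on(asset: bytes, registry: dict) -> bool:
    # Derive the code from the asset itself, look up any public
    # declaration, and honor an "ai_training: notAllowed" prohibition.
    declaration = registry.get(instance_code(asset))
    if declaration is None:
        return True                   # nothing declared for this asset
    return declaration.get("ai_training") != "notAllowed"

# A rightsholder's hypothetical declaration, keyed by the asset-derived code
asset = b"...full text of the e-book..."
registry = {
    instance_code(asset): {
        "rightsholder": "did:example:publisher-123",
        "ai_training": "notAllowed",  # the "do not train" signal
    }
}
assert may_train_on(asset, registry) is False
```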
Bill Kasdorf is principal at Kasdorf & Associates, LLC, and a founding partner of Publishing Technology Partners.