Dear Peter,
I’m so pleased that we are in alignment about the importance of preserving fair uses—including TDM and AI use and training—in the scholarly and research contexts that are the subject of our license agreements.
Your clarifications are most helpful. Specifically, you're raising issues outside the context of the eResource license agreements libraries sign (i.e., the focus of this listserv). You are instead interested in the commercial/consumer market, in which authors' open access materials could be scraped and used to train commercially available third-party generative AI. This is indeed a concern for authors (even scholarly authors) for a variety of reasons. I'll share only a bit of what is going on in the public sphere regarding those issues, as it may help provide assurances or further your conversations with your authors. Apologies for only lightly touching on this non-eResources topic, though. I have to prioritize my time somehow, and I'm sorry!
As a preliminary matter, even the training of commercial/consumer tools using OA or other publicly available data is likely to be fair use. Training (the use of inputs to improve and teach the tool) is transformative enough that most legal scholars believe training in any context (even commercial) is not unlawful. I would check out Matt Sag's Copyright Safety for Generative AI, and the submissions to the Copyright Office by Pam Samuelson et al., Authors Alliance, and Project Lend, to understand why training is transformative under Fair Use Factor 1 and does not communicate works to the public under Factor 4, making it likely to be fair regardless of the non-commercial or commercial context of the training. (Note: In the UC Berkeley Library submission to the Copyright Office, we address the fairness of training only for scholarly and educational uses; I am, of course, partial to our submission…).
What are the implications of all this? Simply that the training of third-party generative AI is not a copyright infringement issue. That doesn't mean it shouldn't be regulated for other reasons; it just isn't copyright infringement. But while training generative AI isn't (or shouldn't be) copyright infringement, there could be potential infringement in the generative AI outputs. So let's talk about what might be done about those outputs.
Now, most generative AI outputs are not going to be substantially similar to, and thus not potentially infringing of, the copyrighted inputs (again, read the Matt Sag article to understand why). But what about those rare cases in which a gen AI output could be substantially similar to a copyrighted input? How could authors be protected from infringement? Well, in the European Union, authors now have a right to opt out of allowing their works to be used for AI training outside the context of scientific research by research institutions or cultural heritage organizations. As I explained the other day, the Parliament did not allow opt-outs for scientific research by research institutions (i.e., Article 3 text and data mining and AI), but outside of those contexts, copyright holders can indeed opt out.
The Copyright Office in the U.S. is similarly studying what to do here, and it could perhaps adopt an opt-out regulatory approach like the European Union's. We're agnostic about opt-outs in the commercial context; we just want to ensure that, as in the EU, no opt-outs are granted in the research context. Other approaches the Copyright Office is considering are "compulsory licensing" (similar to the way radio licensing works) and "extended collective licensing" (ECL) (similar to the way ASCAP works). Many scholars believe that compulsory licensing and ECL simply won't work for generative AI (I'd really encourage you to read Pam Samuelson et al.'s and Authors Alliance's explanations of why); others favor them as an approach (see Martin Senftleben's paper AI Act and Author Remuneration - A Model for Other Regions).
We'll likely know more from the Copyright Office in a year about how it wants to regulate the commercial gen AI space. In the meantime, though, authors could be protected by the terms of use of the websites or platforms on which their works appear (i.e., either institutional repositories or publisher online platforms). That's because those terms of use could contain a contractual restriction that overrides the right to ingest the content for use in generative AI tools. If an entity like OpenAI were to download and train AI using content from a repository or publisher platform in contravention of those terms of use, it would be a breach of contract, but not necessarily a copyright infringement (for the reasons I discuss above). The publisher or site operator would then have recourse against the commercial entity, and by proxy, this would protect the authors. I would hope, however, that the repositories' or publisher platforms' terms of use do not also override fair use.
That’s about all I have to say on the issue of the “in-the-wild” generative AI concerns for authors, because it’s beyond the scope of the eResource licensing concerns we have. Back to work!
Best,
Rachael
From: "Potter, Peter" <[log in to unmask]>
Date: Thu, 21 Mar 2024 17:28:12 +0000
Hi Rachael.
Wow! Thanks for responding so thoroughly to my questions. I’ll bet that others will agree with me in saying that you’ve helped to bring clarity to any number of concerns that come up again and again in discussions about AI and copyright.
I won't go into great detail here because I'm sure we actually agree in principle on the major points you've made. I simply want to clarify what I was trying to say (unartfully, it seems) in my last message:
My point about authors being protected by copyright law from having their works exploited (i.e., used) for TDM and AI was not to question the principle of fair use, or even to question the University of California's position on TDM/AI. As far as I'm concerned, researchers at the University of California (and anywhere else, for that matter) are entirely within their rights to use TDM/AI to the extent that it falls within the bounds of fair use. My concern is with those who might use TDM/AI (primarily generative AI) in ways that go beyond fair use and therefore violate copyright.
So, you are absolutely correct when you say, a few paragraphs down in your response, that what I actually want to avoid is others profiting from using licensed products with generative AI to make new works. But, again, my concern here is definitely not with institutions like the University of California licensing content for fair scholarly and research uses. My real concern is with commercial AI systems that are able to ingest the increasingly massive body of open access HSS scholarship on the web to generate new works, without crediting the original authors, much less remunerating them in instances where remuneration is required. In short, what recourse do authors have should they suspect that their intellectual property is being used in ways that violate the open license?
Thanks, again.
Peter Potter