From: Rachael G Samberg <[log in to unmask]>
Date: Fri, 22 Mar 2024 07:44:07 -0700

Dear Peter,


I’m so pleased that we are in alignment about the importance of preserving fair uses—including TDM and AI use and training—in the scholarly and research contexts that are the subject of our license agreements.


Your clarifications are most helpful. Specifically, you’re raising issues outside the context of the eResource license agreements libraries sign (i.e., the focus of this listserv). You are instead interested in the commercial / consumer market, in which authors’ open access materials could be scraped and used to train commercially available third-party generative AI. This is indeed a concern for authors (even scholarly authors) for a variety of reasons. I’ll share only a bit of what is going on in the public sphere regarding those issues, as it may help provide assurances or further your conversations with your authors. Apologies for only lightly touching on this non-eResources topic, though; I have to prioritize my time somehow, and I’m sorry!


As a preliminary matter, even the training of the commercial / consumer tools using OA or other publicly available data is likely to be fair use. Training (the use of inputs to improve and teach the tool) is transformative enough that most legal scholars believe that training in any context (even commercial) is not unlawful. I would check out Matt Sag’s Copyright Safety for Generative AI, as well as the submissions to the Copyright Office by Pam Samuelson et al., Authors Alliance, and Project Lend, to better understand why training is sufficiently transformative under Fair Use Factor 1, and sufficiently removed from communicating works to the public under Factor 4, that it is likely to be fair regardless of the non-commercial or commercial context of the training. (Note: In the UC Berkeley Library submission to the Copyright Office, we address the fairness of training only for scholarly and educational uses; I am, of course, partial to our submission…)


What are the implications of all this? Simply that the training of third-party generative AI is not a copyright infringement issue. That doesn’t mean it shouldn’t be regulated for other reasons; it’s just not copyright infringement. But while training generative AI isn’t (or shouldn’t be) copyright infringement, there could be potential infringement in generative AI outputs. So let’s talk about what might be done about those outputs.


Now, most generative AI outputs are not going to be substantially similar to, and thus not potentially infringing of, the copyrighted inputs (again, read the Matt Sag article to understand why). But what about those rare cases in which a gen AI output could be substantially similar to the copyrighted input? How could authors be protected from infringement? Well, in the European Union, authors now have a right to opt out of allowing their works to be used for AI training outside the context of scientific research by research institutions or cultural heritage organizations. As I explained the other day, the Parliament did not allow opt-outs for scientific research by research institutions (i.e., Article 3 text and data mining and AI), but outside of those contexts, copyright holders can indeed opt out.


The Copyright Office in the U.S. is similarly studying what to do here, and perhaps it will take an opt-out regulatory approach similar to the European Union’s. We’re agnostic about opt-outs in the commercial context; we just want to ensure that, as in the EU, no opt-outs are granted in the research context. Other approaches the Copyright Office is considering are “compulsory licensing” (similar to the way radio licensing works) and “extended collective licensing” (ECL) (similar to the way ASCAP works). Many scholars believe that compulsory licensing and ECL simply won’t work for generative AI (I’d really encourage you to read Pam Samuelson et al.’s and Authors Alliance’s explanations of why); others favor such licensing (see Martin Senftleben’s paper, AI Act and Author Remuneration - A Model for Other Regions).


We’ll likely know more from the Copyright Office in a year about how it wants to regulate the commercial gen AI space. In the meantime, though, authors could be protected by the terms of use of the websites or platforms on which their works appear (i.e., institutional repositories or publisher online platforms). That’s because those terms of use can contain contractual restrictions that prohibit ingesting the content for use in generative AI tools. If an entity like OpenAI were to download and train AI using content from a repository or publisher platform in contravention of those terms of use, that would be a breach of contract, though not necessarily a copyright infringement (for the reasons I discuss above). The publisher or site operator would then have recourse against the commercial entity, and, by proxy, this would protect the authors. I would hope, however, that the repositories’ or publisher platforms’ terms of use do not also override fair use.


That’s about all I have to say on the issue of the “in-the-wild” generative AI concerns for authors, because it’s beyond the scope of the eResource licensing concerns we have. Back to work!


Best,

Rachael



On Thu, Mar 21, 2024 at 5:14 PM LIBLICENSE <[log in to unmask]> wrote:
From: "Potter, Peter" <[log in to unmask]>
Date: Thu, 21 Mar 2024 17:28:12 +0000

Hi Rachael.

 

Wow! Thanks for responding so thoroughly to my questions. I’ll bet that others will agree with me in saying that you’ve helped to bring clarity to any number of concerns that come up again and again in discussions about AI and copyright.

 

I won’t go into great detail here because I’m sure we actually agree in principle on the major points you’ve made. I simply want to clarify what I was trying to say (unartfully, it seems) in my last message:

 

My point about authors being protected by copyright law from having their works exploited (i.e., used) for TDM and AI was not to question the principle of fair use—or even to question the University of California’s position on TDM/AI. As far as I’m concerned, researchers at the University of California (and anywhere else, for that matter) are entirely within their rights to use TDM/AI to the extent that it falls within the bounds of fair use. My concern is with those who might use TDM/AI (primarily generative AI) in ways that go beyond fair use and therefore violate copyright.

 

So, you are absolutely correct when you say, a few paragraphs down in your response, that what I actually want to avoid is others profiting from using licensed products with generative AI to make new works. But, again, my concern here is definitely not with institutions like the University of California licensing content for fair scholarly and research uses. My real concern is with commercial AI systems that are able to ingest the increasingly massive body of open access HSS scholarship on the web to generate new works—without crediting the original authors, much less remunerating them in instances where remuneration is required. In short, what recourse do authors have should they suspect that their intellectual property is being used in ways that violate the open license?

 

Thanks, again.

Peter Potter

 


From: Rachael G Samberg <[log in to unmask]>

Date: Wed, 20 Mar 2024 15:34:50 -0700

Dear Peter,


Many thanks for reading our blog post. I write now to address what seem to be misunderstandings in your response, and I hope I manage to address them in a way that illuminates more alignment than you think. Because legal concepts can be tricky, and because the readership for this listserv is mostly non-lawyers, I will also try to parse some concepts and terms that got confused in your reply. 

 

First, though, I want to make clear that I am writing this reply in my personal capacity, and not on behalf of the University of California. With that said:

 

You refer to the right of authors not to have their copyrighted works “exploited” by TDM and AI usage. This is not correct. Setting aside the loaded use of the term “exploited” (I’ll just presume you meant “used”!), the fair use of copyrighted works is expressly authorized by 17 USC § 107. And that fair use provision does not afford copyright owners any right to opt out of allowing other people to use their works fairly.

 

Of course, this is for good reason: If content creators were able to opt out, the provision for fair use would be eviscerated, and little content would be available to build upon for the advancement of science and the useful arts. In all events, fair use exists as an affirmative statutory right for authors and readers alike, so that anyone can use copyrighted works fairly—regardless of whether any individual creator wanted them to be used or not.

 

In turn, your message suggests that scholarship published with a CC-BY-NC-ND license should be protected against “derivative uses” like TDM and AI. I’ll explain why this isn’t so.

 

  • I’ll start with the misplaced reference to “derivative uses.” A use is a use, and uses that are fair uses are statutorily protected. Assuming you instead meant “derivative work,” it’s important to understand that conducting TDM, and using AI to conduct that TDM, does not create a derivative work. A “derivative work” refers specifically to the creation of a new work that incorporates a previously existing work. TDM, and using non-generative AI to conduct TDM, provides insights and understandings about existing works through the creation of derived data, results, metadata, and analysis; these are not “derivative works.” In any event, someone other than the copyright owner is permitted to create a derivative work if its creation falls under an exception like fair use, as TDM research does. And the creation of derivative works is typically already expressly precluded by content license agreements anyway.
  • Turning next to the stated desire to safeguard “CC-BY-NC-ND” works against any TDM and AI uses (presumably because you think that such a CC license indicates authorial intent not to have the work be used), one should understand that the affirmative application of any Creative Commons license—or no application of a license at all (i.e., “all rights reserved”)—has no bearing at all on whether fair uses are permitted to be made of the work. A Creative Commons license comes into play only when one is going beyond statutory exemptions like fair use. See https://creativecommons.org/licenses/by/4.0/legalcode.en: “Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations [e.g., 17 USC § 107] apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.” (See also the explanatory FAQ, confirming the same.) As such: Authors cannot use CC licenses to control TDM and AI fair uses, and conversely, scholars needn’t worry about whether a work has a Creative Commons license so long as they are making a fair use.

If I may be permitted to reflect on what I think you actually intended to express: It’s that you wish to prohibit any use of generative AI, because the outputs of generative AI might exceed fair use (as no one disputes). I’ll explain in a bit why banning all generative AI use and training merely to prevent certain outputs is overreaching. But in the meantime, I want to focus on the implications of that “no generative AI” sentiment for the rest of TDM and non-generative AI research uses; that is, to underscore how TDM and non-generative AI simply don’t come into play for what seems to have concerned you.

  • TDM: TDM research relies on automated methodologies to surface “latent” trends or information across large volumes of data. Every single court case that has addressed TDM research in the contexts at issue here has found it to be fair use. There is no “profiting” from conducting fair use TDM in the manner at issue in our licenses.
  • Non-Generative AI: TDM research methodologies can, but do not necessarily, rely on AI systems to extract this information. Sometimes algorithms can simply be used to detect word proximity or frequency, or to conduct sentiment analysis. In other instances, an AI model might be needed as part of the process. For instance, I’ve been working with a professor for several years as he studies trends in literature and film. Right now we have a Mellon grant project for him to study such matters as the representation of guns in cinema. In order for him to assess how common guns are, and the types of circumstances in which guns appear, he has to find instances of guns in thousands of hours of movie footage. To do that, he needs an algorithm to search for and identify those guns. But first he has to show an AI tool what a gun looks like, by showing it some stills of guns from a small corpus of films, so that the AI tool can learn how to identify a gun before it goes looking for other instances of guns in a much larger body of works. This is a classification technique called discriminative modeling, and it forms part of his TDM research. It involves AI, but not generative AI, as the AI is not creating new images or footage of guns (see the sketch immediately below this list). And once again, scholars have lawfully relied on this kind of non-generative AI training within TDM for years under the fair use doctrine.
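
For the non-lawyers (and non-programmers) reading along, here is a minimal, purely illustrative sketch in Python of that kind of discriminative workflow: train a classifier on a small labeled set of stills, then apply it across a much larger corpus. The file paths, labels, and feature choices are hypothetical; this is not the professor’s actual pipeline.

    # Illustrative sketch only: a discriminative (non-generative) classifier
    # learns to label existing stills; it never creates new images.
    import numpy as np
    from skimage.feature import hog
    from skimage.io import imread
    from skimage.transform import resize
    from sklearn.linear_model import LogisticRegression

    def features(path):
        # Reduce each film still to a fixed-length descriptor (HOG features).
        img = resize(imread(path, as_gray=True), (128, 128))
        return hog(img, pixels_per_cell=(16, 16))

    # Small labeled training corpus (hypothetical paths):
    # 1 = the still contains a gun, 0 = it does not.
    train_paths = ["stills/gun_001.png", "stills/no_gun_001.png"]
    train_labels = [1, 0]

    X = np.array([features(p) for p in train_paths])
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

    # Apply the trained model across the much larger body of works.
    for p in ["corpus/frame_000001.png"]:  # hypothetical corpus
        if clf.predict([features(p)])[0] == 1:
            print(f"possible gun detected in {p}")

Nothing in that workflow generates new images or footage; the model only assigns labels to existing works, which is precisely why it raises none of the output concerns associated with generative AI.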

So with this understanding, perhaps we can refine what you want: Perhaps what you actually want to avoid is scholars profiting from using your licensed products with generative AI to make new works. No problem: We’re not licensing your content for scholars to do that anyway. We’re licensing your content for fair scholarly and research uses. Any acts beyond fair use, or beyond whatever additional rights are carved out in the agreement, would violate the license agreement anyway.

 

Okay, let’s refine the wishlist further: Maybe you don’t want scholars to use generative AI in a way that releases trained AI to the public. No problem again: Our adaptable licensing language can preclude that. Indeed, with language that we have already successfully secured with publishers, we impose commercially reasonable security measures and prohibit the public release or exchange of any generative AI tool that has been trained, or any data from such a generative AI tool. Certainly more aggressive licensing language could preclude the training of a third-party generative AI tool altogether—though there would be no need for such measures, as long as the license agreement prohibited the public release or third-party exchange of any trained tool or its data, and added further assurances of appropriate security measures.

 

To that end, I think one thing that is lost in your message is the difference between use of a generative AI tool and training of a generative AI tool. Using a generative AI tool means: You have a corpus of works, you ask the AI a question about the works, and it tells you the answer. Training AI differs in that the act of asking the AI questions, and the content you show it to answer the questions, actually helps the AI learn how to give better answers or improves the tool in some way. And this is where your message muddles the notion of “plagiarism.” The mere use of generative AI, without also training the underlying generative AI tool, has no implications for whether a publisher’s content will be (to use your word) “plagiarized”; i.e., there is no embedding of licensed content in the tool, even if public release of a tool were ever authorized.
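
Because the use/training distinction does so much work here, a toy illustration may help. Below is a minimal sketch in Python (using the PyTorch library) with entirely hypothetical names and data; it is an illustration under stated assumptions, not a description of any real generative AI system.

    # Illustrative sketch only: the difference between using a model and training it.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)   # stand-in for a generative AI tool
    x = torch.randn(1, 10)     # stand-in for content shown to the tool

    # "Use": ask a question, get an answer; the tool's weights never change.
    with torch.no_grad():
        answer = model(x)

    # "Training": the same interaction also updates the weights, so the
    # content shown to the tool becomes part of what improved it.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss = nn.CrossEntropyLoss()(model(x), torch.tensor([0]))
    loss.backward()
    optimizer.step()

Only the second path changes the tool; the first leaves it exactly as it was, with no trace of the content it was shown.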

 

The availability of copyrighted works for use in TDM (and AI reliance for TDM) in scientific research is already a reality that authors and publishers face in the European Union. The European Parliament considered whether to let copyright owners opt out of having their works used for TDM or AI, and decided unequivocally in the scientific research context not to grant any such right, and further not to allow contracts to take away these rights. See Article 3 of the EU’s Directive on Copyright in the Digital Single Market (preserving the right of scholars within research organizations and cultural heritage institutions to conduct TDM for scientific research); Article 7 (proscribing publishers from invalidating this exception through license agreements); and the new AI regulations, which affirm that publishers cannot override Article 3 / scientific research AI training rights. Publishers must preserve fair use-equivalent research exceptions for TDM and AI within the EU. Through the licensing protections we’ve outlined, they can do so in the United States, too.

 

I hope this response furthers the understanding of how licenses will be used effectively to safeguard publishers’ (and authors’) financial interests, while also supporting scholarly research in accordance with statutory fair use rights.

 

Best,

Rachael


 

On Mon, Mar 18, 2024 at 8:04 PM LIBLICENSE <[log in to unmask]> wrote:

From: "Potter, Peter" <[log in to unmask]>

Date: Sun, 17 Mar 2024 23:57:18 +0000

Thanks to Rachael and the team at UC’s OSC for sharing this document. I came away from it with a much better understanding of the concerns of libraries as they try to account for TDM and AI when negotiating licenses for electronic resources.

 

It does, however, raise a question for me about the other side of the fair use argument—namely, the rights of authors to not have their copyrighted works exploited by TDM and AI usage. This is especially pertinent in the humanities and social sciences, where much of the OA scholarship is published with a CC BY-NC-ND license because of authors’ (and publishers’) concerns about others profiting from derivative use of a work. Increasingly, I am hearing from authors who want to know the extent to which the “no derivatives” part of a CC license protects them against TDM and AI usage, specifically generative outputs. I’m curious to know what folks think about the fair use question when it comes to authors specifically.

 

The UC OSC document acknowledges publishers’ concerns about misuse of licensed materials, but then it seems to brush those concerns aside on the grounds that publishers “already can—and do—impose robust and effective contractual restrictions” on such misuses. But the document also admits that “overall fair use of generative AI outputs cannot always be predicted in advance,” which of course is exactly what authors are concerned about: usage by AI that is unpredictable and perhaps impossible to keep track of because of the increasingly sophisticated nature of AI. How does one prove plagiarism when AI systems have ingested and learned from so much content that it’s impossible to tease out what came from each individual source?

 

 

From: Rachael G Samberg <[log in to unmask]>

Date: Wed, 13 Mar 2024 05:38:00 -0700

Fair use rights to conduct text mining & use artificial intelligence tools are essential for UC research & teaching. Learn how University of California libraries negotiate to preserve these rights in electronic resource agreements: https://osc.universityofcalifornia.edu/2024/03/fair-use-tdm-ai-restrictive-agreements/

 

Best,

Rachael G. Samberg

Timothy Vollmer

Samantha Teremi