ABSTRACT: Copyright and computer science continue to intersect and clash, but they can coexist. The advent of new technologies such as digitization of visual and aural creations, sharing technologies, search engines, social media offerings, and more, challenges copyright-based industries and reopens questions about the reach of copyright law. Breakthroughs in artificial intelligence research, especially Large Language Models that leverage copyrighted material as part of training, are the latest examples of the ongoing tension between copyright and computer science. The exuberance, rush-to-market, and edge problem cases created by a few misguided companies now raise challenges to core legal doctrines and may shift Open Internet practices for the worse. That result does not have to be, and should not be, the outcome.
This Article shows that, contrary to some scholars' views, fair use law does not bless all the ways that someone can gain access to copyrighted material even when the purpose is fair use. Nonetheless, the scientific need for more data to advance AI research means access to large book corpora and the Open Internet is vital for the future of that research. The copyright industry claims, however, that almost all uses of copyrighted material must be compensated, even for non-expressive uses. This Article's solution accepts that both sides need to change. This solution forces the computer science world to discipline its behaviors and, in some cases, pay for copyrighted material. It also requires the copyright industry to abandon its belief that all uses must be compensated or restricted to uses sanctioned by the copyright industry. As part of this re-balancing, this Article addresses a problem that has grown out of this clash and is undertheorized.
Legal doctrine and scholarship have not solved what happens if a company ignores website code signals such as "robots.txt" and "do not train." In addition, companies such as the New York Times now use terms of service that assert that you cannot use their copyrighted material to train software. Drawing on the doctrine of fair access as part of fair use, we show that the same logic indicates that such restrictive signals and terms should not be held against fair uses of copyrighted material on the Open Internet.
In short, this Article rebalances the equilibrium between copyright and computer science for the age of AI.
I. INTRODUCTION
Predictions from legal scholars that machine learning, the data machine learning needs, and copyright were headed for a clash with fair use have come to a head.1 Large Language Models (LLMs) need huge amounts of data, and that data are often protected by copyright. LLMs are the fuel behind popular, revenue-generating services from companies such as OpenAI, Anthropic, Cohere, and AI21 Labs, among others.2 The fair use analysis around non-expressive use of copyrighted material-i.e., using copyrighted material to train software-has sound logic and appears strong.3 Yet as scholars have noted, the advent of LLMs and what they can do has highlighted "uncertainties" around the doctrine such that extreme changes to the law are possible.4 Aggressive deployment of LLM-driven "generative AI" services and products built on copyrighted material, including illicit book libraries, means those possible extreme changes are now upon us. Some options are severe enough to harm AI research and, indeed, change the nature of the Open Internet,5 but that does not have to be the case. Put simply, the current lawsuits and proposed laws over applications, such as OpenAI's ChatGPT, threaten basic research in AI and miss important points about the technology behind the offerings. Yet they also raise strong questions about the law, ethics, and future of AI. As such, in this Article we look at the science and business practices around AI research and offer a way to maintain the balance between copyright and computer science.
OpenAI's ChatGPT has astonished the world and seems to be a leap towards artificial general intelligence (AGI) machines6-but it is not.7 Even OpenAI's CEO, Sam Altman, has admitted that the term is "ridiculous and meaningless."8 Indeed, ChatGPT is only one offering of software based on research in natural language processing (NLP) and specifically LLMs.9 LLMs leverage patterns in data to predict what might come next in a sentence, paragraph, or essay, sometimes with incredible fluidity and accuracy.10 But these offerings are not magical AI.11 They are services and products. Nonetheless, claims-or perhaps faith-that researchers are pursuing AGI matter because people can forget the difference between how academic research makes progress on AI research and the reality of a non-academic group turning research into products and services.
Although LLMs are not AGI, the quest for it, or what was originally called "genuine intelligence,"12 animates the motives and methods behind LLMs and other academic AI research. Understanding those motives and methods helps one understand the legal and ethical implications of AI research in general, as compared to commercial offerings based on AI research. In simplest terms, calls to limit or regulate AI research-such as charging for copyrighted material used in training software or only allowing data scraping when the data are not used for training-are an over-correction that would harm the future AI research that President Biden's recent executive order on AI calls for.13 Nonetheless, AI researchers need to understand that just because they can do something does not mean they should.14
Thus, this Article explains the theory and practice behind LLMs so people can recognize facts over fears and fictions, and it offers a plan to resolve the legal and ethical issues around the future of LLMs and AI research. Part I lays out what LLMs are and explains the technical and theoretical aspects of LLMs. It explains why vast amounts of data-such as book corpora and large crawls of the Internet-are vital for AI research. It also explains the technical aspects of LLMs, including the relationship between generalization and memorization. Once one understands that relationship, one can understand the differences between LLM research seeking to emulate language and LLM research that wants to be a commercial answer-providing product. That understanding is necessary to unravel the legal issues around LLM research and products.

Part II explains what happens when corporate researchers forget they are not in academia. Although the needs and practices of academic and commercial AI research often converge, they can diverge in some critical ways. Specifically, cultural and practical limits around academic research reduce potential harms and legal issues compared to non-academic research efforts. This difference exists because non-academic research efforts use so many resources that commercialization and effects on markets are almost inevitable.15 In short, solutions to how computer science uses copyrighted material should accommodate academic needs and manage commercial issues, but current approaches fail to appreciate the difference between the two endeavors.
Part III accepts that all AI research is under scrutiny and presents a legal and ethical guide for gathering and using data for research. This Part focuses on legal issues particular to books and to Internet data. It explains that the better the breadth of either type of copyrighted data, the better an LLM can capture a diversity of voices and diversity of knowledge. Nonetheless, this Part argues that the assumption that one can use hundreds of thousands of illegally copied and obtained books is precarious, a position contrary to that of some legal scholars.16 It adds to the literature by addressing evolving issues around expanded use of code signals-such as robots.txt, no crawl, and do not train-and restrictive terms of service. This Part shows that these changes, and the uncertain law regarding the penalties for ignoring such restrictions, not only threaten AI research but also signify a shift from an Open Internet to a more closed, permission-based one. We argue that we should not repeat the mistakes of letting the copyright industry dictate terms for computer science research, because that would create a new wave of litigation and what Professor Heller called a "Gridlock Economy."17 This Part concludes by showing how the law around platforms, intellectual property harm mitigation, and mistakes in Silicon Valley research provide a set of practices that can help protect commercial entities that use copyrighted material to train their software.

Part IV presents solutions to maintain and expand access to books and to maintain an Open Internet. For books, we argue that if courts hold that one needs a legitimate copy of a book to use it as a data source, courts should also rule that using tools to circumvent digital rights management is allowed when the end purpose is fair use. Nonetheless, asking all researchers to buy digital books or hard copies and scan them, as Google did in its Google Books Project, is inefficient. As such, we offer a radical solution informed by our analysis of the contours of fair use and the Google Books and HathiTrust cases: society should offer a secure training-book repository for research.18 Given that libraries have already worked with Google and HathiTrust and that the digital libraries are fair use under court decisions, the libraries could work with either project to offer such a service at prices designed to enhance access and provide a share to authors rather than to be a font of cash. Our solution leverages a point about the desired research: one does not have to give the data to the researcher. Instead, the repository could host the data securely and allow the researcher to train their model on the data. Rather than "shadow libraries" floating around the Internet, the data would stay in the repository. As this option cannot be mandated, Congress may need to pass a law setting up the rates and payments. Given that hurdle, we offer a simpler option based on the Audio Home Recording Act: an added fee on book sales, collected and distributed to authors, but only if the law also enshrines the use of book corpora even when they come from shadow libraries. To maintain an Open Internet, we draw on the logic of the Supreme Court's decision in Campbell v. Acuff-Rose Music, Inc.,19 that one need not ask permission when the use is fair. That logic, combined with the legality of access, should mean that, unless Internet content is behind a paywall, data scraping for software training should be deemed fair use regardless of the code signals or terms of service put on such data.
The Article then concludes.
II. THE WHY AND HOW BEHIND LLMS
Large datasets are a cornerstone not only of LLM research, but of modern AI research in general.20 Calls to ban using copyrighted materials, such as book corpora or text and images on the Web, to develop AI threaten fundamental AI research. Charging high prices for access to those copyrighted materials threatens such research, too. Understanding the roots of artificial intelligence research explains why people pursue AI research, including LLMs, and why such research needs large datasets.
The attendees of the famous Dartmouth conference of 1956 were fascinated by the idea of creating a thinking machine and convinced that such a goal was attainable. One professor, John McCarthy, coined the term "artificial intelligence."21 No one liked it, in part because the goal was to create "genuine intelligence."22 That goal raises a key, unanswered question: what is intelligence? The proposal for the first conference argued that "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."23 Although the idea seems to imply only one path, even that initial workshop listed several possible paths forward. Attendees pursued a range of theories including "Neuron Nets," which emulate the way humans create concepts; "Self-Improvement," the concept that a machine is "truly intelligent" if it learns on its own; "How Can a Computer Be Programmed to Use a Language," which investigated emulating human thought as it operates within human use of language; and "Randomness and Creativity," or the ability of a machine to imagine new things with some randomness guided by intuition so that the outcome is efficient. 24 Even now, there is no consensus on the path to AGI. Breakthroughs in one area can, however, lead people to assert, if not believe, that AGI is here. The recent changes in how machines use language are an example of such exuberance.
Language is considered a defining aspect of being human.25 LLMs are part of NLP, a core area of AI research that tries to emulate the way humans use language. If you can show that a machine can use language similar to the way humans use language, you might claim you have taken an important step forward on one of the paths to machine intelligence set out at the Dartmouth conference. The evolution of NLP methods shows how we arrived at the LLMs of today.
Because LLMs use techniques common across AI research, that understanding gives insights on the data and computational needs for much of AI research. LLMs require large amounts of training data and computational processing power to achieve state-of-the-art results.26 Those needs go to the heart of questions about how that research is done. Put differently, NLP and related LLM research need vast amounts of data, and cutting off access to such data poses threats not only to commercial LLM research but also to future AI research in general.
A. The Evolution of Natural Language Processing, a Vital Part of the Quest for AGI
Early NLP work relied on "symbolic rule-based approaches-that is, programs that were given grammatical and other linguistic rules and applied these rules to input sentences."27 That approach fits within the idea that humans use rules and variables as we speak.28 If one can capture those rules and variables, one can also program a machine to use language the way a human does. This approach seems elegant because it implies that language is produced through enumerable procedures and does not require a huge amount of data. As with other areas of AI research that followed symbolic approaches, NLP systems were, however, found to be "brittle."29 That is, regardless of how many rules about language the machine has, the machine will not be able to understand or produce certain expressions the way humans can.30 The symbolic rule-based approaches revealed that humans can often violate the formal rules of grammar, but the expression can still be comprehensible.31
To address this brittleness, NLP researchers in the late 1980s and 1990s turned to statistical approaches that capture the probabilistic relationship between words.32 The idea, as counterintuitive as it might be, is that the meaning of words, and/or the grammatical relationship between words, is irrelevant if the words follow patterns.33 This concept can be traced back to Claude Shannon's seminal work on how to define information.34 According to Shannon's theory, "the informative content of a message is directly related to the degree of uncertainty regarding its content. Predictable messages have a low information content; unexpected or surprising messages have a high information content."35 The focus shifts to what you can learn from the message.
When the message is expected, there is nothing new and so there is nothing to learn. If human writing has regularity, then it should be learnable and non-random. We capture this with the mathematical notation P(Wn|W1, W2, . . . ,Wn-1), which reads: the probability of the last word in a sequence (the nth word) is dependent on the words that precede it from the first word to the second-to-last word. Intuitively, this makes sense. When encountering the sequence of words, "the deep blue," we expect the word "sea" to be highly likely to appear, but we do not expect the word "cucumber" to be likely to appear. This equation has a name: a language model.
A language model captures the statistical probability of a sequence of words-the context (W1, W2, . . . , Wn-1)-and its successor (Wn).36 The term was first introduced in 1976, but it was not until the late 1980s that computers became powerful enough to extract large probability tables from raw text data.37 Early language models were simple.38 The simplest is called the bigram model, which probabilistically relates a word to the immediately preceding word: P(Wn|Wn-1).39 If we imagine 50,000 words in the English language (an underestimate), then we need a table with 50,000 × 50,000 numbers to store all the probabilities between pairs of words. To "learn" these probabilities, you simply fill in this table by scanning through a corpus of human-written text and counting how often one word is preceded by another word.40
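To make the table-filling idea concrete, consider the following minimal sketch, written in Python; the toy corpus and word choices are ours, offered only for illustration. It counts how often each word follows each other word, converts the counts to conditional probabilities, and then guesses a likely next word.

from collections import Counter, defaultdict

# A toy corpus; a real bigram model would scan millions of words.
corpus = "the deep blue sea meets the deep blue sky".split()

# Count how often each word follows each preceding word.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

# Turn the counts into conditional probabilities P(next word | previous word).
bigram_probs = {
    prev: {word: count / sum(counts.values()) for word, count in counts.items()}
    for prev, counts in pair_counts.items()
}

def predict_next(word):
    # Return the most probable successor of `word`, or None if the word is unseen.
    options = bigram_probs.get(word)
    return max(options, key=options.get) if options else None

print(bigram_probs["blue"])   # {'sea': 0.5, 'sky': 0.5}
print(predict_next("deep"))   # 'blue'

Scaling this same counting procedure to a 50,000-word vocabulary is what produces the 50,000 × 50,000 table described above.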
Despite being computationally expensive, this simple bigram model can be used to implement the autocomplete functionality that we have on modern smartphones. Autocomplete gives a small sampling of highly probable words for you to choose from, instead of typing each word from scratch. Nonetheless, being able to predict a word based on only the previous word has some limitations. If we only look at the previous word to predict the next one, "the deep blue" and "a sweet blue" are indistinguishable statements. As such, there is a need to look beyond the previous word, which is what happened next in NLP research.
After the development of the bigram model, more complicated models were developed, such as the trigram, which gives the probability between a word and the two words immediately preceding it: P(Wn|Wn-2, Wn-1).41 This model can capture the difference between "deep blue" and "sweet blue" to predict "sea" and "berry," respectively. This model comes at greater computational cost, now requiring 50,000 × 50,000 × 50,000 probability values. Autocomplete becomes a better guesser of words as the model goes from bigram to trigram and beyond to 4-gram and 5-gram.42 Although no one would confuse a 5-gram model with human writing, it works well enough that it is still used for many applications, such as modern-day autocomplete functions that can predict the next five words in a phrase. These models are useful, but if we are trying to reach a point where a machine can use language more like humans, the models come up short.43 In simplified terms, the models "forget" what was "said" more than five words ago. To move beyond that limit requires an ability to build models with more context, which leads us to the next step on our way to LLMs.
As a language model brings in more context, it is no longer practical or feasible to count words and build increasingly larger probability tables. The solution to this problem is to use a particular type of machine learning called artificial neural networks, which learn compact approximations of an underlying phenomenon. The amount of memory-or size of the neural network-can be traded off for accuracy. That is, as we allow the model to get larger, its approximation gets closer to the true underlying phenomenon. In the case of language modeling, the neural network approximates word probabilities instead of looking up precise values in a table.
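As a rough illustration of that trade, the sketch below is ours and uses the PyTorch library; the vocabulary size, dimensions, and word ids are arbitrary. It trains a tiny neural network to score possible next words rather than looking probabilities up in a table.

import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 64   # illustrative sizes only

# Map a previous-word id to a dense vector, then score every possible next word.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
loss_fn = nn.CrossEntropyLoss()      # the "surprise" that training tries to reduce
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a toy batch of (previous word id, next word id) pairs.
prev_ids = torch.tensor([11, 42, 7])
next_ids = torch.tensor([42, 7, 99])
logits = model(prev_ids)             # one score per word in the vocabulary
loss = loss_fn(logits, next_ids)
loss.backward()
optimizer.step()

# The learned parameters now encode approximate next-word probabilities.
probs = torch.softmax(model(torch.tensor([11])), dim=-1)

The compactness comes from the parameters: an exact table for a 50,000-word vocabulary needs 2.5 billion entries, while this small network stores far fewer numbers and approximates the same probabilities.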
Although there has been a considerable evolution of work on neural network language models from the late 1990s, the key breakthrough was a particular type of neural network called the transformer.44 Transformers are possible in part because of the recent availability of very large cloud computing systems and very large amounts of easily attainable text on the Web, scanned books, and other sources. The amounts of data and computing power needed for transformers represent a big, resource-intensive change compared to the 50,000 × 50,000 word probability tables where NLP research started.45 Because transformers are at the core of LLMs, we now turn to how they work.
B. LLMs: The Next Big Leap for NLP
The "transformer" is so named because it incrementally transforms an input sequence of data, called tokens,46 in a way that makes it easier to guess the next word in the stream.47 It does this with a mathematical concept called self-attention.48 While we will not detail how this works, we will convey the intuition about how and why it works.
1. Tokens Need Large Datasets and Computing Power
Consider a sentence such as "Sam rode his dirty bike home." Each word is a token (sometimes longer words are broken down into several tokens). Each token is initially replaced by a unique string of numbers. For example, "dirty" might be <0.12, -1.7, 0.23, . . . , 1.55> with a total of 512 numbers, called an embedding. Through repeated practice, the transformer has learned that certain words go together-or "attend" to each other-in particular ways due to the relative regularity of human language (English in this case).49 For instance, "dirty" modifies "bike" but not "his." Attention puts words that attend to each other together, combining their numerical values in a way that can be metaphorically thought of as creating new words through the merging of words. "Dirty" and "bike" become a word that might convey the concept of a bike that is dirty; "Sam" and "his" are combined; "rode" and "Sam" (the subject) and "bike" (the direct object) become a new concept, meaning roughly that Sam is riding a bike.50 The self-attention transformation is applied several times over, creating more and more complex concepts: Sam did not just ride a bike, Sam rode something that was conceptually a bike that was also dirty.51 As you might imagine, being able to make up specialized terms and concepts to mean complex combinations of verbs, entities, and relations means that by the time this process is done, the model has a much better shot at correctly guessing what might possibly come next.
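The numerical sketch below is our simplified illustration of that merging; it uses random vectors and omits the learned projection matrices a real transformer applies. Each token's embedding is compared against every other token's embedding, the scores are normalized into attention weights, and each output vector becomes a weighted blend of the tokens it attends to.

import numpy as np

tokens = ["Sam", "rode", "his", "dirty", "bike", "home"]
dim = 4                                  # real models use 512 or more dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), dim))   # one vector per token

# Attention scores: how strongly each token relates to every other token.
scores = embeddings @ embeddings.T / np.sqrt(dim)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax rows

# Each output vector blends all token vectors, so "dirty" now carries
# information about "bike" in proportion to its attention weight.
blended = weights @ embeddings
print(weights.round(2))   # each row sums to 1; larger entries mean stronger attention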
Because words in this context are just strings of numbers, the transformation and guessing processes involve a series of mathematical matrix multiplications involving, in the case of ChatGPT-3.5, around 350 billion numbers that we call parameters.52 These matrix values are multiplied against the word embeddings to produce values from 0 to 1 on a merger scale. Some of these matrix multiplications produce values close to 1-indicating the words have a high probability of attending to each other, and thus should be connected when trying to predict what value comes next-or produce values close to 0-indicating that the words have a low probability of attending to each other and thus should not be connected when predicting what should come next.53 But how does the transformer know when to attend and when not to attend? Initially, the parameters are set randomly, meaning words randomly attend to each other. This random configuration produces nonsense mergers that do not increase the ability of the model to guess the next word. Nonsense mergers are likely to cause the model to make an error, which is measured using the mathematical concept called entropy, a measure of how surprised the model is by the correct next word.54 This error score is used to figure out how to tweak the parameters so that next time the matrix calculations do a better job at figuring out which tokens attend to each other. Overall, the goal is to reduce surprise within the context of the software's purpose. Surprise is handled in two ways: generalization and memorization.55 We now turn to the way generalization and memorization work.
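As a small worked example of that error score, with numbers chosen purely for illustration, the surprise for a single prediction is the negative logarithm of the probability the model assigned to the word that actually appeared.

import math

# The model's guesses for the word after "the deep blue" (illustrative values).
predicted = {"sea": 0.70, "sky": 0.25, "cucumber": 0.05}

surprise_expected = -math.log(predicted["sea"])          # about 0.36: low surprise
surprise_unexpected = -math.log(predicted["cucumber"])   # about 3.00: high surprise
print(round(surprise_expected, 2), round(surprise_unexpected, 2))

Training repeatedly nudges the parameters so that, across the whole corpus, this surprise goes down.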
2. Reducing Cross-Entropy: Generalization versus Memorization
There are two ways to reduce entropy: generalization and memorization. Generalization happens when the model learns patterns that can be applied to situations it has never encountered.56 For language, generalization means that the parameters of the model have been configured to capture some generalized concept about how words relate to each other. In English, verbs are often preceded by subjects-the do-er of the verb. People who talk about things being dirty often follow with talking about things getting washed. People who talk about things being dropped often talk about things falling. Roughly, generalization is a property that supports emulating the way humans use a range of words and put them together into meaningful sentences. This property has important implications for those who wish to make scientific contributions about the emulation of human language composition because-as in the case of the Turing Test-a human-like text composition system may have to respond to novel situations.57 For example, we might want a chatbot that can respond to a request to describe what would happen to a cat if it were to find itself on Pluto, which might not be something that the model encountered in its training data.
As such, there is a simple fix for text LLMs that are supposed to aid composition but may produce possibly infringing outputs such as entire books or chapters of books: do not use the whole book.58 Instead, use a good sample of a book, such as cutting every fifth page of a novel or other type of book and disallow memorization of large chunks of text.59 Or one might skip every fifth paragraph of a text if smaller granularity is desired. These tactics would allow the LLM to learn the pattern of the language without being able to return a large reproduction of the underlying data. But generalization means the LLM is not trying to be precise. If one wants to address specific topics and be accurate about them, generalization can come up short.60 Prompts for summaries of a specific book or other data do not flow from general language composition. That is where memorization comes in.
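A minimal sketch of that sampling fix follows, assuming a text whose paragraphs are separated by blank lines; the function name and the drop-every-fifth rule are ours, offered only as one way to implement the idea.

def subsample_paragraphs(text: str, drop_every: int = 5) -> str:
    # Remove every `drop_every`-th paragraph so the model sees the style of the
    # book without ever holding a long contiguous stretch of it.
    paragraphs = text.split("\n\n")
    kept = [p for i, p in enumerate(paragraphs, start=1) if i % drop_every != 0]
    return "\n\n".join(kept)

book = "\n\n".join(f"Paragraph {i} of the novel." for i in range(1, 11))
training_text = subsample_paragraphs(book)   # paragraphs 5 and 10 are dropped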
Memorization happens when the model cannot find a general pattern,61 as is the case with facts like the birthplace of Alexander Hamilton. There is no general principle that could be used to guess the birthplace, so the only way to reduce entropy when trying to guess Hamilton's birthplace is to memorize the answer. Thus, memorization is a valuable property of LLMs if one is attempting to commercialize a question-answering system that can respond to questions with accurate knowledge.
In a way, memorization is not about prediction; it is about exact reproduction of passages within the training data.62 Memorization means that some parameters in the model have been set to only work with a very specific sequence of input tokens and produce a very specific output token. For example, imagine you are introducing a professor at a conference, and you want to generate their bio. The professor works at the University of Big State. There is no general property of the universe that would predict what university they work at. To predict a token like "University" followed by "Big" followed by "State" and reduce entropy, the model would need to have learned a very specific pattern that only applies to a very small set of possible inputs. Put differently, that specific data are part of Long Tail data, which are not easily predicted, but necessary for greater accuracy.63 An LLM like ChatGPT will have to memorize the professor's bio to generate the bio accurately.64 But it is a mistake to think generalization and memorization are siloed aspects of LLMs.
Generalization and memorization work together. Generalization may be used to compose the more formulaic parts of an answer to a question that contains a memorized fact. Thus, a hopefully good answer to the question about what would happen to a cat if it were to find itself on Pluto would combine generalized knowledge of grammar and language expression with memorization of facts about Pluto so that the answer was coherent and applied common sense involving cats in cold environments.65 The tradeoff between generalization and memorization reveals that one must know one's goal if one wants to assess whether one's software worked and avoid potential legal issues.
3. Does Your LLM Work? Better Yet, What's Your Goal?
In NLP research, there are specific tasks with particular tests that determine whether progress is made.66 Tests include ways to assess common sense reasoning, question answering, textual entailment, and the ability to use language "in a way that is not exclusively tailored to any one specific task or dataset."67 Whether the system generalizes, memorizes, or does both affects how well the system performs on these tasks.
The combination of generalization and memorization is why modern LLMs are especially good at following directions-the prompt may include clues as to what the user wants to see generated, which we might think of as instructions. That is, the more the prompt signals a pattern that triggers generalized and/or memorized parameters, the more an LLM can deliver plausible results. For example, if you want a scholarly essay about the Lincoln-Douglas debates, you need to account for the likelihood that the model was trained on a large number of example homework assignments available online. The prompt "Write an essay about the views in the Lincoln-Douglas Debates" may produce an output that echoes assignment instructions, such as "The essay should be five pages and cite sources," because the LLM's training data have a lot of examples of homework assignments and test questions. However, if the original prompt is followed by "suitable for scholarly publication," you are more likely to get an answer to the prompt versus an elaboration of possible assignment instructions.68 But what if you are looking for something even more specific?
Suppose you want an essay about the reasons the North entered the Civil War and how the evolution of Lincoln's views on slavery influenced the decision to go to war. An LLM may have been trained on an excellent scholarly article or book chapter on exactly these ideas. As opposed to a search query, where a system will reach out to the Web to find a source that seems to match your query and point you to that source, the LLM may find the prompt fits so well with the essay or book chapter that it returns most, or even all, of the source. As a test of a system that can "converse" with the user as a human, the LLM has done well. Indeed, it will have done so well that its output exceeds the ability of most people on the planet, including many well-educated people unless they are quite well-read or experts on the Civil War. Yet, in this situation, we have moved from prediction to something else. With enough data, your LLM may be memorizing and giving memorized answers that fit a rather precise text, rather than predicting what comes next. In that sense there is a paradox. Your system is not really predicting language outcomes or generating creative outputs. Instead, it is reproducing data. A scientist might care because the elusive idea of AGI is in fact further away than one hoped. A commercial lab may have to admit that it is not near AGI and that the term is silly.69 These LLMs are not necessarily predicting language, let alone becoming the version of human thought that some in the field hoped for. Such LLMs offer commercial upsides even though they are a far cry from being AGI.
Indeed, Microsoft's initial use of OpenAI's software was consumer-focused as part of Microsoft's search engine, Bing.70 The goal was to displace Google's search engine dominance because a ChatGPT powered Bing would be able "to formulate and write out answers to questions instead of just putting out a series of links."71 That result may be something consumers like, but it is essentially a somewhat richer way to present answers to a search query.72 As further evidence of commercial goals of LLM companies, during the editing of this Article, OpenAI announced SearchGPT, OpenAI's foray into the lucrative search space.73 Similarly, Perplexity began an ad revenue sharing model with news publishers when users' search queries (aka prompts) generate results that cite news sources.74 Whether LLM-driven results are accurate, and whether they violate current customs and laws, are separate issues. More generally, the methods and goals for academic output as opposed to commercial output are quite different in ways that matter for AI research and regulation going forward.
III. THE DIFFERENCE BETWEEN ACADEMIC AND COMMERCIAL RESEARCH
The contexts and goals of building a given LLM compared to that LLM's outputs highlight the tension between academic research and commercial application of research. An academic goal may be to emulate human use of language. A commercial goal may be to offer a service that users enjoy and for which they are willing to pay. Those differences matter when considering what rules should govern how we build AI software and the implications of, if not the liability for, outputs.
Corporate practices that ignore the realities of ethical and legal constraints potentially invite lawsuits and ill-advised regulation. The problem is that regulations and other actions can affect both ill-conceived corporate acts and desired academic work. This section explains the differences between academic and corporate research to illustrate why recent reactions to corporate missteps should be tempered to accommodate academic work.
A. Academic Research and Its Goals
Imagine it is 2015 and you are at a university working to advance NLP science. In simplified terms you need data, code to create and train a model, and software that implements the intended purpose of the system (be it a chatbot, or something else) by running inputs through the model and processing the outputs. But before you can run the system software, you need computing power to end up with a trained final model. The necessary data and computer power may start small, like in the case of OpenAI's GPT-1, which used roughly 7,000 books in BookCorpus and had 117 million parameters.75 However, the needs often grow. After all, more data in the dataset will increase the model's performance, and more data require more computing power to process.76
Assuming you have the data and computing power, you will run your experiments and publish a paper. That paper will provide ways to test your results. You will either share the data and code to create the model and/or you will share the fully trained model so the academic community can verify your results through replication. Unlike commercial research, in a way, the work is done once your paper is published and the results are verified. The next step is to see what you or someone else can do to push the science and outcomes further by improving on the software. Thus, you will probably need more data and, by extension, more computing power.
Data are relatively cheap, and computing is expensive. If you want a larger dataset, you will likely scrape the Internet, use a resource like CommonCrawl, and/or add in almost any other source of data that you can. Once the data are gathered, they need to be tokenized and run through a transformer77 to see how well the model accomplishes your goal. Your goal could be predicting what comes next in a string of text larger than a phrase-perhaps shooting for a sentence, a paragraph, or even a short essay.78 Your goal may be to see whether the system can interact with humans in ways that make a human think the system is in fact human.79 The research will likely involve running experiments. It may take 10, 50, or even 100 iterations on the design of the new model to find the one that works as well as desired.80 Those computations are expensive.81 Your work may stop because the practical reality is that experiments that cost millions of dollars are not likely to be done at an academic research institution.82
Put simply, academic research is not usually funded with for-profit strings attached. As such, the model need not be deployed at scale with millions of users paying to use it. Likewise, there is less chance that someone misuses the model or that the model's undesired outputs affect a large part of the population. This situation differs, however, in commercial settings.
B. Commercial Research and Its Goals
Non-academic research eventually draws the best human talent and requires ever larger datasets and expensive computing power, meaning someone will have to pay for those resources. Again, imagine it is 2015 and some folks start a private, non-profit research group to tackle the problem of funding needed for NLP research. As noted above, research labs will likely handle the first stage of research. The next stage is where problems occur. It is easy to underestimate the effect of these costs. OpenAI's early cloud computing costs were close to $8 million, which was about 25% of its annual functional expenses.83 Those resources allowed OpenAI to take the next step in its GPT development and pursue GPT-2. GPT-2 used BookCorpus, added 8 million web pages, and had 1.5 billion parameters, as opposed to the 117 million in GPT-1.84 The results were another step forward because "[t]he diversity of tasks the model [was] able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus [can] begin to learn how to perform a surprising amount of tasks without the need for explicit supervision."85 More simply, GPT-2 performed better than expected at creating news stories and poems, answering questions, translating between languages, and handling other language tasks.
Once GPT-2 was released in early 2019, the interest in OpenAI's approach and how to improve it went up, and with this interest came the need for more resources. While the research question did not change much, the amount of data and necessary computing power increased again. A new need arose: the need to hire talent that might go to Google, Meta, and other corporations in the AI space. In short, OpenAI needed more money than nonprofits usually attract. By one estimate, OpenAI lost around $540 million in 2022.86 To address this need, OpenAI created a system that was supposed to keep its non-profit corporation in charge while allowing a for-profit company to operate under the supervision of the non-profit organization. That structure allowed OpenAI to offer equity to its employees and pursue partnerships, such as its one with Microsoft that began with a $1 billion investment.87
Unlike the academic world where return on investment is not an issue, OpenAI's scientific papers and accolades were not enough. The anticipated costs to build GPT-3 with 175 billion parameters, the promise of equity returns to employees of the for-profit, and investment oversight from Microsoft meant money had to be generated. Shortly after Microsoft's investment, OpenAI announced it was pursuing ways to license its technology for profit as part of having funds to achieve AGI.88 The need for returns has only increased given Microsoft's estimated $13 billion investment to date.89
Once OpenAI switched to commercialization, the difference between lab research and commercial products became acute. A core problem was releasing versions of the software and the API to the world. Rather than worrying about "harm[ing] humanity or unduly concentrat[ing] power" as OpenAI's Charter extolls,90 the company threw the model online to see what would happen. As OpenAI's CEO, Sam Altman said, "[w]hat I think you can't do in the lab is understand how technology and society are going to co-evolve . . . you just have to see what people are doing - how they're using it."91 This approach reveals that the company either had not thought about its goals and the repercussions of releasing its software widely, did not care, or was more interested in commercialization and the fees OpenAI charged for using its product. Recent revenue numbers-$1.6 billion annualized rate at the end of 2023 and $3.4 billion as of summer 2024-suggest commerce has been the driving force for some time.92 To be clear, OpenAI is not alone. Other players in this market such as Anthropic are also pursuing revenue and must do so to justify the investments they have received.93 Even if LLM companies are saying they have built a sort of Swiss Army knife of software that can be used for a wide range of purposes, the motivation for researching this software matters when it comes to understanding and mitigating potential harms and legal liabilities.
Put differently, corporate AI researchers are often trained in, or have a history with, academic settings, but once in a corporate environment, cultures and priorities change. Corporate research is far less restricted in resources and more restricted in outcomes. For example, both academic and corporate software research are data and computing power hungry. Corporate research, however, can access vast computing power, data, and money if it pursues outcomes aimed at commercial uses and products.94 After all, a corporation that invests hundreds of millions, if not billions, in research will expect returns, not Nobel Prizes or Turing Awards. No matter what a company says, once money with corporate strings attached is taken, the ability to follow scientific ethics as opposed to classic corporate ethics diminishes and, perhaps, vanishes. Nonetheless, the next section offers a set of questions to guide AI research in both academic and commercial settings going forward.
IV. A GUIDE FOR DATA IN LLM RESEARCH
Regardless of academic or commercial contexts, AI research is under scrutiny. AI researchers need to understand the ethical and legal boundaries of their research and how they deploy it. Although those boundaries may change depending on the context, we offer a set of questions that can guide research in both contexts.
A. What Are My Data Inputs?
AI research must pay more attention to how it gathers data. The recent outcry over commercial generative software that OpenAI and other vendors have rushed to market has put new pressure on the rules regarding what is allowed when gathering data to build an AI model.95 Much of the furor flows from authors' lawsuits over the use of copyrighted material to build the models and the systems' early outputs. As people found that some LLM software outputted results that might infringe copyright law-such as several pages of a novel or visual artwork that came close to infringing existing copyrighted visual art-they were outraged about how LLMs are built. Even before the focus on LLMs, there have been ethical questions about whether to honor the robots.txt code96 and legal questions about using unauthorized data from "shadow [web] libraries."97 But given copyright's fair use doctrine and arguments that even access to illegal copies can be fair use, researchers have tended to gather data from these sources anyway.98 This approach was customary and accepted as fair use until OpenAI's recent reckless behavior.99 That behavior has spawned a reaction that creates ethical, if not legal, problems for AI researchers.100
The first step in any AI research is to gather data. Because models tend to perform better with more data, researchers will look for larger datasets. In LLM research, the method itself mandates larger datasets. Recall that the shift from symbolic approaches to statistical approaches, such as bigrams and 5-grams, demands a decent-sized dataset and is computationally expensive.101 Yet that approach fell short of using language the way humans do. The next step of using machine learning and artificial neural networks requires even more data and computational power. The shift to transformers necessitates even more data. This need is not simply a hunger for so-called "big data." The models are looking for large contexts to better predict what words will come next. Thus, ChatGPT's evolution in size relates to how well it functions, and, indeed, each larger model performed better than the previous one in significant ways. But where do the data come from?
How we choose what data to include creates an ethics paradox. More data means better-performing models, but the sources of that data may be covered by copyright. The wide breadth of copyright protection means that any writing, including writing posted to the Internet, is protected by copyright. In addition, online sources may have explicit, coded requests not to crawl the site or use the material for software training. Objections to using copyrighted material can range from absolutist-never use my writing to train software-to economic-you should pay to use my material.102 We explore the differences between books and the Web to highlight the issues around using copyrighted material as software inputs.
1. Books as LLM Inputs
Legal precedent from the Google Books and HathiTrust projects establishes that a commercial entity can copy copyrighted material-even millions of books-to index the material and enable meta-analysis.103 Even before those cases, copyright law deemed copying without permission to be fair use when done to reverse engineer software, to build software that detects plagiarism, and, in a general sense, when copying is for "non-expressive uses of copyrighted works."104 As recent legal scholarship has explained, the problem comes from the outputs of such training, not the training itself.105 We agree with that analysis. Nonetheless, the reckless behaviors of OpenAI and other companies in releasing generative AI such as ChatGPT have called customary and accepted fair use practices into question. These reactions offer a chance to explain why data are needed and where choices in data gathering may matter more for future software development.
Assume one wants to use only public domain material, that is, material not under copyright. Under U.S. law in 2024, the safest bet is anything published before 1929, because the copyright will have expired, and the work is in the public domain.106 For anything published after 1929, one would have to track down whether authors obeyed copyright formalities such as registering the work, publishing with a copyright notice, and renewing the copyright on time (with extra nuances depending on the dates when the work was published). 107 One would spend a good amount of time and money trying to track down whether the work is under copyright.108 Sticking with public domain material, however, presents other kinds of problems. Even if there is enough publicly-available data and one can avoid the costs associated with checking for copyright protections, solely relying on publicly available data limits AI research in dangerous ways.
A major critique of AI systems is that they lack representative or inclusive data.109 Work published and available before 1929 is likely to come from a less diverse set of authors, because diverse voices were not well-published, if at all, until recently.110 Diverse, as used here, should include the whole range of authorship from all corners of humanity.111 The paradox is that by respecting copyright, you would necessarily cut out the very voices that have asserted they are underrepresented in machine learning.112 Diverse should also include a broad range of topics and inquiry flowing from the expansion of published scientific knowledge since 1929. Imagine if the dataset were stuck with nonfiction and science from before 1929. The data would be quite biased towards an understanding of the world that is literally out-of-date and would miss, for example, the theory of chemical bonds, results from using the electron microscope, the discovery of nuclear fission, the science behind improving Mexican wheat and the resulting green revolution, the invention of the transistor, and the discovery of DNA-and this list simply stops at around 1950.113 Thus, having a larger book dataset is good for one's model because more data likely improves the model's performance on a pure predictive performance metric and allows the model to incorporate a larger range of voices. Yet, this goal still leaves open the question of how to get the book data.
Whether one's research is academic or commercial affects how one collects book data. Professor Matthew Sag argues that lawful access should not be required as a per se rule for non-expressive uses of copyrighted material, even for commercial defendants.114 In that spirit, he argues that if a company such as OpenAI could not obtain digital copies of text "without a contractual promise not to engage in non-expressive use, faulting them for obtaining a copy in the shadowy corners of the Internet might seem a bit churlish."115 This position raises at least two issues.
First, one does not need to ask permission to use IP if the use falls under fair use.116 As the Supreme Court has pointed out, "[e]ven if good faith were central to fair use . . . being denied permission to use a work does not weigh against a finding of fair use" because one might ask for permission "in an effort to avoid litigation."117 Yet, if one were told to "only use the data for non-expressive use" and one agreed to the terms, one could not use the material for an LLM if the purpose would be to create outputs that are expressive but not necessarily infringing on someone's copyright.118 As such, one would be more likely to reject the terms and proceed with one's offering based on the copyrighted work. Nonetheless, whether the use will be allowed requires further analysis.
Even when one does not need permission to use a work, acts that can be characterized as theft to obtain access to the work do not fare well in fair use analysis. In Wright v. Warner Books, the Second Circuit rejected an argument of bad faith and failure to gain permission in part because the material at issue had been given to the alleged infringer, "not stolen."119 The court contrasted the situation, however, with the sharp practices the Supreme Court condemned in Harper & Row Publishers, Inc. v. Nation Enterprises,120 where the Nation used a "purloined" manuscript instead of bidding on the market for the material.121 Furthermore, the Google Inc. case, which supports using copyrighted material for software training and non-expressive uses, noted, "Google provides the libraries with the technological means to make digital copies of books that they already own."122
Legal scholars point to touchstone cases about reverse engineering to support the ways LLM companies have gathered data. Those cases differ, however, from the situation with the book corpora at issue here.123 In Sega Enterprises Ltd. v. Accolade, Inc., the defendant reverse-engineered video games by using "commercially available copies" of the plaintiff's work, not stolen copies.124 In Sony Computer Ent. v. Connectix Corp., the defendant "purchased a Sony PlayStation console and extracted the Sony BIOS from a chip inside the console."125 Thus, cases about reverse engineering software did not involve theft, as the reverse engineer could buy the material as step one in the reverse engineering process. The court in A.V. ex rel. Vanderhye v. iParadigms, LLC, held that the use of essays to train plagiarism software was fair use.126 But again, the access to the underlying copyrighted material was lawful.127
Although one may not need permission to use copyrighted material when the use is fair use, cases that support this position involved legally obtained access.128 Even when a court has found fair use where access was illegal, the fact of illegal access weighed against fair use. In NXIVM Corp. v. Ross Inst., the court found that posting a small portion of a single manuscript to the Web was fair use because the use was transformative.129 Nonetheless, the court still noted that when "access to the manuscript was unauthorized or was derived from a violation of law or breach of duty, this consideration weighs in favor of [the] plaintiffs" and against fair use.130 Although the case seems to allow unauthorized access if the use is transformative and overall fair use, the access and use were on a small scale compared to the book corpora used by OpenAI. A rule allowing access to books, even unauthorized copies of the books, aids AI research and fits within fair use ideals. Yet the assumption that OpenAI's conduct was proper because it followed academic customs is problematic.
OpenAI's use of a little more than 290,000 titles from "shadow [web] libraries" that host illegal copies of copyrighted books131 is, at least, bad optics and, at worst, bad management.132 Unlike most academic researchers, OpenAI had vast amounts of money for its research but chose not to spend the money on its data. In contrast, in the Google Inc. case, Google worked with libraries that had purchased the books and also invested in copying technologies to digitize the books.133 By one estimate, Google spent $400 million to digitize 25 million books or roughly $16 per book.134 Rather than taking a shortcut, OpenAI could have paid roughly $20 per new hardcopy of each book for a total of $5,800,000.135 OpenAI would then have had to invest in scanning technology and hire people to operate the scanners. As folks at the venture capital firm Andreessen Horowitz put it, the cost of hardware and software to build and train models is high, but "within reach of a well-funded start-up."136 The same logic should apply to the data that go into the model. It is expensive work, but expenses are a part of the marketplace.
The Google Books and HathiTrust cases provide clear guidance that one can copy all of a given piece of copyrighted material for training.137 They do not bless all the possible ways you might obtain that material. OpenAI's attempt to cloak itself within academic standards by creating a private, academic-styled organization with non-profit ideals was not a justifiable shield once the company took a commercial turn. The simplest lesson for LLM companies is to make sure they know where the data come from and that the source is legally sanctioned.
Even if fair use allows academic (and possibly non-academic) researchers to use book corpora available online, researchers should pay attention to how models built on such resources will be used. If the research is published and used in limited academic settings, the work should be protected under fair use. Things change if the researcher envisions starting a small company that might offer commercial products, or seeks to sell the company outright, as when Google bought DeepMind.
Those engaged in non-academic research must understand that they cannot assume that what is acceptable for basic research remains viable for commercial results. Even if one accepts OpenAI's initial claims that it was a nonprofit pursuing AGI for non-commercial outcomes, these claims do not justify creating a system that uses memorization and illicit data for its products. This point applies to Google, Microsoft, and any number of companies that employ excellent former academic researchers who forget that they are no longer in an academic setting.
When research is put into practice as a product, the inputs and the outputs matter. If those inputs or outputs violate intellectual property law, there can be repercussions. Questions about what role the inputs played in building the software will arise, and fines and injunctions on using the software's outputs are possible. In addition, as the legal landscape evolves, software makers may find a sort of shadow on their product. This shadow may allow anyone with a copyright on the underlying data to argue the model is unauthorized, meaning payment is owed.138
Furthermore, OpenAI's actions highlight a legal and ethical dilemma for academic researchers using books for AI research: access to copyrighted material will either be expensive or depend on illegal copies online.139 Professor Sag has said that as a practical matter, requiring lawful access to copyrighted material such as books is "churlish."140 Insofar as he means copyright holders will either say no, attach undesired limits, or ask for exorbitant amounts, he is correct. This assessment does not, however, address whether using illegal copies of books is fair use and, by extension, how researchers can continue to use shadow libraries without fear of lawsuits. Future laws may clarify that using shadow libraries is fair use. We think that a better option is possible: a centralized common access point for copyrighted books. Before we explain that solution, we turn to another important input for LLMs-the Internet.
2. The Internet as LLM Inputs
Data gathering may include book corpora, but it will also include even larger amounts of copyrighted work: writing and images on the Web. As is well known, these sources are also covered by copyright protection.141 Internet data is vital because of its size. For example, in one training of GPT-3, 432 billion tokens out of about 499 billion total tokens came from the Web.142 Before OpenAI's launch of ChatGPT and DALL-E, almost everyone on the Web wanted their Web pages indexed and findable, so they allowed their sites to be crawled by search engines.143 That crawling is what also allows someone to scrape the material for software training. But what if a website does not wish to be crawled? By custom, a website can place a "robots.txt" file at its root as a signal to search engines and other crawlers not to crawl the site. Obeying robots.txt instructions is, however, voluntary.144 Indeed, some archive groups ignore the signal because it defeats their preservation projects.145 Yet honoring requests not to crawl can still provide a large dataset, for now.
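To see how voluntary this custom is, consider a minimal sketch of the check a well-behaved crawler performs before fetching a page. The site and crawler names below are hypothetical, and nothing in the protocol forces a crawler to run this check at all.

```python
# Minimal sketch: how a well-behaved crawler voluntarily checks robots.txt
# before fetching a page. The site URL and crawler name are hypothetical.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt, if any

url = "https://www.example.com/articles/some-story.html"
if parser.can_fetch("MyResearchCrawler", url):
    print("Custom permits crawling this page.")
else:
    print("robots.txt asks crawlers like this one to stay out.")
```

The check is purely advisory; a crawler that skips it faces no technical barrier.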
For a training dataset that represents a wide range of voices and perspectives, the often-used CommonCrawl is a strong candidate. But that resource is under attack. CommonCrawl claims to obey robots.txt and no-follow signals.146 Its November-December 2023 crawl release still has "3.35 billion web pages (or 454 TiB of uncompressed content)."147 Thus, it seems researchers can still access a robust dataset via CommonCrawl and respect online customs regarding how to use data. This situation is, however, changing.
Some important websites are adding what we call generally "do not train" signals to their web pages and,148 in some cases, to their terms of service.149 That is, websites are indicating that people can scrape and index the site, but not use that data to train software-what could be called a "scrape but don't train" signal. This change raises new questions for academic and corporate research. If CommonCrawl offers a dataset that has "do not train" data, will users know that? Is CommonCrawl liable for offering such a dataset even though CommonCrawl cannot police how the data will be used? What, if anything, can a user know about the rules around CommonCrawl data going forward? These questions and more will plague AI research if restrictive "do not train" terms of service persist and spread.
The move to prevent most, if not all, ability to scrape Web data to train software is a mistake. Publishers and authors have understandable concerns over whether their material is being used to generate works that compete with their human, creative outputs. Outputs are not inputs. Yet, rather than focus on outputs, the response to OpenAI's products has been to limit access to inputs. The NY Times, CNN, Reuters, and other major news organizations are blocking, at least, OpenAI's scraping tool, with the NY Times additionally blocking CommonCrawl.150 The BBC is exploring how to use generative AI systems in its reporting, and it is blocking LLM companies from crawling BBC sites to fuel those companies' software development.151 The BBC's position is consistent with the emerging practice of large entities using their material to build an LLM internally.152 The choice is ironic. The technology behind using one's own data to have an internal LLM relied on research that needed to access the very sites that are now removing access. In addition, in the age of data journalism, the broad bans on access to site data are either short-sighted or hypocritical.
The definition of prohibited training can be broad enough to be absurd. Training is such a fundamental term that it includes many ways almost any machine learning model would work. The NY Times, a leader in data journalism, offers a good example of over-reaching terms of service; the paper's own data journalism shows the irony of those terms. The NY Times uses data mining and generative AI for the paper's reporting.153 For example, the NY Times published an investigative report about lies around the 2020 election.154 The paper had obtained "more than 400 hours of video recordings" of key groups organized with the goal of keeping the lies active.155 The paper used LLMs to transcribe the recordings and generate nearly five million words of text.156 The reporters still needed to go through millions of words, and used LLMs which "allowed [the reporters] to search the transcripts for topics of interest, look for notable guests and identify recurring themes."157 The NY Times "obtained" the videos158 and more, which suggests good reporting with access to material that is not fully public. But suppose the videos were online and had limits on access and how one could use them. Would the NY Times have obeyed? One need only turn the investigative lens onto the NY Times to see the issue with restrictions on access and use of data.
Suppose you wanted to study the NY Times to detect bias in reporting. You may think the paper is too liberal, or racist, or favors a certain political view, or disfavors a view in subtle ways.159 The language in reporting would be a way to examine and support your hypothesis. To test this hypothesis, you may want to train a classifier model or a language model just on NY Times articles and perform statistical analyses. Under the terms of service, however, you would need the NY Times' written consent to access the site.160 You would also need permission to use any automated software to gather data from the site.161 Even if you had that data, you could not "use the Content for the development of any software program . . . including, but not limited to, training . . . a machine learning or artificial intelligence (AI) system."162 The words "the development of any software" encompass writing a script to analyze data. The Terms also prohibit caching and archiving the data.163 Thus, if someone wanted to show how they arrived at their conclusions and that the analysis was valid, that person would fail. Without the data, a third party cannot verify results.164 These restrictions raise the question: would the NY Times really obey its own Terms of Service? Perhaps not.
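The kind of study described above need not be elaborate. A minimal sketch, using invented placeholder snippets and labels rather than any NY Times content, shows how little code counts as "training" a machine learning system under terms like those quoted above.

```python
# Minimal sketch of the kind of classifier a bias study might train.
# The example texts and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = [
    "City council praises new housing plan as bold and necessary.",
    "Critics slam housing plan as reckless government overreach.",
    "Report finds mixed results for the city's housing initiative.",
    "Opponents warn the initiative will burden taxpayers for years.",
]
labels = ["favorable", "critical", "neutral", "critical"]  # hypothetical coding

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(articles, labels)  # this step is, literally, "training"

print(model.predict(["Officials call the plan a needed step forward."]))
```

Even this toy pipeline is the "development of a software program" and the "training" of a machine learning system; the point is how low the threshold sits.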
The NY Times has embraced data journalism and analysis especially through its UPSHOT section, and that work further shows why restrictive terms of service make little sense. The UPSHOT prides itself on analyzing data as part of its reporting. It describes itself as "[a]nalysis that explains politics, policy and everyday life, with an emphasis on data and charts."165 Part of its mission is "unearthing data sets-and analysing existing ones-in ways that illuminate and, yes, explain the news."166 One of the Section's Top 10 stories of 2023 was Nate Cohn's piece, 6 Kinds of Republican Voters.167 As he said, "the first output of [the] statistical analysis - called a cluster analysis - . . . didn't make sense to me [], at first. It took a while to make sense of how my algorithm was segmenting Republicans."168 In the article, one group, Libertarian Conservatives, was perplexing because it was "near the middle of the pack on almost every set of issues."169 Yet his algorithm "set them apart."170 The words "first output" and "cluster analysis" are about algorithms trained on data.171 Indeed, Cohn and his team had to analyze the algorithm and perhaps rerun the data to generate new outputs beyond the first output to make sense of the data and what it seemed to show.172 That work is software development and training; Cohn used a NY Times survey and so developed his data in-house.173 But what if a good data journalist gathered external data to analyze it? That is what the Times did with another piece.
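Before turning to that piece, a minimal sketch of a cluster analysis of the sort Cohn describes may be useful; the survey responses below are random placeholders, not the NY Times data.

```python
# Minimal sketch of a cluster analysis like the one described above.
# The "survey responses" are random placeholders, not the NY Times data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 500 hypothetical respondents, 6 issue-position scores each (0 to 1)
responses = rng.random((500, 6))

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
groups = kmeans.fit_predict(responses)  # the "first output" is just these labels

# Making sense of the output means inspecting the cluster centers,
# rerunning with different settings, and iterating -- that is, development.
print(np.round(kmeans.cluster_centers_, 2))
```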
The UPSHOT used external data and analyzed it for an article examining "How Formulaic" Hallmark and Lifetime movies are, "[b]ased on data from IMDb, internet recaps and our own viewing."174 IMDb's Terms of Service include language requiring written permission for any commercial purpose.175 The reporters used IMDb's API to access the metadata.176 Whether the NY Times paid the $150,000 plus metered costs for basic API access or used the 1-month free trial is unclear.177 Nonetheless, the NY Times seems to have had permission to gather and analyze the data. But "Internet recaps" likely come from a range of websites that may not be happy to have their content scraped or gathered in some other way. Whether the reporters checked those sites' Terms of Service statements is unknown. Even if the sites did not have Terms of Service statements, the ethos of not using another site's data for your own purposes would seem to indicate that the reporters ought not to have used those recaps or, at least, should have asked permission as a courtesy. Even if all the data were gathered with permission, did the reporters let sources know they would use software to analyze the data? What if the reporters wrote a little bit of code to do that? As with Mr. Cohn's work, that is "development of software."178 The code might be useful for future stories and generating new insights, which is yet another way that the work is development.
In general, permission-driven systems for Internet-available material raise many problematic questions. Must the NY Times, other news agencies, NGOs, and other civic groups ask for permission when they gather data in reporting or other similar attempts to use the Internet to study society? Will they need permission to build software to analyze that data? Will that permission allow further use of the data and software, or will the permission be limited and short-term? What about vital advances in AI related to vision research or cybersecurity? Will they be encumbered if they used the Internet as part of their results? What about academic research? Will academics need to clear all material for research in language, vision, robotics, and other AI research? Will that research also have limits on further use? These questions are maddening and a clearance nightmare, but they follow from restrictive legal terms. Indeed, the history of commercial entities engaging in copyright licensing and clearances shows the problems of giving copyright holders too much power.
3. We Should Not Repeat Mistakes of Copyright Licensing History
A future where a copyright holder could argue that data were allowed for one type of training or software but not another would lead to a ball of confusion and litigation.179 Consider the issues around music use in film and television. A show can be inaccessible if a copyright holder allowed use of a song for TV broadcast, but not on a video format (VHS, DVD, Blu-ray, streaming, etc.).180 For example, lack of clearance rights for music used in visual media has stopped access to important work on civil rights and limited the availability of films and TV shows on streaming services.181 Recent works suffer from copyright strangleholds too.182 Expanding copyright to cover use of copyrighted material for training would exacerbate the anticommons problem that Michael Heller and Rebecca Eisenberg identified more than twenty-five years ago.183 Put simply, increased copyright power creates a Gridlock Economy-one where property owners act as Robber Barons overcharging for access to the commons.184
Expanding copyright to include software training would also ignore what Jerome Reichman and Ruth Okediji figured out more than ten years ago. Copyright laws already pose a threat to scientific research, especially research that uses digital techniques such as "data-mining techniques, and other automated knowledge discovery tools."185 Allowing the copyright industry to control inputs to, and dictate terms of use for, software development would make the industry gatekeepers of future uses of software. It would also offer too much power to an industry that already overreaches with over-long copyright terms and extraordinary barriers to access to knowledge.
Furthermore, the shift by some LLM companies to license rights to train their software186 reveals problems with deferring to the copyright industry. The move does not address whether anyone should be required to license at all. Indeed, fair use law maintains that licenses are not required when the use is fair use.187 Yet some groups argue one must license if one can.188
In addition, given the recent concerns over antitrust and technology in general, deference to copyright power and the copyright industry-which the Federal Trade Commission's (FTC) recent statements about AI training and copyright embrace-is a mistake.189 The FTC's fears about "training an AI tool on protected expression without the creator's consent"190 are myopic and simply miss the abuse and anti-competitive nature of the copyright industry. The statement assumes authors and artists will be compensated, but the data are concentrated in large publishers. Moreover, many actual creators are skeptical of the current licensing deals because of where the money may go and for fear that the deals will foster software that might reduce the need for the writers who created the material in the first place.191 Furthermore, the FTC's empty statement about compensating creators ignores the reality of the market. If one moves to require licensing, the amount of data required would necessitate spending millions if not billions of dollars.192 As one researcher noted, an entity that can "lock up" content will do so, but that creates a paradox.193 Once the lock is in place, early movers who grabbed data are "bless[ed]," because "the ladder [is pulled up and] nobody else can get access to data to catch up."194 In short, assuming and/or agreeing that all copyrighted data needs a license to train models fosters the exact outcome that the FTC claims not to want: a limited and anti-competitive market that also probably does not pay authors. Lastly, this view misses the issues around fair use that arguably thwart the anti-competitive power of copyright. Well-funded companies can afford to experiment with a mix of licensing and fair use arguments and litigation; other actors cannot.
Even if the fair use doctrine protects academic research and non-expressive uses of copyrighted material, a custom of "do not crawl" code and restrictive terms of service creates moral and ethical mayhem for legitimate uses. More sites are signaling that they do not wish to be crawled and/or used for study and development of AI. Thus, researchers will be forced to ignore custom and perhaps transgress online ethical standards.
Evolving legal standards pose problems too. Some caselaw suggests failing to use metatags such as "no archive" creates an implied license.195 By extension, if one uses the "no archive" metatag, one denies the license. At least one court has held that implied licenses are particular; a license to use web content for one purpose does not give license for another purpose.196 Worse, potential legal action can stop research and innovation before they start.197 Fear of litigation could scare risk-averse research institutions into getting clearance before undertaking projects.198
The focus on a sliver of reckless LLM outputs is myopic and misses the broader ways that LLMs offer advances in AI research and business productivity. Paragraphs, book chapters, essays, and images that may violate copyright are a subset of potential outputs from LLMs. Search, spell-check, auto-complete, customer support systems, analysis of legal and financial documents, and language translation software199 are but a sample of the sorts of systems that leverage LLMs. Fundamental breakthroughs in NLP, neural networks, self-improvement, randomness and creativity, computer vision, and other AI research areas also require access to the copyrighted material on the Web.200 In short, more data and high-quality data are important for future advances and future applications of LLMs and AI research in general.
Ex-ante restrictions on access to copyrighted material for purposes that are non-expressive and non-infringing will recreate problems that unduly burden the use of copyrighted material. X, formerly Twitter, already illustrates this with API access that is either nearly useless or prohibitively costly. The $100 per month access provides only "0.3 percent of [the data researchers] previously had free access to in a single day."201 The enterprise access plan is apparently $42,000 per month and still does not allow enough access for some major studies and analytical tools.202 X's approach shows the way that access can be cut off and/or overpriced for just one centralized platform. Like X, Getty Images, The NY Times, Penguin Random House, HarperCollins, Simon & Schuster, Hachette, and others could all demand payment for any access.203 Denying access or charging exorbitant fees is not the only problem with ex-ante restrictions. Once the copyright industry can exert power over access to copyrighted material for training purposes, it will likely add terms controlling what training is allowed (i.e., what happened with The NY Times' terms of service).
B. Data Outputs – Simpler Fixes
Past clashes between platform companies and the copyright industry show a simpler path forward; commercial companies must be more careful. Had OpenAI followed the customs and legal guidelines about online copyrighted material, the debate around copyrighted material as inputs might not have been resurrected. As mentioned previously, outputs matter. Indeed, at least two computer scientists were "surprise[d]" that OpenAI allowed issues around preventing its outputs from offering copyright-violating results to "get this far."204 AI companies must test whether software generates outright IP-infringing outputs. AI companies must also use common filtering techniques to mitigate potential harm. Neither of these practices means that a company can identify all the possible negative outcomes.205 But relying on ex-post fixes and/or generosity to identify harms, such as paywall bypassing, suggests that the company is either clueless about the classic issues between platforms and the content industry, or that the company does not care.206 The key question is: "Did a company at least try to mitigate harm?"207
A consistent rule in clashes between intellectual property rights holders and platform companies is that platforms should work to mitigate intellectual property harms. The Google Books case again offers instruction. Unlike OpenAI's approach, which yielded a full page or pages of a book, the Google Books Project used snippets of books and has done so for almost 20 years.208 Snippets are about three lines of text, and Google placed guardrails on results so users could not prompt the system to reveal whole pages or even a string of snippets that would reveal a whole page.209 Instead of throwing items out the lab window onto the Web to see what would happen, OpenAI should have tested its software in-house and asked questions about what it might do. Indeed, any company should not only test its software but also ask what the software is good for or supposed to do.210
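A guardrail of this kind can be conceptually simple. The sketch below illustrates the idea only; it is not Google's actual implementation, and the limits and names are invented.

```python
# Minimal sketch of a snippet guardrail of the kind described above.
# This is an illustration only, not Google's actual implementation.
SNIPPET_LINES = 3          # roughly three lines of text per snippet
MAX_SNIPPETS_PER_PAGE = 2  # never enough snippets to reconstruct a page

served = {}  # (book_id, page) -> number of snippets already served

def get_snippet(book_id, page, page_lines, start):
    """Return a short excerpt, refusing once a page has been sampled too often."""
    key = (book_id, page)
    if served.get(key, 0) >= MAX_SNIPPETS_PER_PAGE:
        return None  # guardrail: no stitching a whole page from repeated requests
    served[key] = served.get(key, 0) + 1
    return "\n".join(page_lines[start:start + SNIPPET_LINES])

# Example: after two requests, further requests for the same page are refused.
page = ["line %d of the page" % i for i in range(1, 31)]
print(get_snippet("book-123", 7, page, 0))
print(get_snippet("book-123", 7, page, 3))
print(get_snippet("book-123", 7, page, 6))  # None: the guardrail kicks in
```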
Recent Internet technology research and product history illustrate what to do and what not to do. When Google built Street View, its mapping cars not only took pictures, but also gathered Wi-Fi data including user emails, passwords, and other information.211 Some of that data was "scooped up," even though those data were not needed for the product.212 Google's initial position was that information visible in public (including data that unsuspecting people did not encrypt) is by definition not private, and so could be gathered as part of the project.213 After a multi-state lawsuit, Google settled and admitted that its practices violated people's privacy.214 Google is not alone in its research errors. Facebook researchers published an infamous paper about how they could create social contagion online.215 The paper shows that newsfeeds can indeed affect how we feel. When there are fewer positive posts, people create more negative posts and fewer positive posts. Alternatively, when more negative posts are in someone's feed, the opposite occurs.216 Understanding this dynamic means one could better govern social media, but it also shows that social media companies could manipulate users. In addition, by conducting the study on 689,003 users, the researchers transgressed an ethical boundary.217 In contrast, although many people want to decry Amazon's in-house hiring tool that disfavored hiring women, Amazon tested the software and chose not to use it because of its flaws.218 Amazon's approach is the better one.
Insofar as OpenAI wanted to offer an accurate search tool service, the possibility that its software might produce problematic results should not have surprised the company. Once the science went from building software to show general language prowess to providing accurate answers, which requires memorization, the outputs were more likely to infringe or be close to infringing intellectual property law. Prompts such as "write in the style of X," or "summarize Y book by Z author," have yielded results that are the basis of some of the current lawsuits.219 In one case, rather than limiting outputs, OpenAI allowed a user to ask for a summary of several chapters of an individual author's book because the system could be prompted to summarize Chapter 1, Chapter 2, and so on.220 Such results would be analyzed under the third factor of fair use analysis, which assesses the amount or substantiality of the portion used. Although summaries may be legal, summarizing each chapter in detail could likely be seen as using too much of the original work and thus count against finding the use fair.
To be clear, LLM services are not like Napster or Grokster where entire songs were shared. But, without better guardrails, the services give the impression of widespread copyright output problems. As two computer scientists put it, "[F]rom a practical perspective, the idea of people turning to chatbots to bypass paywalls seems highly implausible, especially considering that it often requires repeatedly prompting the bot to continue generating paragraph by paragraph. There are countless tools to bypass paywalls that are more straightforward."221 Nonetheless, by not thinking about how prompts might output the entire text of a book like Oh, the Places You'll Go! or the first 3 pages of Harry Potter and the Sorcerer's Stone,222 OpenAI was, at best, cavalier about its outputs and, at worst, possibly reckless.223
Any company with an online presence, especially one that claims to be deeply concerned about the misuse of AI, must understand the issues around online copyright and should know that users are likely to use any tool possible to gain access to copyrighted works. That issue has been at the core of Internet law since the birth of the commercial Internet. Indeed, two laws, Section 230 of the Communications Decency Act (Section 230),224 and the Digital Millennium Copyright Act (DMCA)225 are well-known laws born of that tension and have been around since 1996 and 1998 respectively. Yet the lessons learned from commercial practices since those laws' advent evaded OpenAI's management.
The two acts have flaws, but, in a way, they balance each other. Section 230 offers platforms, such as YouTube, protection for hosting third party content that could infringe IP rights. But the DMCA requires platforms to take reasonable steps to delete or prevent access to infringing copyrighted material once the platform has notice from the copyright holder.226 As a practical matter, these two rules clash or, at least, might look absurd together. For example, in Viacom Int'l Inc. v. YouTube, Inc., the sheer number of potential infringements on YouTube meant the film, television, music and sports industries faced nearly 100,000 violations that were known to YouTube.227 The music industry sued, and part of its argument was that Congress could not have meant to offer a solution where, over time, a huge volume of unauthorized copies of music were on YouTube.228 To defend itself, YouTube could have stuck to Section 230 protection, which had been upheld in cases with similar issues around intellectual property.229 Yet, rather than rely solely on the law, YouTube built Content ID, which aids music publishers in identifying potentially infringing material. The publishers can decide whether to send a notice and takedown letter or monetize a use of copyrighted material that would otherwise be difficult to negotiate.230 That investment meant Viacom limited its suit to acts prior to Content ID's implementation.231
Online trademark rights offer another example of better management.232 The online marketplace, eBay, hosts many sellers who are difficult to police and sometimes sell counterfeit goods.233 For instance, Tiffany sued eBay and argued that eBay knew of the counterfeits and benefitted from those sales.234 This argument, however, failed because of eBay's proactive steps in building a system that let rights holders provide eBay with evidence of counterfeit goods so that eBay could shut down offending sellers. In addition, eBay spent close to $20 million a year on trust and safety, which included having nearly 4,000 people on the trust and safety team, 70 of whom focused on counterfeiting issues.235 Arguing that safe harbors do not apply because OpenAI is not hosting third-party content misses the point.236 eBay was not protected by the safe harbors either-it simply chose to try to mitigate harm from the start.
The core lesson from the eBay case is the more a platform or service mitigates harm, the better. Even with smarter business practices in place, issues around access to data will persist. Future access to copyrighted inputs will mean a shift in research practices and new approaches to balancing access to research inputs and the possible need to compensate some, but not all, copyright holders.
V. THE PATH AHEAD
The reality is that LLMs, such as ChatGPT, need not be a threat to copyright-based industries-yet lawsuits and the outcry to be paid persist.237 Researchers and litigants have probed ChatGPT in aggressive ways (that are properly characterized as attacks) and found troublesome outputs. As the public found out about these edge cases, even the FTC took notice with some ill-advised statements about "generative AI."238 Public scrutiny, protest, and the threat of lawsuits have forced OpenAI and, by extension, other so-called "generative AI" companies to build guardrails. These guardrails include limiting the sorts of prompts a system will follow, fine tuning to prevent verbatim outputs, and output filtering.239 Why, then, does the drumbeat for compensation and over-reaching statements from the FTC about "training an AI tool on protected expression without the creator's consent" continue?240 Part of the furor is wrapped up in issues far larger than copyright compensation, such as the future of work, support for news reporting,241 misinformation, etc.242 But make no mistake, part of the argument from the copyright industry is its usual argument that all uses of copyrighted material must be compensated.243 In essence, the argument echoes the Lockean approach to copyright, where the labor behind creating copyrighted material merits treating the material the same way as real property.244 That view further holds that any uncompensated use is free riding-a logic that has been mainly rejected.245 Nonetheless, whether access to illegal copies of books will be allowed as part of fair use analysis is an open question. Furthermore, as Pamela Samuelson, Christopher Jon Sprigman, and Matthew Sag note, failure to respect robots.txt or other "mechanisms" designed to stop scraping for training purposes might count against fair use.246 Thus, exactly how researchers should proceed regarding Internet content is also unresolved. This section offers solutions for research access to books and then for Internet content.
A. Maintaining and Expanding Access to Books
Large book corpora are vital to LLM research, but access to copies of books for legitimate software training is scarce. One can reject the idea that all uses and access to books must be with permission and still recognize that using "shadow libraries" with 200,000 to 300,000 books raises problems. Such illicit access may not be fair use.247 As a matter of ethics and the optics of fairness, the outrage and uncertainty over the use of shadow libraries are prominent regardless of whether the researchers are academic or non-academic. The issue needs to be resolved, and we believe there are a few possible solutions for using book corpora going forward.
Fair use presents possible solutions, but is unlikely to solve all the issues. Courts could recognize using such corpora as fair use for all software training.248 This approach would track the idea that one does not need permission to use copyrighted material when the use is fair. This approach would allow a broad rule that avoids the myriad of issues that will inevitably follow if the copyright industry is able to exert ex ante control via licensing and restrictions on types of use. It would not, however, bless all outputs flowing from the software training. Thus, the copyright industry could vindicate rights when outputs that violate copyright law are created. As we have explained, however, the scale and nature of access via shadow libraries means this outcome is not obvious or guaranteed.
Courts may also hold that one must have legitimate access to the underlying work even when the use is fair. In this approach, one does not need permission. Instead, as with 2 Live Crew and other cases where permission was not required, one would have to have legal access to the material.249 If one wanted to use digital books, which might carry terms of service about licensing and restrictions on use, courts should allow someone to buy the copy and then ignore such terms. As such, courts would establish that evading digital copy controls, including using tools to circumvent Digital Rights Management controls, is allowed for such training as part of fair use doctrine.250 For physical books, researchers could buy and scan physical copies of the books. This idea would be in line with what the Google Books Project did. Buying and scanning may be viable for a commercial company; it is not viable for academic research. Moreover, older books that are not easily purchased increase the problem of access to the data in them.
Furthermore, having many people buy and scan books, including hunting down older copies of books to create their own library, is inefficient. Lastly, there is the problem of security. If the digital library is not secured, it could end up being shared and spread across the Internet in the same way data breaches spread other valuable data.251 Thus, we propose a radical solution.
Society should create a centralized, trusted source for access to book data. The best way to accomplish this task is to have the libraries that have contributed to the Google Books Project Corpus (GBPC) or the HathiTrust expand access to their data as a resource for LLM training in academic research.252 Although the HathiTrust is a potentially viable option, we use the GBPC as an exemplar of how such a solution could work. As of January 27, 2023, the GBPC has "more than 40 million books in more than 500 languages."253 Thus, it dwarfs the shadow libraries' mere 290,000 titles and has the blessing of the law as fair use for training and non-expressive uses. Access to the GBPC could be at a price per book, in which case the price should not be the full $20 per book. The idea is not to make this approach into a cash font for the libraries and Google. Instead, the goal is to have a reasonable rate so research can be done in many different countries, given the breadth of languages in the GBPC. For example, if someone wanted a corpus of 300,000 books, at a rate of $0.10 per book, it would cost them $30,000. Half the revenue could go to libraries and Google (given that Google is likely the best placed to host and manage the program). The other half could go to a pool that the Copyright Office could distribute to publishers to give to authors on a pro rata basis. We acknowledge that authors may be quite upset at the small amount of money they may receive. This situation is not, however, the same as streaming music, where each play of a song bears a direct relationship to the service. Indeed, the question of how much a particular work or author contributes to an LLM output will be a challenge for plaintiffs in current lawsuits and for any payment based on amount used in an output. Even if a book's copyright is somehow infringed, a given author is not likely to be more important than another in a 300,000-book corpus, let alone a larger one that may be 1 million or more titles.254 Although some research or services may need access to the full book, there are situations where using less than a full book would work quite well. That possibility presents another option that may further protect against copyright output issues.
Because certain attacks on LLMs yield troubling outputs of full chapters or other substantial amounts of a work, researchers may forgo using entire books and instead use smaller, diced up parts of books to see whether their model has learned language.255 In this case, the repository might charge $0.01 per page. Ironically, the pressure on outputs and desire to limit memorization issues may lead to fewer parts of books being used and so fewer royalties, if any, going forward. Nonetheless, limiting full copies of a text addresses some problems with outputs and lowers the cost of research.
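For concreteness, the fee arithmetic under the illustrative rates above; the corpus size, per-book and per-page rates, and the even split are the hypotheticals discussed in this Part, not settled figures.

```python
# Illustrative fee arithmetic for the proposed repository,
# using the hypothetical rates discussed above.
corpus_size = 300_000          # books in a hypothetical research corpus
per_book_rate = 0.10           # dollars per book
access_fee = corpus_size * per_book_rate        # $30,000

host_share = access_fee * 0.5                   # to the libraries and the host
author_pool = access_fee * 0.5                  # to the Copyright Office pool

pages_used = 150                                # hypothetical diced-up excerpt
per_page_rate = 0.01
partial_fee = pages_used * per_page_rate        # $1.50 for the excerpt

print(access_fee, host_share, author_pool, partial_fee)
```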
Using libraries' books that are part of the GBPC as a shared repository for research has added advantages. Given that security of the book data is an issue, a system that gives the book data to one entity without protecting the data is undesirable. Courts have looked favorably on Google's security for its book data.256 Google has little interest in reducing that security and is likely better placed than most entities to keep the data secure simply by keeping its practices intact. Another advantage is that one does not have to give a researcher the underlying data to use the GBPC to train the model.
Although one might think a researcher needs a copy of the data to train their model, there are ways to let the data stay at a cloud-based service, which can provide all the security benefits while maintaining the quality of training and research. As we have offered the GBPC as a possible repository, we will stay with Google as an example of a cloud-based service. There are two components needed to train a model: the code to create and train a model, and the training data. Technically, one can send the code to construct and train a model on Google servers, where book corpus data already reside. While the model trains, the external developer pays for access to the data and for the cost of using the computing resources. In this way, Google's book data never need to leave Google's computers, reducing the risk of being copied. Once complete, the model can be downloaded, or even hosted permanently on Google's servers through APIs, like the way OpenAI hosts ChatGPT, GPT-3, and GPT-4. Such a service could be set up to allow external developers to specify subsets of the book corpus. For example, a developer could use 100,000 Yiddish titles, which would establish the payment rate. The service could also be configured to allow external developers to upload and mix in their own data. If there is an issue with the model, they can correct the code or change the data and retrain the model. This alternative is similar to how Google's current AutoML service already operates: AutoML allows users to specify the kind of model they want and trains it completely within the confines of Google Cloud.257 It is also like Google Colaboratory, which allows users to write arbitrary code and run it on Google's unused infrastructure. The one difference is that the underlying training data would remain secret and could not be downloaded.258
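A schematic sketch of the "code travels to the data" idea may help; every function name, parameter, and field below is hypothetical and stands in for whatever interface a host such as Google might actually expose.

```python
# Schematic sketch of the "code travels to the data" idea described above.
# Every identifier here is hypothetical; this is not a real Google or GBPC API.

def submit_training_job(training_script, corpus_filter, developer_data=None):
    """Hypothetical host-side entry point: the developer's training code runs
    where the book data live, and the raw text never leaves the host."""
    return {
        "script": training_script,        # developer-supplied training code
        "corpus": corpus_filter,          # e.g., a subset of the repository
        "extra_data": developer_data,     # optional data the developer mixes in
        "export_raw_text": False,         # the underlying books are never exported
        "deliverable": "trained weights or a hosted inference API",
    }

job = submit_training_job(
    training_script="train_lm.py",
    corpus_filter={"language": "Yiddish", "max_titles": 100_000},
)
print(job)
```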
Of course, the law cannot mandate that libraries and Google offer such a service. Yet, insofar as libraries have the underlying books, the libraries may be able to convince Google to aid in that offering. Congress could also pass a law with the mandated fees we describe above. Such a law might convince Google, as a sign of true ethical commitment, to manage the service as a public-private partnership. Libraries could also see if any competitor of Google wanted to host such a service.259 As this option may not come to fruition, we offer a simpler one.
Given that LLMs will influence how people write and work in general and that LLMs need good inputs going forward, there could be a royalty on book sales analogous to the one in the Audio Home Recording Act (AHRA) for digital media. The music industry feared digital home recording media, such as Digital Audio Tape (DAT) and recordable Compact Discs, would displace income from sales of vinyl and audio cassettes because people could buy the analog copy and then make digital ones with almost no loss in sound quality.260 To address this possibility, a royalty was placed on recordable media such as digital audio tapes and recordable CDs.261 This fee goes to the Copyright Office, which then works with music publishers to distribute the royalties.262 By analogy, part of an argument against allowing LLMs to use books is that LLMs may displace the incentives and incomes for authors and undermine the market for the high-quality writing that LLMs need and society wants.263 Although the reality of many IP markets, especially books, is that they are winner-take-all markets, where only a small number of authors earn enough to make a living,264 the AHRA approach offers an acknowledgment that writing matters. With around 767 million print books sold in 2023265 and 191 million ebooks sold in 2020,266 even a $0.50 per unit fee would generate about $479 million a year, and a $1.00 fee would generate about $958 million a year. This fee, combined with better control over outputs, would acknowledge the importance of authorship even as authors start to embrace using LLMs to aid in writing their new works. Because access to books for research would still be cost-prohibitive in this option, the law would have to allow researchers to use books, even those in shadow libraries, under fair use (assuming the outputs do not infringe copyright).267
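The revenue estimate is simple arithmetic; the sketch below uses the reported unit-sales figures cited above.

```python
# Rough revenue from a per-unit fee on book sales,
# using the sales figures cited in the text.
print_units = 767_000_000   # print books sold in 2023 (reported)
ebook_units = 191_000_000   # ebooks sold in 2020 (reported)
total_units = print_units + ebook_units   # 958 million units

for fee in (0.50, 1.00):
    print(f"${fee:.2f} per unit -> about ${total_units * fee / 1e6:,.0f} million per year")
```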
B. Maintaining An Open Internet
Even if we solve access to book data, access to Internet text via CommonCrawl and other sources is vital for future LLM research. Internet data is larger than book data, likely covers a broader range of information and views, and is up to date regarding recent events. It also reflects, for better or for worse, how people understand history or a particular issue at a given time. Moves to close off access to data, employ restrictive terms of service, and/or use signals in code indicating "no training," "no CommonCrawl," or other limitations on access or use threaten our ability to have necessary data. These limitations threaten to change the Open Internet into the sort of permission-driven situation that the early Internet sought to avoid.
Although there are research possibilities with older Internet data, the possibilities are limited and decay over time. If one wants to build an LLM model for pure language emulation, CommonCrawl from before the release of ChatGPT and its public API should be sufficient data.268 As long as CommonCrawl pre-ChatGPT API is available, NLP research should be viable. It may lose out on some evolution of dialects or sub-dialects that emerge in the future, but the dataset should be sufficient for true NLP research, as opposed to research aimed at search and other commercial knowledge products, for the near future. Of course, as language and knowledge evolve, a static dataset from 2022 will become stale and less useful. As a simple example, if the data collection were cut off before a major event, language on the Internet from before that event simply could not capture how we talk about that event. For instance, imagine a dataset cut off before the summer when George Floyd was killed. A host of things, such as Black Lives Matter, antiracism, and being woke, came into our vocabulary. No matter your view of these words, one cannot deny they added nuance to the language that was previously unseen, and sparked counterviews and related nuances to language, too. To use another example, if the dataset were cut off before former President Trump's first run for office-regardless of one's views on the politics of his time in office-the algorithm would miss out on all the ways language changed because of that moment in history. The issue is not about whether someone likes or dislikes these events and how people speak of them. The issue is ensuring an LLM can reflect reality in its dataset. In that sense, this problem is more acute if the goal is to answer questions and stay up to date with results.
Once one moves beyond language emulation to something closer to a machine that can answer all manner of questions, access to current Internet data is vital.269 CommonCrawl and similar datasets become not only language tools but also knowledge sources. Just as cases have held that search is a transformative use of copyrighted content,270 courts should hold that LLMs and other AI software may use such content on the Web when the use falls under fair use. The problem today is that thin caselaw and changing Internet etiquette and ethos cloud the issue.
If courts hold that ignoring restrictive terms of service or coded signals such as robots.txt, no crawl, and no training is prohibited, the nature of how we navigate the Internet and conduct research in AI will be changed for the worse. Legal scholars have demurred on whether honoring robots.txt should be part of fair use analysis,271 noted that "disregard of robots.txt and similar mechanisms" could cut against the fourth fair use factor,272 and suggested that "respect for technological and contractual opt-outs" should matter more for commercial than noncommercial research.273 We think a stronger case is possible. Unlike books, Internet data is often freely accessible. Drawing on the fair use doctrine, using such data without permission is therefore much closer to, if not the same as, 2 Live Crew's use of Roy Orbison's lyrics and music to create their song.274 Furthermore, if one believes people should have access to shadow libraries, which encompass work not intended for the open Internet, one should also believe people should have access to a class of copyrighted material that is, by design, open to access and interaction, regardless of signals such as "no CommonCrawl" and "robots.txt." Finally, even if something is behind a paywall, if the person accessing the data paid for it and then used the data for data mining, software training, etc., this use tracks with the reverse engineering cases. Thus, it should be allowed as fair use.
VI. CONCLUSION
Copyright and computer science continue to clash, but they can also coexist. The advent of new technologies, such as digitization of visual and aural creations, sharing technologies, search engines, social media offerings, and more, challenge copyright-based industries and reopen questions about the reach of copyright law. Breakthroughs in artificial intelligence research, especially Large Language Models that leverage copyrighted material as part of training models, are the latest examples of the ongoing tension between copyright and computer science. The exuberance, rush-to-market, and edge problem cases created by a few misguided companies now raise challenges to core legal doctrines and may shift Open Internet practices for the worse. That result does not have to be-and should not be-the outcome. Solutions require changes from computer scientists and the copyright industry.
Computer scientists, especially those working on AI, must be careful about how their work fits into society. The quest for AGI combined with claims about safe artificial intelligence research can delude researchers into believing that all actions in pursuit of AGI are justifiable. Believing that what is proper for academic research is always proper in commercial settings is another delusion. Abandoning such perspectives and instead asking, "How can we mitigate potential harms?" is a better approach. Thus, this Article presented ways to reduce the amount of a text used, examine whether access to the data was legitimate, test outputs to detect and filter infringements on intellectual property, and other steps to address problems with commercial uses of LLMs. In addition, the Article has noted that we are not at the dawn of the Internet. Instead, an evolution and equilibrium of law balancing copyright and computer science has occurred. A body of case law shows that when companies take serious steps to show respect for IP and invest in efforts to reduce IP harms, courts look favorably upon such steps. In contrast, when companies used society as a playground or lab, society and the law have made it clear that such approaches will not be tolerated. AI-based companies should heed caselaw and norms rather than ignore them. The copyright industry is not, however, angelic either.
The copyright industry must temper its usual instinct and belief that all uses of copyrighted material should be compensated. The industry is probably correct that using hundreds of thousands of unauthorized copies of books is unethical. As we have argued, the scale of use challenges fair use doctrine and indicates such use is not fair. Yet the drive to reject the rule that one can use copyrighted material for non-expressive, non-infringing purposes is misguided. A large problem is that commercial research can afford to buy copies of books, but this solution is not viable for academic research. Furthermore, restricting access to books increases the probability of under-representation and biased data, which can lead to tainted outputs. The same argument applies to Internet material. The recent shift to using signals to prevent web-scraping for training data presents legal and ethical challenges.
As such, this Article has offered several ways to increase access to books and Internet data. For books, we have made the case that libraries that are part of the Google Books Project or HathiTrust could make their repository, which has data in a secure environment, available for research. This method would allow researchers to access and use that data for their research, and it would solve the issue of whether the book corpus was built fairly. The service would charge a fee: half of that fee would go to the Copyright Office to distribute to authors, while the other half would go to the server hosts. As an alternative, we have offered ways to compensate authors through a media subsidy, a method that the music industry supported in the past. In this alternative, the copyright industry would have to concede that access to "shadow libraries" is fair use. Similarly, if the industry wants researchers to pay for a digital copy of the book, the law should change to allow researchers to use tools to circumvent digital rights management software. For Internet material, the law should allow researchers to use openly available material regardless of code or terms of service indicating that such use is not allowed. These changes are needed because, as copyright's history demonstrates, narrowly prescribed uses of copyrighted material will only lead to bizarre clouds on the title to datasets and to ongoing litigation over whether future software built with copyrighted material is authorized or prohibited by contract terms.
In short, this Article has traced the technical realities of AI research, the differences between academic and commercial research, and the ways AI research can improve its practices. This Article also offered technical, compensatory, and legal solutions to the problems of LLMs, and the questions raised about building LLMs. By balancing the interests of the various stakeholders and disregarding overly broad claims from them, this Article demonstrates how all parties can behave better and compromise so that law and policy can re-establish a healthy balance between copyright and computer science.
* Sue and John Staton Professor of Business Law and Ethics, Georgia Institute of Technology, Scheller College of Business, Associate Director, Law, Policy, and Ethics Georgia Tech Machine Learning Center; J.D., Yale Law School; Affiliated Fellow, Yale Law Information Society Project; former Academic Research Counsel, Google, Inc. This Article was supported in part by summer research funding from the Scheller College of Business. We are grateful for the feedback from participants at the Legal Scholars Roundtable on Artificial Intelligence hosted by Emory Law School.
** Professor Georgia Institute of Technology, College of Computing, School of Interactive Computing and Associate Director of the Georgia Tech Machine Learning Center; PhD in Computer Science, North Carolina State University.
The authors thank the Northwestern Journal of Technology and Intellectual Property for its excellent editorial insights and collaborative culture.
1 See, e.g., James Grimmelmann, Copyright for Literate Robots, 101 IOWA L. REV. 657, 657 (2015); Benjamin L. W. Sobel, Artificial Intelligence's Fair Use Crisis, 41 COLUM. J. L. & ARTS 45, 47 (2017); Peter Henderson et al., Foundation Models and Fair Use 1 (Mar. 28, 2023) (arXiv preprint) (available at https://arxiv.org/pdf/2303.15715 [https://perma.cc/2WTR-CMSF]).
2 See, e.g., Aron Kantor, Top 12 OpenAI Competitors & Alternatives Revealed (2024), THE BUSINESS DRIVE (July 21, 2023), https://thebusinessdive.com/openai-competitors#openais-strongest-competitors [https://perma.cc/3CLN-PLCF].
3 Matthew Sag, Fairness and Fair Use in Generative AI, 92 FORDHAM L. REV. 1887, 1900 (2024).
4 See Henderson et al., supra note 1, at 3.
5 See infra Part III.A.2 (explaining the shift to new signals to prevent scraping the Internet and training software based on the data gathered); cf. Pamela Samuelson, Generative AI Meets Copyright, 381 SCI. 158, 159 (2023) (arguing use of "a dataset consisting of 5.85 billion hyperlinks that pair images and text descriptions from the open internet" was likely lawful, given the dataset was made by a European non-profit and protected by EU law on using copyrighted material for text and data mining).
6 See Sara Morrison, What Microsoft Gets by Betting on the Maker of ChatGPT, VOX (Jan. 23, 2023, 2:10 PM), https://www.vox.com/recode/2023/1/23/23567991/microsoft-open-ai-investment-chatgpt [https://perma.cc/63YA-G728].
7 See Mark Riedl, Toward AGI – What Is Missing?, MEDIUM (Aug. 3, 2023), https://mark-riedl.medium.com/toward-agi-what-is-missing-c2f0d878471a [https://perma.cc/HNL6-F6V5]; Sean Illing, Stuart Russell Wrote the Textbook on AI Safety. He Explains How to Keep It from Spiraling out of Control., VOX (Sept. 20, 2023, 7:00 AM), https://www.vox.com/the-gray-area/23873348/stuart-russell-artificial-intelligence-chatgpt-the-gray-area [https://perma.cc/872V-46AZ].
8 Kevin Roose, 'I Think We're Heading Toward the Best World Ever': An Interview with Sam Altman, N.Y. TIMES (Nov. 20, 2023), https://www.nytimes.com/2023/11/20/podcasts/hard-fork-sam-altman-transcript.html [https://perma.cc/C3JE-JGPV].
9 See, e.g., Dorothy Neufeld, Visualizing the Training Costs of AI Models over Time, VISUAL CAPITALIST (June 4, 2024), https://www.visualcapitalist.com/training-costs-of-ai-models-over-time/ [https://perma.cc/SBL2-K2HX] (noting costs to develop LLM models for a range of companies).
10 See, e.g., Anna Washenko, AI Startup Argues Scraping Every Song on the Internet Is 'Fair Use', ENGADGET (Aug. 1, 2024), https://www.engadget.com/ai/ai-startup-argues-scraping-every-song-on-the-internet-is-fair-use-233132459.html [https://perma.cc/MB4J-573X]; Daniel Tencer, As Suno and Udio Admit Training AI with Unlicensed Music, Record Industry Says: 'There's Nothing Fair About Stealing an Artist's Life's Work.', MUSIC BUS. WORLDWIDE (Aug. 5, 2024), https://www.musicbusinessworldwide.com/assunoandudioadmittrainingaiwithunlicensedmusic-recordindustrysaystheresnothingfairaboutstealinganartistslifeswork/ [https://perma.cc/KMS47CBU] (LLMs are also increasingly able to take images as input alongside text to fuel image generators. The technical and legal issues around images and generative systems are, however, different enough to be beyond the scope of this paper. So, too, for sound and music.).
11 Cf., Deven R. Desai & Joshua A. Kroll, Trust but Verify: A Guide to Algorithms and the Law, 31 HARV. J.L. & TECH. 1, 4 (2017); Ian Bogost, The Cathedral of Computation, THE ATLANTIC (Jan. 15, 2015), http://www.theatlantic.com/technology/archive/2015/01/the-cathedral-of-computation/384300/ [https://perma.cc/V468-JVNV] ("The next time you hear someone talking about algorithms, replace the term with 'God' and ask yourself if the meaning changes. Our supposedly algorithmic culture is not a material phenomenon so much as a devotional one.").
12 MELANIE MITCHELL, ARTIFICIAL INTELLIGENCE: A GUIDE FOR THINKING HUMANS 18 (2019).
13 See Exec. Order No. 14110, 88 Fed. Reg. 75191 (Oct. 30, 2023).
14 Ayanna Howard & Deven R. Desai, Taming AI's Can/Should Problem, MIT SLOAN MGMT. REV. (May 18, 2021), https://sloanreview.mit.edu/article/taming-ais-can-should-problem/ [https://perma.cc/9W27-NKDC].
15 See infra Part II.B (discussing increasing costs to build LLM models and the shift to commercial applications for such models).
16 See infra, Part III.A.1; cf. Anupam Chander & Madhavi Sunder, The Romance of the Public Domain, 92 CALIF. L. REV. 1331 (2004) (examining the potential harms from over-stated claims about what should be in the public domain).
17 See MICHAEL HELLER, GRIDLOCK ECONOMY: HOW TOO MUCH OWNERSHIP WRECKS MARKETS, STOPS INNOVATION, AND COSTS LIVES 2 (2008) (detailing the way in which too many and fragmented property rights create an anti-commons or wasteful underuse).
18 We thank Professor Matthew Sag for reminding us that the Hathi Trust offers possibilities similar to the Google Books Project.
19 Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 585 n.18 (1994).
20 STUART RUSSELL & PETER NORVIG, ARTIFICIAL INTELLIGENCE: A MODERN APPROACH 26 (4th ed. 2021) (explaining the importance of "big data" and how it fuels modern AI research).
21 MITCHELL, supra note 12, at 18.
22 Id.
23 John McCarthy, et al., A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence: August 31, 1955, 27 AI MAG. 12 (2006).
24 Id.
25 See, e.g., RUSSELL & NORVIG, supra note 20, at 823–824 (noting that Alan Turing used language as his test for intelligence "because of its universal scope and because language captures so much of intelligent behavior"); accord JARED DIAMOND, THE RISE AND FALL OF THE THIRD CHIMPANZEE 124 (1991) ("After all, it is language that allows us to communicate with each other far more precisely than any animal can. Language enables us to formulate joint plans, to teach one another, and to learn from what others have experienced elsewhere or in the past.").
26 For example, GPT-3 trained on 570GB of text. Tom B. Brown et al., Language Models Are Few-Shot Learners, 33 ADVANCES NEURAL INFO. PROCESSING SYS. 1877 (2020). Another resource for training LLMs, The Pile, is 825GB of text. THE PILE, https://pile.eleuther.ai [perma.cc/6MFX-JQ2K]. In terms of computing, Microsoft built a supercomputer with 10,000 GPUs for OpenAI. See Jennifer Langston, Microsoft Announces New Supercomputer, Lays Out Vision for Future AI Work, MICROSOFT (May 19, 2020), https://news.microsoft.com/source/features/innovation/openai-azure-supercomputer/ [perma.cc/RND2-79DV].
27 MITCHELL, supra note 12, at 180 (emphasis omitted).
28 Id.
29 Id. at 198.
30 Machine learning researchers once used "brittle" to criticize symbolic AI systems. In theory, learning systems would be less brittle, but for years they were not capable enough for brittleness to be a practical concern. Now, ML researchers debate among themselves how to overcome brittleness, where "brittleness implies that the system functions well within some bounds and poorly outside of those bounds." Andrew J. Lohn, Estimating the Brittleness of AI: Safety Integrity Levels and the Need for Testing Out-of-Distribution Performance (Sept. 3, 2020) (preprint) (on file with arXiv), https://arxiv.org/pdf/2009.00802 [https://perma.cc/M56U-MZUF].
31 Id.
32 Id.
33 MITCHELL, supra note 12, at 181 (describing a system that captures language usage well but without "any understanding of the meaning").
34 C.E. Shannon, A Mathematical Theory of Communication, 27 BELL SYS. TECH. J. 379 (1948). Shannon's insights trace to Markov's work on n-grams being used to predict the next letter in a work by Pushkin. DANIEL JURAFSKY & JAMES H. MARTIN, SPEECH AND LANGUAGE PROCESSING: AN INTRODUCTION TO NATURAL LANGUAGE PROCESSING, COMPUTATIONAL LINGUISTICS, AND SPEECH RECOGNITION WITH LANGUAGE MODELS 52 (2024).
35 Dan L. Burk, The Problem of Process in Biotechnology, 43 HOUS. L. REV. 561, 584 (2006).
36 See JURAFSKY & MARTIN, supra note 34, at 34 (explaining n-grams); RUSSELL & NORVIG, supra note 20 (same).
37 See RUSSELL & NORVIG, supra note 20, at 851; Frederick Jelinek, Continuous Speech Recognition by Statistical Methods, 64 PROC. IEEE 532, 532–556 (1976) (introducing the concept of language models as "statistical methods of automatic recognition (transcription) of continuous speech" that provide probabilistic estimates for a given string of words).
38 Jelinek, supra note 37; JURAFSKY & MARTIN, supra note 34, at 53 (noting resurgence of n-gram research).
39 JURAFSKY & MARTIN, supra note 34, at 3–4 (explaining bigrams); RUSSELL & NORVIG, supra note 20, at 826.
40 The table need not be in any particular order. If you imagine a spreadsheet, all the words in the vocabulary would run down the side; these are the words in position one. Across the top you would have all the words again; these are the words that might show up in position two. Each cell would contain a number between 0 and 1 indicating the probability that the word in the column follows the word in the row. For an example of a bigram probability table, see JURAFSKY & MARTIN, supra note 34, at 5–6.
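To make the table concrete, the following minimal Python sketch (the toy corpus, variable names, and resulting probabilities are ours, purely for illustration) builds such a bigram probability table from counts:

    from collections import Counter, defaultdict

    # Toy corpus; in practice the counts would come from a very large text collection.
    corpus = "the front door is red and the back door is blue".split()

    # Count how often each word follows each other word (the bigram counts).
    following = defaultdict(Counter)
    for first, second in zip(corpus, corpus[1:]):
        following[first][second] += 1

    # Convert counts to probabilities: each cell is a number between 0 and 1 giving
    # the probability that the "column" word follows the "row" word.
    bigram_table = {
        first: {second: count / sum(counts.values()) for second, count in counts.items()}
        for first, counts in following.items()
    }

    print(bigram_table["door"])  # {'is': 1.0}; "is" always follows "door" in this toy corpus
    print(bigram_table["the"])   # {'front': 0.5, 'back': 0.5}

A real language model's table differs mainly in scale: the vocabulary is enormous, and the counts come from billions of words of text.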
41 Id. at 2–3 (explaining probability approach to NLP).
42 RUSSELL & NORVIG, supra note 20, at 832 (comparing how well different n-gram models work).
43 Id. (noting limits to n-gram models).
44 Ashish Vaswani et al., Attention Is All You Need, ADVANCES NEURAL INFO. PROCESSING SYS. (2017); see also RUSSELL & NORVIG, supra note 20, at 832–833 (noting the way transformers improve NLP results as compared to n-grams).
45 See infra notes 80–83 and accompanying text (discussing costs for LLM development).
46 JURAFSKY & MARTIN, supra note 34, at 17–23 (explaining tokenizing).
47 Vaswani et al., supra note 44, at 3; see also RUSSELL & NORVIG, supra note 20, at 868–71 (explaining self-attention in the transformer architecture).
48 RUSSELL & NORVIG, supra note 20, at 868–71.
49 See Vaswani et al., supra note 44, at 3 (defining attention); see also RUSSELL & NORVIG, supra note 20, at 868–69 (setting out the mathematical definition of attention).
50 See RUSSELL & NORVIG, supra note 20, at 865–66 (using "the front door is red" as an example of the phrase attention is generating).
51 Id. at 868–71 (explaining self-attention and how it works in a transformer).
52 See Dzmitry Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS (2015) (the first use of attention to convert a learned representation of text into a probability distribution of values between 0 and 1 that are combined to form an importance score); see also Yoon Kim et al., Structured Attention Networks, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS (2017) (explaining that attention is the application of the softmax operation, which converts arbitrary vectors of numbers into a distribution between 0 and 1).
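As a rough illustration of that softmax-based scoring, the following Python sketch computes attention weights from dot products; it omits the scaling and learned projections of a full transformer, and the vectors are made-up stand-ins for learned token representations:

    import math

    def softmax(scores):
        # Normalize arbitrary numbers into values between 0 and 1 that sum to 1.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def attention_weights(query, keys):
        # Raw importance score: the dot product between the query vector and each key vector.
        raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
        return softmax(raw)

    query = [0.9, 0.1, 0.0]                                     # hypothetical three-dimensional vectors
    keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(attention_weights(query, keys))                       # three weights between 0 and 1 that sum to 1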
53 See Sucheng Ren et al., Shunted Self-Attention via Multi-Scale Token Aggregation, PROC. IEEE/CVF CONF. ON COMP. VISION AND PATTERN RECOGNITION, 10853, 10855 (2022) (explaining merging and tokens in the machine vision context).
54 This approach flows from Shannon Information Theory and the idea of surprise. See supra notes 34–35 and accompanying text.
55 Cf. Aparna Elangovan et al., Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation, 16 ASS'N FOR COMPUTATIONAL LINGUISTICS 1325 (2021) ("In the context of machine learning models, effectiveness is typically determined by the model's ability to both memorize and generalize.") (referencing Satrajit Chatterjee, Learning and Memorization, 80 PROC. MACH. LEARNING RSCH. 755 (2018)).
56 Id. ("The ability of a model to generalize relates to how well the model performs when it is applied on data that may be different from the data used to train the model.").
57 See Alan M. Turing, Computing Machinery and Intelligence, LIX MIND 433 (1950).
58 During the editing of this Article, the theoretical point we make, namely that one can reduce how much of a book (or, by extension, a given corpus) is tokenized in order to avoid copyright infringement while still maintaining generalization performance, was tested and appears successful. See Abhimanyu Hans et al., Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs, ARXIV (June 14, 2024), https://doi.org/10.48550/arXiv.2406.10209 [https://perma.cc/99QS-ZLTV] (finding one can randomly exclude data in the tokenization process and thus reduce verbatim and related copyright reproduction issues but maintain generalization performance).
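The following Python sketch conveys the general idea in a highly simplified form, randomly excluding some token positions from the training objective so the model is never pushed to reproduce them verbatim; the cited paper's actual masking rule is more involved, and the sentence, drop rate, and function name here are illustrative only:

    import random

    def random_exclusion_mask(tokens, drop_rate=0.5, seed=0):
        # Return a 0/1 mask over token positions; positions marked 0 are excluded
        # from the training loss. The drop rate here is deliberately high so the toy
        # example visibly excludes tokens; a real system would use a lower rate.
        rng = random.Random(seed)
        return [0 if rng.random() < drop_rate else 1 for _ in tokens]

    tokens = "call me ishmael some years ago never mind how long precisely".split()
    mask = random_exclusion_mask(tokens)
    print(mask)                                          # which positions are kept (1) or dropped (0)
    print([t for t, m in zip(tokens, mask) if m == 1])   # only these positions feed the training loss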
59 Id.
60 Michael Tänzer et al., Memorisation Versus Generalisation in Pre-Trained Language Models, 1 PROC. 60TH ANN. MEETING ASS'N FOR COMPUTATIONAL LINGUISTICS 7564, 7564 (2022) ("[E]xcellent generalisation properties come at the cost of poor performance in few-shot scenarios with extreme class imbalances. Our experiments show that BERT is not able to learn from individual examples and may never predict a particular label until the number of training instances passes a critical threshold.").
61 Cf. Elangovan et al., supra note 55 ("When there is considerable overlap in the training and test data for a task, models that memorize more effectively than they generalize may benefit from the structure of the evaluation data, with their performance inflated relative to models that are more robust in generalization. However, such models may make poor quality predictions outside of the shared task setting.").
62 Id.
63 Vitaly Feldman & Chiyuan Zhang, What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation, 33 ADVANCES NEURAL INFO. PROCESSING SYS. 2881 (2020) (noting "natural image and data distributions are (informally) known to be long-tailed, that is[,] have a significant fraction of rare and atypical examples").
64 This issue may be more acute for image generators. Cf. Matthew Sag, Copyright Safety for Generative AI, 61 HOUS. L. REV. 295 (2023). If one wants to create an image with the Mona Lisa or another iconic image, the system likely needs to have the iconic image memorized to create the new image.
65 See Tänzer et al., supra note 60, at 7564 ("For many applications it is important for the model to generalise-to learn the common patterns in the task while discarding irrelevant noise and outliers. However, rejecting everything that occurs infrequently is not a reliable learning strategy and in many low-resource scenarios memorisation can be crucial to performing well on a task"); Cf. Elangovan et al., supra note 55 ("An effective combination of memorization and generalization can be achieved where a model selectively memorizes only those aspects or features that matter in solving a target objective given an input, allowing it to generalize better and to be less susceptible to noise.").
66 See, e.g., Alec Radford et al., Improving Language Understanding by Generative Pre-Training, OPENAI (June 11, 2018), https://openai.com/index/languageunsupervised/ [https://perma.cc/G8J55RMF] (asserting GPT-1 performance on "natural language inference, question answering, semantic similarity, and text classification . . . outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.").
67 Id. ("For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on the recently introduced GLUE multi-task benchmark [64].").
68 Most LLMs have been fine-tuned to prefer to try to answer a prompt as if it is a question or a request for information. This is called "instruction tuning". Shengyu Zhang et al., Instruction Tuning for Large Language Models: A Survey, ARXIV (Mar. 14, 2024), https://arxiv.org/abs/2308.10792 [https://perma.cc/UKJ3-JNDP].
69 Even the founder of OpenAI, the organization behind ChatGPT, says that the term has no standard definition and, even if it did, ChatGPT is nowhere near a superhuman AGI level. See supra note 8.
70 See supra note 6.
71 Id.
72 Matteo Wong, The AI Search War Has Begun, THE ATLANTIC (July 30, 2024), https://www.theatlantic.com/technology/archive/2024/07/perplexity-ai-search-media-partners/679294/ [https://perma.cc/489F-HPJJ] (describing "AI-powered search bar").
73 Kylie Robison, OpenAI Announces SearchGPT, Its AI-Powered Search Engine, THE VERGE (July 25, 2024 1:00 PM), https://www.theverge.com/2024/7/25/24205701/openai-searchgpt-ai-search-engine-google-perplexity-rival [https://perma.cc/U9FM-TYKY].
74 Rebecca Bellan, Perplexity Details Plan to Share Ad Revenue with Outlets Cited by Its AI Chatbot, TECHCRUNCH (July 30, 2024), https://techcrunch.com/2024/07/30/perplexitys-plan-to-share-ad-revenue-with-outlets-cited-by-its-ai-chatbot/ [https://perma.cc/7HXJ-TS9L].
75 See Radford et al., supra note 66, at 4; Alec Radford et al., Language Models Are Unsupervised Multitask Learners, OPENAI (2019), https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf [https://perma.cc/3773-9AWH] ("The smallest model [summarized in Table 2 with 117M parameters] is equivalent to the original GPT.").
76 GPT-4 is reported to have 1.5 trillion parameters. Chude Emmanuel, GPT-3.5 and GPT-4 Comparison: Exploring the Developments in AI-Language Models, MEDIUM (Aug. 3, 2023), https://medium.com/@chudeemmanuel3/gpt-3-5-and-gpt-4-comparison-47d837de2226 [https://perma.cc/547W-Z8ZL] ("With 1.5 trillion parameters compared to GPT-3.5's 175 billion, GPT-4 clearly outperforms GPT-3.5 in terms of model size and parameters."). When GPT-1 was released in June 2018, it had 117 million parameters. GPT-2, released in February 2019, had 1.5 billion parameters trained on 8 million pages of web text. GPT-3, released in June 2020, had 175 billion parameters. The number of parameters jumped by more than a factor of ten with each release, which means the scale and complexity of these systems have grown at a tremendous rate. See Bernard Marr, A Short History of ChatGPT: How We Got Where We Are Today, FORBES (May 19, 2023 1:14 AM), https://www.forbes.com/sites/bernardmarr/2023/05/19/a-short-history-of-chatgpt-how-we-got-to-where-we-are-today/ [https://perma.cc/7BH3-3PJL]; see also Better Language Models and Their Implications, OPENAI (Feb. 14, 2019), https://openai.com/index/better-language-models/ [https://perma.cc/EXW2-GNWV].
77 Transformers are the core technology behind recent breakthroughs in LLMs. See Vaswani et al., supra note 44 and accompanying text.
78 For example, within NLP there are different, specific tasks with specific tests to see whether progress has been made. See, e.g., Radford et al., supra note 66, at 2 (asserting GPT-1 performance on "natural language inference, question answering, semantic similarity, and text classification . . . outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), 1.5% on textual entailment (MultiNLI) and 5.5% on the recently introduced GLUE multi-task benchmark.").
79 This idea flows from Alan Turing's work. See Alan M. Turing, Computing Machinery and Intelligence, 59 MIND 433 (1950).
80 Guido Appenzeller et al., Navigating the High Cost of AI Compute, ANDREESSEN HOROWITZ (Apr. 27, 2023), https://a16z.com/navigating-the-high-cost-of-ai-compute/ [https://perma.cc/D2TE-GU2B] (estimating $560,000 for a single training run and noting "[m]ultiple runs will likely be required").
81 Id.
82 Cf. id. ("Training top-of-the-line models remains expensive, but within reach of a well-funded start-up. . . . [G]enerative AI requires massive investments in AI infrastructure today. There is no reason to believe that this will change in the near future. Training a model like GPT-3 is one of the most computationally intensive tasks mankind has ever undertaken."); see also Neufeld, supra note 9 (noting the foundational model, the transformer, cost $930 in 2017; GPT-3, $4.3 million in 2020; GPT-4, $78.4 million in 2023; and Gemini Ultra, $191.4 million in 2023).
83 Stephen Nellis, Microsoft to Invest $1 Billion in OpenAI, REUTERS (July 22, 2019 8:10 AM), https://www.reuters.com/article/us-microsoft-openai/microsoft-to-invest-1-billion-in-openai-idUSKCN1UH1H9/ [https://perma.cc/9J5M-QNYA].
84 See Marr, supra note 76; see also supra text accompanying note 76.
85 Radford et al., supra note 75.
86 Hasan Chowdhury, ChatGPT Cost a Fortune to Make with OpenAI's Losses Growing to $540 Million Last Year, Report Says, BUS. INSIDER (May 5, 2023 7:51 AM), https://www.businessinsider.com/openai-2022-losses-hit-540-million-as-chatgpt-costs-soared-2023-5 [https://perma.cc/8SMY-KYDV].
87 Nellis, supra note 83.
88 See Jonathan Vanian, OpenAI Will Need More Capital than Any Non-Profit Has Ever Raised, FORTUNE (Oct. 3, 2019 4:12 PM), https://fortune.com/2019/10/03/openai-will-need-more-capital-than-any-non-profit-has-ever-raised/ [https://perma.cc/DAJ7-WFLA] ("OpenAI CTO Brockman said that the group's plan to develop artificial general intelligence (AGI) is 'actually a really expensive endeavor' because of the enormous amount of computing resources required.").
89 Jordan Novet, Microsoft's $13 Billion Bet on OpenAI Carries Huge Potential Along with Plenty of Uncertainty, CNBC (Apr. 8, 2023 9:00 AM), https://www.cnbc.com/2023/04/08/microsofts-complex-bet-on-openai-brings-potential-and-uncertainty.html [https://perma.cc/PV7F-LBCE].
90 OpenAI Charter, OPENAI, https://openai.com/charter [https://perma.cc/NXS2-QMAW].
91 See Roose, supra note 8.
92 See Laura Bratton, OpenAI's Revenue Is Skyrocketing, QUARTZ (June 13, 2024), https://qz.com/sam-altman-openai-annualized-revenue-triples-skyrockets-1851538234 [https://perma.cc/PJS4-T8NS].
93 Harshita Mary Varghese, Anthropic Forecasts More than $850 Mln in Annualized Revenue Rate by 2024-End-Report, REUTERS (Dec. 26, 2023), https://www.reuters.com/technology/anthropic-forecasts-more-than-850-mln-annualized-revenue-rate-by-2024-end-report-2023-12-26/ [https://perma.cc/G5RC-5FDP] (noting investments by Amazon and Google and increased revenue projections for 2024).
94 Appenzeller et al., supra note 80 (noting extreme expense to train models but that such costs are manageable for a "well-funded start-up").
95 See, e.g., Winston Cho, New AI Lawsuit from Authors Against Anthropic Targets Growing Licensing Market for Copyrighted Content, THE HOLLYWOOD REP., (Aug. 20, 2024 12:23 PM), https://www.hollywoodreporter.com/business/business-news/new-lawsuit-authors-against-anthropic-targets-growing-licensing-market-copyrighted-content-1235979412/ [https://perma.cc/662T-6P4W] (reporting about a lawsuit against Anthropic, arguing that the company should have licensed book corpus material rather than using unauthorized material).
96 See Mark Graham, Robots.txt Meant for Search Engines Don't Work Well for Web Archives, INTERNET ARCHIVE BLOGS (Apr. 17, 2017), https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ [https://perma.cc/3LJD-JBNJ]; Brad Jones, Internet Archive Will Ignore Robots.txt Files to Keep Historical Record Accurate, DIGITALTRENDS (Apr. 24, 2017), https://www.digitaltrends.com/computing/internet-archive-robots-txt/ [https://perma.cc/86RA-Q5JD].
97 See Sag, supra note 3, at 1917.
98 See id.
99 See David Pierce, The Text File That Runs the Internet, THE VERGE (Feb. 14, 2024 8:00 AM), https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders [https://perma.cc/GFD7-CYG2].
100 Id.
101 See supra notes 37–45 and accompanying text.
102 See, e.g., Winston Cho, Authors Guild Exploring Blanket License for Artificial Intelligence Companies, THE HOLLYWOOD REP. (Jan. 11, 2024 12:08 AM), https://www.hollywoodreporter.com/business/business-news/authors-guild-exploring-blanket-license-artificial-intelligence-companies-1235785941/ [perma.cc/V6D9-4HLY] (describing the Authors Guild's desire to create an opt-in licensing system for using books to train models).
103 See generally Authors Guild, Inc., v. HathiTrust, 755 F.3d 87 (2d Cir. 2014); Authors Guild v. Google, Inc., 804 F.3d 202, 216–17 (2d Cir. 2015).
104 See supra note 64.
105 See, e.g., Henderson et al., supra note 1, at 2 (distinguishing between "training a machine learning model on copyrighted data" as likely being fair use and "training and deploying" models for generative outputs that are similar to the underlying data, which is less likely to be fair use).
106 Cornell Library, Copyright Term and the Public Domain in the United States, CORNELL UNIV. LIBR., https://guides.library.cornell.edu/copyright/publicdomain [perma.cc/KR7Z-7UTW].
107 Id.
108 As scholars noted years ago, figuring out whether a book is under copyright is part of the orphan works issue which occurs when copyright holders "cannot be located by a reasonably diligent search." See generally David R. Hansen et al., Solving the Orphan Works Problem for the United States, 37 COLUM. J.L. & ARTS 1, 3 (2013); Jennifer M. Urban, How Fair Use Can Help Solve the Orphan Works Problem, 27 BERKELEY TECH. L.J. 1379, 1382 (2012) (noting "no feasible way" to obtain permission to archive orphan works).
109 Cf. Exec. Order No. 14110, 88 Fed. Reg. (Oct. 30, 2023) (explaining "bias" and "discrimination" have occurred in AI systems and calling for prevention of future potential harm); Deven R. Desai et al., Using Algorithms to Tame Discrimination: A Path to Diversity, Equity, and Inclusion, 56 U.C. DAVIS L. REV. 1703, 1714 n.50, 1716 (2022) (examining claims that data necessarily have disparate impact or under-represents a group).
110 Cf. Amanda Levendowski, How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem, 93 WASH. L. REV. 579, 614–616 (2018) (discussing limits of public domain works for training as an issue of bias); accord Sag, supra note 64.
111 Cf. Danielle Allen, We've Lost Our Way on Campus. Here's How We Can Find Our Way Back, WASH. POST (Dec. 13, 2023), https://www.washingtonpost.com/opinions/2023/12/10/antisemitism-campus-culture-harvard-penn-mit-hearing-path-forward/ [perma.cc/E8SS-4M2E] ("By diversity, we mean simply social heterogeneity, the idea that a given community has a membership deriving from plural backgrounds, experiences, and identities. Race, ethnicity, gender identity, sexual orientation, socioeconomic background, disability, religion, political outlook, nationality, citizenship, and other forms of formal status have all been among the backgrounds, experiences, and identities to which the Task Force has given special attention, but we have also attended to issues of language, differences in prior educational background, veteran status, and even differences in research methodologies and styles.").
112 This problem is inherent to data of all kinds. You must include individuals' data to respect their individuality. FRANK WEBSTER, THEORIES OF THE INFORMATION SOCIETY 208 (1995) ("If we [as a society] are going to respect and support the individuality of members, then a requisite may be that we know a great deal about them."); accord Deven R. Desai, Exploration and Exploitation: An Essay on (Machine) Learning, Algorithms, and Information Provision, 47 LOY. U. CHI. L.J. 541, 577 (2015).
113 See Laura Garwin & Tim Lincoln, Chronology of Twentieth-Century Science, in A CENTURY OF NATURE: TWENTY-ONE DISCOVERIES THAT CHANGED SCIENCE AND THE WORLD xv, xv–xviii (Laura Garwin and Tim Lincoln eds., Univ. Chi. Press) (2003).
114 Sag, supra note 3.
115 Id.
116 Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 585 n.18 (1994).
117 Id.; accord Castle Rock Ent., Inc., v. Carol Pub. Grp., Inc., 150 F.3d 132, 146 (2d Cir. 1998); Wright v. Warner Books, Inc., 953 F.2d 731, 737 (2d Cir. 1991) ("lack of permission is 'beside the point'").
118 See infra Part III.A.3 (discussing why a world in which copyright holders dictate uses for software is undesirable, as it would create unending problems for future software development and use).
119 Wright, 953 F.2d, at 737.
120 Harper & Row Publishers, Inc. v. Nation Enters., 471 U.S. 539 (1985).
121 Id. at 563.
122 Authors Guild, Inc., v. Google Inc., 954 F. Supp. 2d 282, 203 (S.D.N.Y. 2013) (emphasis added).
123 See, e.g., Pamela Samuelson et al., The FTC's Misguided Comments on Copyright Office Generative AI Questions, PATENTLYO (Dec. 11, 2023), https://patentlyo.com/patent/2023/12/misguided-copyright-generative.html [https://perma.cc/F2RY-VGBG]. But see Pamela Samuelson, Generative AI Meets Copyright, 381 SCI. 158, 159 (2023) (noting the propriety of using data from the "open Internet," which suggests lawful access).
124 Sega Enters. Ltd. v. Accolade, Inc., 977 F.2d 1510, 1514 (9th Cir. 1992).
125 Sony Comput. Ent., Inc. v. Connectix Corp., 203 F.3d 596, 601 (9th Cir. 2000).
126 A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009).
127 A.V. v. iParadigms Lab. Co., 544 F. Supp. 2d 473, 478–480 (E.D. Va. 2008) (describing the nature of the gathering of the data and the contract authorizing use of the data).
128 Castle Rock Ent. v. Carol Pub. Grp., Inc., 955 F. Supp. 260, 262 (S.D.N.Y. 1997) (detailing ways defendant built trivia list for the T.V. show, Seinfeld, including watching broadcasts and videotape recordings).
129 NXIVM Corp. v. Ross Inst., 364 F.3d 471, 478–480 (2d Cir. 2004).
130 Id.
131 Michelle Cheng, "Shadow Libraries" Are at the Heart of the Mounting Copyright Lawsuits Against OpenAI, QUARTZ (July 10, 2023), https://qz.com/shadow-libraries-are-at-the-heart-of-the-mounting-cop-1850621671 [https://perma.cc/JBG7-5LWQ].
132 Grimmelmann, supra note 1, at 674–675 (noting an approach that "ignores robots" encourages copying to feed the robots, including in an "underground economy").
133 James Somers, Torching the Modern-Day Library of Alexandria, THE ATLANTIC (Apr. 20, 2017), https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/ [https://perma.cc/4JUY-FX94].
134 Id.
135 See SLJ Staff, SLJ's Average Book Prices, SLJ (Mar. 1, 2023), https://www.slj.com/story/slj-average-book-prices-2022-follett-baker-taylor [https://perma.cc/XP6W-YEQY] (showing range of new book prices for libraries between 2021 and 2022). By extension, after several lawsuits were filed, OpenAI began negotiating with news publishers about payments for access to their material. See Shirin Ghaffary, OpenAI's COO Pitches Startup as Friendly Partner to Publishers, BLOOMBERG (Jan. 11, 2024 4:18 PM), https://www.bloomberg.com/news/newsletters/2024-01-11/openai-s-coo-pitches-startup-as-friendly-partner-to-publishers [https://perma.cc/DT7E-GS66] (noting closed deals with the Associated Press and Axel Springer and ongoing negotiations with other publishers).
136 Appenzeller et al., supra note 80.
137 Authors Guild, Inc., v. HathiTrust, 755 F.3d 87, 97–98 (2d Cir. 2014); Authors Guild v. Google, Inc., 804 F.3d 202, 216–17 (2d Cir. 2015).
138 See, e.g., Alexandra Bruell, New York Times to Bezos-Backed AI Startup: Stop Using Our Stuff, WALL ST. J. (Oct. 15, 2024 9:21 AM), https://www.wsj.com/business/media/new-york-times-to-bezos-backed-ai-startup-stop-using-our-stuff-20faf2eb [https://perma.cc/E8XR-ZV7S] (noting a new lawsuit against an AI company and the trend towards demanding AI companies stop using content or pay for it).
139 Cf. Sag, supra note 3, at 1918 (noting the possibility that no one is willing to sell access to their books and so turning to shadow libraries is an understandable option).
140 Id. Professor Sag asserts "prohibiting academic research on illegal text corpora will generally not benefit copyright owners nor further the interests copyright law is designed to promote." Although we tend to agree with his point, the question remains unresolved.
141 See, e.g., Henderson, et al., supra note 1, at 2 ("Under United States ('U.S.') law, copyright for a piece of creative work is assigned 'the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device' (U.S. Copyright Office, 2022). The breadth of copyright protection means that most of the data that are used for training the current generation of foundation models is copyrighted material.").
142 Brown et al., supra note 26.
143 See, e.g., Rebecca Bellan, News Outlets Are Accusing Perplexity of Plagiarism and Unethical Web Scraping, TECHCRUNCH (July 2, 2024, 8:00 AM), https://techcrunch.com/2024/07/02/news-outlets-are-accusing-perplexity-of-plagiarism-and-unethical-web-scraping/ [https://perma.cc/5PNB-3V5K] (explaining web scraping as automated crawlers often used by search engines so that websites can be "included in search results"); cf. Pamela Samuelson, Generative AI Meets Copyright, 381 SCI. 158, 159 (2023) (arguing use of "a dataset consisting of 5.85 billion hyperlinks that pair images and text descriptions from the open internet" is likely lawful given the dataset was made by a European non-profit and protected by EU law on using copyrighted material for text and data mining).
144 As discussed below, whether ignoring robots.txt or other signals violates the law is an evolving question. The point here is that the custom has relied on voluntary compliance for the most part. See, e.g., Kali Hays, A New Web Crawler Launched by Meta Last Month Is Quietly Scraping the Internet for AI Training Data, YAHOO! FINANCE: FORTUNE (Aug. 20, 2024), https://finance.yahoo.com/news/crawler-launched-meta-last-month-225937771.html [https://perma.cc/6JRR-FHPF] (noting obeying robots.txt is "not enforceable or legally binding in any way").
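To make the voluntary nature of the custom concrete, the following Python sketch uses the standard library's robots.txt parser to ask whether a hypothetical crawler may fetch a hypothetical page; the parser only reports what the file requests, and nothing in the protocol technically stops a crawler that chooses to ignore the answer:

    import urllib.robotparser

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")  # hypothetical site
    parser.read()

    # A well-behaved crawler consults the file before fetching; compliance is a choice.
    if parser.can_fetch("ExampleTrainingBot", "https://www.example.com/articles/1"):
        print("robots.txt permits this crawler to fetch the page")
    else:
        print("robots.txt asks this crawler not to fetch the page")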
145 Brad Jones, Internet Archive Will Ignore Robots.txt Files to Keep Historical Record Accurate, DIGIT. TRENDS (Apr. 24, 2017), https://www.digitaltrends.com/computing/internet-archive-robots-txt/ [https://perma.cc/BPX8-HWKT].
146 Frequently Asked Questions, COMMON CRAWL, https://commoncrawl.org/faq [https://perma.cc/C3XD-252H].
147 Thom Vaughan, November/December 2023 Crawl Archive Now Available, COMMON CRAWL (Dec. 15, 2023), https://commoncrawl.org/blog/november-december-2023-crawl-archive-now-available [https://perma.cc/23D5-JCPY].
148 Ariel Bogle, New York Times, CNN and Australia's ABC Block OpenAI's GPTBot Web Crawler from Accessing Content, GUARDIAN (Aug. 24, 2023, 8:31 PM), https://www.theguardian.com/technology/2023/aug/25/new-york-times-cnn-and-abc-block-openais-gptbot-web-crawler-from-scraping-content [https://perma.cc/YCS7-2J3P].
149 Terms of Service, N.Y. TIMES (May 10, 2024), https://help.nytimes.com/hc/en-us/articles/115014893428-Terms-of-Service [https://perma.cc/9JX4-52ZY] (Section 4.1(3) states that, without NYT's prior written consent, you shall not "use the Content for the development of any software program, . . . including, but not limited to, training . . . a machine learning or artificial intelligence (AI) system . . . .").
150 See Bogle, supra note 148.
151 Emilia David, The BBC Is Blocking OpenAI Data Scraping but is Open to AI-Powered Journalism, THE VERGE (Oct. 6, 2023, 3:16 PM), https://www.theverge.com/2023/10/6/23906645/bbc-generative-ai-news-openai [https://perma.cc/W4XN-3NAQ].
152 For example, a large U.S. law firm used its material to create an internal Chatbot to aid in finding legal documents, drafting material, and knowledge discovery. See Bob Ambrogi, Four Months after Launching Its 'Homegrown' GenAI Tool, Law Firm Gunderson Dettmer Reports on Results so Far, New Features, and a Surprise on Cost, LAWSITES (Dec. 20, 2023), https://www.lawnext.com/2023/12/four-months-after-launching-its-homegrown-genai-tool-law-firm-gunderson-dettmer-reports-on-results-so-far-new-features-and-a-surprise-on-cost.html [https://perma.cc/WK58-E9VA].
153 See Kyle Orland, How the New York Times Is Using Generative AI as a Reporting Tool, ARS TECHNICA (Oct. 30, 2024, 1:19 PM), https://arstechnica.com/ai/2024/10/the-new-york-times-shows-how-ai-can-aid-reporters-without-replacing-them/ [https://perma.cc/FVW3-LN38].
154 See Alexandra Berzon et al., Inside the Movement Behind Trump's Election Lies, N.Y. TIMES (Oct. 28, 2024), https://www.nytimes.com/interactive/2024/10/28/us/politics/inside-the-movement-behind-trumps-election-lies.html [https://perma.cc/XZG9-QKVQ].
155 Id.
156 Id.
157 Id.
158 Id.
159 See, e.g., James Bennet, When the New York Times Lost Its Way, The ECONOMIST (Dec. 14, 2023), https://www.economist.com/1843/2023/12/14/when-the-new-york-times-lost-its-way [https://perma.cc/7AU5-HVZC].
160 See N.Y. TIMES, supra, note 149, § 4.1(1).
161 Id. § 4.1(2).
162 Id. § 4.1(3).
163 Id. § 4.1(5).
164 As discussed below, a data repository may allow testing, but that presumes such a repository exists. If you scrape or assemble data from one or several sources, you are effectively the repository of those data and so need your own copy.
165 The Upshot, N.Y. TIMES, https://www.nytimes.com/newsletters/upshot [https://perma.cc/74PS-DJ9T]; Natalie Gil, New York Times Launches Data Journalism Site The Upshot, GUARDIAN (April 22, 2014), https://www.theguardian.com/media/2014/apr/22/new-york-times-launches-data-journalism-site-the-upshot [https://perma.cc/4LZJ-ZPRP].
166 See Gil, supra note 165.
167 The Upshot Staff, 10 Data Points and Documents that Made Us [ponder emoji] in 2023, N.Y. TIMES (Dec. 30, 2023), https://www.nytimes.com/interactive/2023/12/30/upshot/2023-year-in-review.html [https://perma.cc/7PQ6-S5TR].
168 Id.
169 Nate Cohn, The 6 Kinds of Republican Voters, N.Y. TIMES (Aug. 17, 2023), https://www.nytimes.com/interactive/2023/08/17/upshot/six-kinds-of-republican-voters.html [https://perma.cc/5ZHS-HLG5].
170 Id.
171 Cf. Vladimir Estivill-Castro, Why So Many Clustering Algorithms: A Position Paper, 4 ACM SIGKDD EXPLORATIONS NEWSLETTER 65, 65 (2002) ("Clustering is a central task for which many algorithms have been proposed.").
172 See Cohn, supra note 169.
173 Id.
174 Alicia Parlapiano, Just How Formulaic Are Hallmark and Lifetime Holiday Movies? We (Over)analyzed 424 of Them, N.Y. TIMES (Dec. 22, 2023), https://www.nytimes.com/interactive/2023/12/23/upshot/hallmark-lifetime-christmas.html [https://perma.cc/L38B-L79H].
175 IMDb Conditions of Use, IMDB, https://www.imdb.com/conditions [https://perma.cc/2QPM-MJWJ] ("The IMDb Services or any portion of such services may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of IMDb.").
176 See Parlapiano, supra note 174.
177 AWS MARKETPLACE, https://aws.amazon.com/marketplace/search/results?FULFILLMENT_OPTION_TYPE=DATA_EXCHANGE&CREATOR=0af153a3339f48c28b423b9fa26d3367&DATA_AVAILABLE_THROUGH=API_GATEWAY_APIS&filters=FULFILLMENT_OPTION_TYPE%2CCREATOR%2CDATA_AVAILABLE_THROUGH [https://perma.cc/JZ54UAT4] (showing prices to access IMDb datasets).
178 Cf. Cohn, supra note 169.
179 Cf. Tasini v. N.Y. Times, 206 F.3d 161, 165 (2d Cir. 2000) (holding that a copyright license is specific and a license that did not include "electronic databases" was violated when publishers put the material into such databases).
180 Josef Adalian, Why Are These Classic Shows Nowhere to Be Found on Streaming?, VULTURE (Nov. 18, 2016), https://www.vulture.com/2016/11/why-cant-these-shows-be-found-on-streaming.html [https://perma.cc/6F5F-GYMZ] ("And then there's what's often the biggest obstacle to getting a show on to a streaming platform: music rights.").
181 Id.
182 See, e.g., Zoe G. Phillips, Prince's Music Companies Are "Working to Resolve Matters" Regarding Unreleased Doc Accusing Musician of Abuse, THE HOLLYWOOD REP. (Sept. 9, 2024, 5:18 PM), https://www.hollywoodreporter.com/news/music-news/prince-music-companies-respond-unreleased-doc-accusing-abuse-1235996322/ [https://perma.cc/HNK4-XM5A].
183 See Michael A. Heller, The Tragedy of the Anticommons: Property in the Transition from Marx to Markets, 111 HARV. L. REV. 621, 622–23 (1998); Michael A. Heller & Rebecca S. Eisenberg, Can Patents Deter Innovation? The Anticommons in Biomedical Research, 280 SCI. 698, 698–700 (1998) (explaining the idea of the anticommons in intellectual property).
184 See HELLER, supra note 17, at 3–4, 20.
185 Jerome H. Reichman & Ruth L. Okediji, When Copyright Law and Science Collide: Empowering Digitally Integrated Research Methods on a Global Scale, 96 MINN. L. REV. 1362, 1368 (2012).
186 See, e.g., Michael Nuñez, OpenAI Strikes Content Deal with Condé Nast, Raising Questions About Future of Publishing, VENTUREBEAT (Aug. 20, 2024, 12:20 PM), https://venturebeat.com/ai/openai-strikes-content-deal-conde-nast-future-of-publishing/ [https://perma.cc/2BYF-2S3B]; Bellan, supra note 74; Benj Edwards & Ashley Belanger, Journalists "Deeply Troubled" by OpenAI's Content Deals with Vox, The Atlantic, ARS TECHNICA (May 31, 2024, 4:56 PM), https://arstechnica.com/information-technology/2024/05/openai-content-deals-with-vox-and-the-atlantic-spark-criticism-from-journalists/ [https://perma.cc/SR4A-PTYS].
187 See, e.g., Damon Beres, A Devil's Bargain with OpenAI, THE ATLANTIC (May 29, 2024), https://www.theatlantic.com/technology/archive/2024/05/a-devils-bargain-with-openai/678537/ [https://perma.cc/W6PU-8DKH] (quoting Anna Bross of The Atlantic that the license is not for "syndication" and that display of content must still be within fair use).
188 See, e.g., Cho, supra note 102.
189 FTC, Comment on Artificial Intelligence and Copyright (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8630 [https://perma.cc/BXS6-JVRN].
190 Id. at 5.
191 See, e.g., Beres, supra note 187 (noting the issues around AI as a threat to writers and the way previous deals between technology companies and journalism ended up not benefitting the news outlet); Nuñez, supra note 186; Edwards & Belanger, supra note 186.
192 Kyle Wiggers, AI Training Data Has a Price Tag that Only Big Tech Can Afford, TECH CRUNCH (June 1, 2024, 6:00 AM), https://techcrunch.com/2024/06/01/ai-training-data-has-a-price-tag-that-only-big-tech-can-afford/ [https://perma.cc/3JZJ-3Q2S] (noting the growing need for more data to train LLM models and the multi-billion dollar market emerging to meet that demand).
193 Id.
194 Id.
195 Field v. Google Inc., 412 F. Supp. 2d 1106, 1115–17 (D. Nev. 2006); cf. Samuelson, supra note 5, at 160 (relying on access to Open Internet as part of the argument for fair use when training on Internet images).
196 MidlevelU, Inc. v. ACI Info. Grp., 989 F.3d 1205, 1217–18 (11th Cir. 2021).
197 One can think of the issue as a chilling effect. The mere threat of lawsuits and high damages has a history of preventing action, especially when intellectual property is at stake. See, e.g., Deven R. Desai, Speech, Citizenry, and the Market: A Corporate Public Figure Doctrine, 98 MINN. L. REV. 455, 477–78 (2013) (connecting chilling effects of tort lawsuits in N.Y. Times Co. v. Sullivan to trademark enforcement suits).
198 Cf. Cambridge Univ. Press v. Becker, 863 F. Supp. 2d 1190, 1201 (N.D. Ga. 2012) (publishers sued each member of the Board of Regents of the University System of Georgia as responsible for the university's copyright policies).
199 See, e.g., CellStrat, Real-World Use Cases for Large Language Models (LLMs), MEDIUM (Apr. 25, 2023), https://cellstrat.medium.com/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2 [https://perma.cc/7PB2-8ETK].
200 See McCarthy et al., supra note 23, at 14.
201 Justine Calma, Twitter Just Closed the Book on Academic Research, THE VERGE (May 31, 2023 8:19 AM), https://www.theverge.com/2023/5/31/23739084/twitter-elon-musk-api-policy-chilling-academic-research [https://perma.cc/N7XD-D2V2].
202 Id. ("[E]ven its 'outrageously expensive' enterprise tier, the coalition argued, wasn't enough to conduct some ambitious studies or maintain important tools.").
203 See, e.g., Cho, supra note 102.
204 Arvind Narayanan & Sayash Kapoor, Generative AI's End-Run Around Copyright Won't Be Resolved by the Courts, AI SNAKE OIL (Jan. 22, 2024), https://www.aisnakeoil.com/p/generative-ais-end-run-around-copyright [https://perma.cc/Z3YC-5F8M].
205 Cf. Deven R. Desai & Joshua A. Kroll, Trust but Verify, 31 HARV. J.L. & TECH. 1, 17–19, 29 (2017) (explaining the limits of detecting all possible results from software).
206 See Narayanan & Kapoor, supra note 204 (noting the speed with which OpenAI fixed a ChatGPT web browsing feature that could "bypass paywalls" once the computer scientists mentioned it on X).
207 Katherine Lee et al., Talkin' 'Bout AI Generation: Copyright and the Generative-AI Supply Chain, J. COPYRIGHT SOC'Y (forthcoming 2024) (manuscript at 12) (July 27, 2023) ("precedents have come to set expectations- among copyright owners, in the technology industry, in the copyright bar, and in the judiciary - for what legally 'responsible' behavior by an online intermediary looks like. A generative-AI service operator that does not appear to be making a good-faith effort to achieve something like this system may strike a court as intending to induce infringement.").
208 Authors Guild v. Google, Inc., 804 F.3d 202, 208–210 (2d Cir. 2015) (detailing with approval the nature of snippet view).
209 Id. at 210 (Google's program does not allow a searcher to increase the number of snippets revealed by repeated entry of the same search term or by entering searches from different computers. A searcher can view more than three snippets of a book by entering additional searches for different terms. However, Google makes permanently unavailable for snippet view one snippet on each page and one complete page out of every ten-a process Google calls "blacklisting.").
210 See supra Part III.A (discussing how academic benchmark goals differ from commercial ones).
211 David Streitfeld, Google Concedes that Drive-by Prying Violated Privacy, N.Y. TIMES (Mar. 12, 2013), https://www.nytimes.com/2013/03/13/technology/google-pays-fine-over-street-view-privacy-breach.html [https://perma.cc/N47P-Q26S].
212 Id.
213 Id.
214 Id.
215 See generally Adam D. I. Kramer et al., Experimental Evidence of Massive-Scale Emotional Contagion Through Social Networks, 111 PROC. NAT'L ACAD. SCI. 8788 (2014).
216 Id.
217 Charles Arthur, Facebook Emotion Study Breached Ethical Guidelines, Researchers Say, THE GUARDIAN (June 30, 2014, 4:51 AM), https://www.theguardian.com/technology/2014/jun/30/facebook-emotion-study-breached-ethical-guidelines-researchers-say [https://perma.cc/VL3E-NPDV]; Evan Selinger & Woodrow Hartzog, Facebook's Emotional Contagion Study and the Ethical Problem of Co-Opted Identity in Mediated Environments Where Users Lack Control, 12 RSCH. ETHICS 35 (2016).
218 Cf. Desai et al., supra note 109 (discussing how legal literature uses Amazon as an example of data bias and harm when in fact Amazon used good practices to manage its software).
219 Wes Davis, Sarah Silverman is Suing OpenAI and Meta for Copyright Infringement, THE VERGE (July 9, 2023), https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai [https://perma.cc/RUQ6-3T8N].
220 Complaint, Ex. A, Silverman v. OpenAI, Inc., No. 3:23-cv-03416 (N.D. Cal. July 7, 2023).
221 See Narayanan & Kapoor, supra note 204.
222 See Henderson et al., supra note 1.
223 But see Narayanan & Kapoor, supra note 204.
224 Communications Decency Act, 47 U.S.C. § 230(c).
225 Digital Millennium Copyright Act ("DMCA"), 17 U.S.C. § 1201.
226 17 U.S.C. § 512.
227 Viacom Int'l Inc., v. YouTube, Inc., 718 F. Supp. 2d 514, 524 (S.D.N.Y. 2010).
228 Viacom Int'l Inc., v. YouTube, Inc., 940 F. Supp. 2d 110, 115 (S.D.N.Y. 2013) (rejecting Viacom's argument that the volume of clips on YouTube precluded safe harbor protection).
229 Id. at 118–119 (noting cases where companies hosting potentially infringing intellectual property were shielded from liability under the DMCA).
230 Miguel Helft, Judge Sides with Google in Viacom Video Suit, N.Y. TIMES (June 23, 2010), https://archive.nytimes.com/www.nytimes.com/2010/06/24/technology/24google.html [https://perma.cc/4D5F-RSF5].
231 Id.
232 See Henderson et al., supra note 4 (noting trademark law and online sales as an example of the law embracing mitigation strategies).
233 Tiffany (NJ) Inc. v. eBay Inc., 600 F.3d 93, 97–98 (2d Cir. 2010).
234 Id. at 103.
235 Id. at 99, 109.
236 See, e.g., Lee et al., supra note 207 (offering detailed analysis about why Section 512 does not apply to generative AI systems).
237 See, e.g., Cho, supra note 95.
238 See, e.g., Samuelson et al., supra note 123.
239 See Narayanan & Kapoor, supra note 204.
240 Federal Trade Commission Comments Submitted in Response to the U.S. Copyright Office's Aug. 30, 2023 Notice of Inquiry, Docket No. 2023-6 (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8630 [https://perma.cc/2Z9M-MSYH].
241 See, e.g., Narayanan & Kapoor, supra note 204 (claiming the big issue around generative AI is "the injustice of labor appropriation in generative AI" and offering that the way ChatGPT hides the NY Times as source material will erode traffic to news sites).
242 See supra note 240 (noting FTC concern over "the risks associated with AI use, including violations of consumers' privacy, automation of discrimination and bias, and turbocharging of deceptive practices, imposter schemes, and other types of scams").
243 See, e.g., Cho, supra note 102.
244 See, e.g., Deven R. Desai, The Life and Death of Copyright, 2011 WIS. L. REV. 219, 245 (2011) (tracing Lockean labor theory ideas in copyright law). There are other theoretical foundations for property; see, e.g., Margaret Jane Radin, Property and Personhood, 34 STAN. L. REV. 957 (1982). But as Professor Madhavi Sunder offers, "Assertions of power over one's own identity necessarily lead to assertions of property ownership. [. . .] Property enables us to have control over our external surroundings. Seen in this light, it is not enough to see all claims for more property simply as intrusions into the public domain and violations of free speech. Instead, we may begin to see them as assertions of personhood." Madhavi Sunder, Property in Personhood, in RETHINKING COMMODIFICATION: CASES AND READINGS IN LAW AND CULTURE 164, 170 (Martha M. Ertman & Joan C. Williams eds., 2005). Thus, the theoretical foundation does not change the property instinct and the demand for control over the property.
245 See, e.g., Mark A. Lemley, Property, Intellectual Property, and Free Riding, 83 TEX. L. REV. 1031, 1032 (2005) (arguing that use of "the rhetoric of real property, with its condemnation of 'free riding' by those who imitate or compete with intellectual property owners" has resulted in "a legal regime for intellectual property . . . in which courts seek out and punish virtually any use of an intellectual property right by another").
246 Pamela Samuelson et al., Comments in Response to the U.S. Copyright Office's Aug. 30, 2023 Notice of Inquiry, Artificial Intelligence and Copyright, Docket No. 2023-6, 26 (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8854 [https://perma.cc/4UC3-7NVE].
247 See supra notes 119–127 and accompanying text.
248 See Henderson et al., supra note 1.
249 See supra notes 119–127 and accompanying text.
250 A recent example of copy control restrictions hindering research involves an effort to allow access to out-of-print video games. Andy Chalk, US Copyright Law 'Forces Researchers to Explore Extra-Legal Methods' for Game Preservation, Say Historians Who Are 'Disappointed' After Being Denied a DMCA Exemption, PC GAMER (Oct. 25, 2024), https://www.pcgamer.com/games/us-copyright-law-forces-researchers-to-explore-extra-legal-methods-for-game-preservation-say-historians-who-are-disappointed-after-being-denied-a-dmca-exemption/ [https://perma.cc/H5U9-3APY]. The copyright office has, however, refused to grant an exemption for such work. Id.
251 Cf. Authors Guild v. Google, Inc., 804 F.3d 202, 228 (2d Cir. 2015) (noting with approval the security measures Google took in building and maintaining the Google Books Project).
252 We acknowledge that early critiques of the GBPC include issues with the quality of the optical character recognition scanning, metadata problems, and other technical critiques of the corpus. See, e.g., Mark Davies, Making Google Books N-Grams Useful for a Wide Range of Research on Language Change, 19 INT'L J. OF CORPUS LINGUISTICS 401, 402 n.2 (2014). And yet as Google has improved just its N-Gram research offering, scholars have found the corpus useful. Id. at 415. Furthermore, NLP research has long advanced in the face of messy data; the point is that better access enables such research, not that the dataset is somehow perfect.
253 Ari Marini, How the Google Books Team Moved 90,000 Books Across a Continent, GOOGLE (Jan. 27, 2023), https://blog.google/products/search/google-books-library-project/#logistics [https://perma.cc/9QMA-F7UE].
254 This would be a hypothesis that could be tested, though at great expense. A book of gibberish words will not factor into a model the same way as a book of common idioms; the latter aids generalization, the former not so much. One would need analytical tools such as Shapley values. See, e.g., Amirata Ghorbani & James Zou, Data Shapley: Equitable Valuation of Data for Machine Learning, 97 PROC. 36TH INT'L CONF. ON MACH. LEARNING 2242 (2019), https://proceedings.mlr.press/v97/ghorbani19c.html [https://perma.cc/BL3X-XXTK]. To date, computing Shapley values (how much a unit of data impacts the overall model) requires training versions of the model with and without each unit of data. While not technically impossible, it would require training millions of versions of, for example, GPT-3 and then comparing their outputs (there are approximation techniques, but these are also prohibitively expensive at large scale). It is quite likely, though, that no one unit of data in an LLM stands out as significantly more important than another for most books that follow typical language patterns.
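For the flavor of such data-valuation tools, the following Python sketch shows the simpler leave-one-out idea that Shapley-style methods generalize; the train and evaluate callables are placeholders for whatever training and benchmarking procedure one uses, and the toy usage is purely illustrative:

    def leave_one_out_values(dataset, train, evaluate):
        # Score each example by how much the evaluation metric changes when that
        # example is removed. Data Shapley averages this kind of marginal
        # contribution over many subsets, which is why it is so costly at LLM scale.
        baseline = evaluate(train(dataset))
        values = {}
        for i in range(len(dataset)):
            reduced = dataset[:i] + dataset[i + 1:]
            values[i] = baseline - evaluate(train(reduced))
        return values

    # Toy usage with stand-in "training" and "benchmark" functions.
    toy_data = [1, 2, 3, 100]
    train = lambda data: data                          # the "model" is just the data itself
    evaluate = lambda model: sum(model) / len(model)   # the "benchmark" is the mean
    print(leave_one_out_values(toy_data, train, evaluate))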
255 See discussion supra Part II.B.2.
256 See Authors Guild, 804 F.3d at 228 (noting with approval the security measures Google took in building and maintaining the Google Books Project).
257 See AutoML, GOOGLE, https://cloud.google.com/automl?hl=en [https://perma.cc/7K8Y-JM5T].
258 See Welcome to Colab, GOOGLE, https://colab.research.google.com [https://perma.cc/3HTJ-CSRY].
259 Although we focus on books, the idea is not limited to them; other media, such as audio recordings and video, could use a similar repository.
260 See Joel L. McKuin, Home Audio Taping of Copyrighted Works and the Audio Home Recording Act of 1992: A Critical Analysis, 16 HASTINGS COMM. & ENT. L.J. 311, 321–322 (1993) (detailing music industry concerns over new digital recording formats).
261 See 17 U.S.C. § 1004(b).
262 See 17 U.S.C. §§ 1006–1007.
263 See, e.g., Cho, supra note 102.
264 See Desai, supra note 244, at 221–22, 257.
265 See Unit Sales of Printed Books in the United States from 2004 to 2023, STATISTA, https://www.statista.com/statistics/422595/print-book-sales-usa/ [https://perma.cc/599F-GCWK].
266 See Amy Watson, E-Books in the U.S. - Statistics & Facts, STATISTA (Dec. 18, 2023), https://www.statista.com/topics/1474/e-books/#topicOvervi [https://perma.cc/MY9C-CZH6].
267 In that sense, the approach follows the logic of addressing unauthorized copying by spreading the cost and enabling access rather than trying to eliminate the practice.
268 The reason a pre-ChatGPT release date matters is that, after ChatGPT's release, the Internet changed as a test bed for language. Rather than containing words assembled by humans alone, the Internet has quickly filled with LLM-generated text, and that alters how well the dataset can be used to assess whether a model assembles words as humans do. After all, if a control set contains machine-generated language, the set is no longer a good benchmark. See Melissa Heikkilä, How AI-Generated Text Is Poisoning the Internet, MIT TECH. REV. (Dec. 20, 2022), https://www.technologyreview.com/2022/12/20/1065667/how-ai-generated-text-is-poisoning-the-internet/ [https://perma.cc/DFG9-RU3Q] ("In the future, it's going to get trickier and trickier to find good-quality, guaranteed AI-free training data, says Daphne Ippolito, a senior research scientist at Google Brain, the company's research unit for deep learning."); see Robert McMillan, AI Junk Is Starting to Pollute the Internet, WALL ST. J. (July 12, 2023 8:00 AM), https://www.wsj.com/articles/chatgpt-already-floods-some-corners-of-the-internet-with-spam-its-just-the-beginning-9c86ea25 [https://perma.cc/7UJS-BXF8] ("Should the internet increasingly fill with AI-generated content, it might become a problem for the AI companies themselves. That is because their large language models, the software that forms the basis of chatbots such as ChatGPT, train themselves on public data sets. As these data sets become increasingly filled with AI-generated content, researchers worry that the language models will become less useful, a phenomenon known as 'model collapse.'").
269 We acknowledge that while editing this Article, efforts have been made to create open-access, public domain datasets that avoid copyright issues. See, e.g., Wiggers, supra note 192 (describing EleutherAI and Hugging Face offerings). Although these efforts may help resolve some dataset issues, the ongoing problem of good and up-to-date data will necessarily run into copyright issues.
270 Sobel, supra note 1, at 54–56 (discussing cases holding that copies of copyrighted images and other copyrighted material are protected as fair use).
271 Samuelson et al., supra note 246 ("It is arguable that respect for opt-outs should be part of the fair use analysis as a general consideration.").
272 Id. at 24.
273 See Artificial Intelligence and Intellectual Property – Part II: Copyright: Hearing Before the Subcomm. on Intell. Prop. of the U.S. S. Comm. on the Judiciary, 118th Cong. 6 (2023) (statement of Matthew Sag, Professor of Law, Emory University School of Law).
274 See Castle Rock Ent. v. Carol Pub. Grp., Inc., 955 F. Supp. 260, 262 (S.D.N.Y. 1997).