23 January 2023

ChatGPT Stole Your Work. So What Are You Going to Do?

NICK VINCENT AND HANLIN LI

IF YOU’VE EVER uploaded photos or art, written a review, “liked” content, answered a question on Reddit, contributed to open source code, or done any number of other activities online, you’ve done free work for tech companies, because downloading all this content from the web is how their AI systems learn about the world.

Tech companies know this, but they mask your contributions to their products with technical terms like “training data,” “unsupervised learning,” and “data exhaust” (and, of course, impenetrable “Terms of Use” documents). In fact, much of the innovation in AI over the past few years has been in ways to use more and more of your content for free. This is true for search engines like Google, social media sites like Instagram, AI research startups like OpenAI, and many other providers of intelligent technologies.

This exploitative dynamic is particularly damaging when it comes to the new wave of generative AI programs like Dall-E and ChatGPT. Without your content, ChatGPT and all of its ilk simply would not exist. Many AI researchers think that your content is actually more important than what computer scientists are doing. Yet these intelligent technologies that exploit your labor are the very same technologies that are threatening to put you out of a job. It’s as if the AI system were going into your factory and stealing your machine.

But this dynamic also means that the users who generate data have a lot of power. Discussions over the use of sophisticated AI technologies often come from a place of powerlessness and the stance that AI companies will do what they want, and there’s little the public can do to shift the technology in a different direction. We are AI researchers, and our research suggests the public has a tremendous amount of “data leverage” that can be used to create an AI ecosystem that both generates amazing new technologies and shares the benefits of those technologies fairly with the people who created them.

DATA LEVERAGE CAN be deployed through at least four avenues: direct action (for instance, individuals banding together to withhold, “poison,” or redirect data), regulatory action (for instance, pushing for data protection policy and legal recognition of “data coalitions”), legal action (for instance, communities adopting new data-licensing regimes or pursuing a lawsuit), and market action (for instance, demanding large language models be trained only with data from consenting creators).

Let’s start with direct action, which is a particularly exciting route because it can be done immediately. Because of generative AI systems’ reliance on web scraping, website owners could significantly disrupt the training data pipeline if they disallow or limit scraping by configuring their robots.txt file (a file that tells web crawlers which pages are off limits).
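As a rough sketch of what this looks like in practice, a site owner could add rules like the following to the robots.txt file at the root of their domain. The user-agent name CCBot is Common Crawl’s crawler, a common source of AI training data; which specific crawler names any given AI company uses is an assumption that site owners would need to verify for themselves:

```
# Block Common Crawl's crawler (a major source of AI training data)
User-agent: CCBot
Disallow: /

# All other crawlers (e.g., search engines) may continue as before
User-agent: *
Allow: /
```

It’s worth noting that robots.txt is advisory rather than an enforcement mechanism: well-behaved crawlers honor it, but blocking determined scrapers requires stronger measures like the IP and API restrictions discussed below.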

Large user-generated content sites like Wikipedia, Stack Overflow, and Reddit are particularly important to generative AI systems, and they could prevent these systems from accessing their content in even stronger ways—for example, by blocking IP traffic and API access. According to Elon Musk, Twitter has recently done exactly this. Content producers should also take advantage of the opt-out mechanisms that are increasingly being provided by AI companies. For instance, programmers on GitHub can opt out of BigCode’s training data via a simple form. More generally, simply being vocal when content has been used without your consent has been somewhat effective. For example, major generative AI player Stability AI agreed to honor opt-out requests collected via haveibeentrained.com after a social media uproar. By engaging in public forms of action, as in the case of mass protest against AI art by artists, it may be possible to force companies to cease business activities that most of the public perceives as theft.

Media companies, whose work is quite important to large language models (LLMs), may also want to consider some of these ideas to restrict generative AI systems from accessing their own content, as these systems are currently getting their crown jewels for free (including, likely, this very op-ed). For instance, Ezra Klein mentioned in a recent podcast that ChatGPT is great at imitating him, probably because it downloaded a whole lot of his articles without asking him or his employer.


Critically, time is also on the side of data creators: As new events occur in the world, art goes out of style, facts change, and new restaurants open, new data flows are necessary to support up-to-date systems. Without these flows, these systems will likely fail for many key applications. By refusing to make new data available without compensation, data creators could put pressure on companies to pay for access.

On the regulatory side, lawmakers need to act quickly to address what might be the largest theft of labor in history. One of the best ways to do this is clarifying that “fair use” under copyright law does not allow for training a model on content without the content owner’s consent, at least for commercial purposes. Lawmakers around the world should also work on “anti-data-laundering” laws that make it clear that models trained on data without consent have to be retrained within a reasonable amount of time without the offending content. Much of this can build on existing frameworks in places like Europe and California, as well as the regulatory work being done to make sure news organizations get a share of the revenue they generate for social media platforms. There is also growing momentum for “data dividend” laws, which would redistribute the wealth generated by intelligent technologies. These can also help, assuming they avoid some key pitfalls.

In addition, policymakers could help individual creators and data contributors come together to make demands. Specifically, supporting initiatives such as data cooperatives—organizations that make it easy for data contributors to coordinate and pool their power—could facilitate large-scale data strikes among creators and bring AI-using firms to the negotiating table.

The courts also present ways for people to take back control of their content. While the courts work on clarifying interpretations of copyright law, there are many other options. LinkedIn has been successful at preventing people who scrape its website from continuing to do so through Terms of Use and contract law. Labor law may also provide an angle to empower data contributors. Historically, companies’ reliance on “volunteers” to operate their businesses has raised important questions about whether these companies violated the Fair Labor Standards Act, and these fights could serve as a blueprint. In the past, some volunteers have even reached legal settlements with companies that benefited from their work.

There is also a critical role for the market here. If enough governments, institutions, and individuals demand “full-consent LLMs”—which pay creators for the content they use—companies will respond. This demand could be bolstered by successful lawsuits against organizations that use generative AI (in contrast to organizations that build the systems) without paying users. If applications built on top of AI models face lawsuits, there will be greater demand for AI systems that aren’t playing in the legal Wild West.

Our lab’s research (and that of colleagues) also suggests something that surprised us: Many of the above actions should actually help generative AI companies. Without healthy content ecosystems, the content that generative AI technologies rely on to learn about the world will disappear. If no one goes to Reddit because they get answers from ChatGPT, how will ChatGPT learn from Reddit content? These companies will face significant challenges as a result, and supporting some of the above efforts now would help solve those challenges before they appear.
