Sunstein Insights

ChatGPT: Is Theft a Bug or a Feature?

July 10, 2023

Thomas C. Carey | Partner, Business Chair

Thomas is a member of our Business Practice Group

ChatGPT has been making headlines lately. The lawyer who obtained fake judicial holdings from ChatGPT and then cited them in his brief to the court is the most recent example. This misuse of artificial intelligence (AI) resulted in little harm other than to the reputation of the lawyers involved. But the potential for much greater harm looms large in the minds of many. In fact, in May 2023 a group of 350 executives and scientists signed a statement that “mitigating the risk of [human] extinction from A.I. should be a global priority, alongside ... nuclear war”.

While the risks associated with AI are high, so are the potential rewards, leading some to take seemingly contradictory actions. Bill Gates, one of the signatories to the statement, recently invested in Inflection AI, a start-up that is working on technology similar to that of ChatGPT. And while Microsoft has invested more than $10 billion in OpenAI, the maker of ChatGPT, two of its C-suite officers signed the foreboding statement about the risk of extinction.

The European Union, a leader in privacy regulation, is trying to develop a regulatory framework for AI deployment. For example, the European Data Protection Board has launched a dedicated task force on ChatGPT. But because of the ponderous nature of the EU legislative process, it will take years for the EU to finalize any AI regulation. The Biden White House has published a Blueprint for an AI Bill of Rights, but it does not propose any specific legislation.

The speed at which AI is developing will likely outpace the legislative response. What is the chance that existing laws may be used to provide some meaningful guardrails on the development of AI? Two class action lawsuits naming OpenAI, the owner of ChatGPT, as a defendant were filed in late June in the Northern District of California. They may provide an answer to that question.

The complaint in Tremblay v. OpenAI, Inc. is founded almost entirely on copyright claims. The named plaintiffs in this case are authors who have filed copyright registrations protecting their books. The complaint explains that ChatGPT is a “large language model” (LLM) that is trained by copying massive amounts of text as a training dataset. By learning natural expressive language from this dataset, the LLM is able to respond to questions in ways that seem to be coming from a human.

The complaint in Tremblay cites a 2018 paper introducing an early version of ChatGPT as saying that it used over 7,000 books as part of its training dataset. The complaint then lays out a trail of clues suggesting that later versions of ChatGPT drew on one or more “shadow libraries” of illegally copied books for their training dataset. Those libraries contain the text of roughly 300,000 books.

The Tremblay complaint says that ChatGPT was able to generate very accurate summaries of books written by the plaintiffs, suggesting that it had ingested them entirely. On that basis, the complaint seeks relief on behalf of all authors whose books were illegally copied as part of the ChatGPT training dataset. In their prayer for relief, the plaintiffs ask for an award of statutory damages and attorneys’ fees (available under the Copyright Act).

This complaint is relatively simple and straightforward. If successful, the plaintiffs may increase the cost of doing business for ChatGPT, but they do not seem poised to slow it down.

The complaint in P.M. v. OpenAI is more ambitious. It too is a class action lawsuit, but it is based primarily on violations of state and federal privacy laws. Among them: the Electronic Communications Privacy Act, the Children’s Online Privacy Protection Act, the California Invasion of Privacy Act, the California Consumer Privacy Act and the Illinois Biometric Information Privacy Act.

The complaint describes, but to protect them from harassment does not identify, 16 individuals whose privacy rights are alleged to have been violated by ChatGPT. This lengthy complaint begins as follows:

On October 19, 2016, University of Cambridge Professor of Theoretical Physics Stephen Hawking predicted, “Success in creating AI could be the biggest event in the history of our civilization. But it could also be the last, unless we learn how to avoid the risks.” …
The future Professor Hawking predicted has arrived in just seven short years. Using stolen and misappropriated personal information at scale, Defendants have created powerful and wildly profitable AI and released it into the world without regard for the risks. In so doing, Defendants have created an AI arms race in which Defendants and other Big Tech companies are onboarding society into a plane that … has at least a 10% chance of crashing and killing everyone on board.
...
Defendants’ disregard for privacy laws is matched only by their disregard for the potentially catastrophic risk to humanity. Emblematic of both the ultimate risk—and Defendants’ open disregard—is this statement from Defendant OpenAI’s CEO Sam Altman: “AI will probably most likely lead to the end of the world, but in the meantime, there’ll be great companies.”

The complaint alleges that ChatGPT can be used to create virtually undetectable malware at massive scale and to create autonomous weapons. It describes the creation of a ChatGPT clone called ChaosGPT that adopted, as one of its own objectives, the destruction of humanity. It then, of its own volition, searched the internet for weapons of mass destruction, seeking to obtain one.

The P.M. complaint alleges that ChatGPT’s training dataset included, in addition to the books described in the Tremblay complaint, Common Crawl, WebText2 and Wikipedia. The complaint provides this information about the first two datasets:

Common Crawl is a trillion-word collection of text and metadata from webpages and websites scraped over a 12-year period; and
WebText2 was built by scraping every webpage linked to on Reddit where the link had received at least three “likes” (or “Karma” votes, to use Reddit terminology). These links would include text, videos and audio from YouTube, Facebook, TikTok, Snapchat and Instagram. The complaint alleges that this scraping is ongoing.

The privacy dangers of Common Crawl are illustrated by the story of a woman who was able to determine that her private medical file (including photographs of her body taken while she was undergoing treatment for a rare disease) had ended up online and been gathered into the Common Crawl dataset.

The complaint describes an ongoing ingestion of data and text from more common sources. ChatGPT is now integrated into several Microsoft products that have millions of users, including Teams, Bing and Cortana. Thus, the complaint alleges, the operation of these products results in the scavenging of data belonging to millions of people who don’t even use ChatGPT. This data is fed into ChatGPT for its continuous evolution.

According to the complaint, the integration of ChatGPT into other products does not end with Microsoft products. It is integrated into Amazon, Expedia, Instacart, Google, Zillow, OkCupid (a dating app) and many other products. Thus, the plaintiffs allege, ChatGPT has become a virtual spy, closely monitoring, recording, and training on the personal data, clicks, searches, inputs, and personal information of millions of unsuspecting individuals who may be using Instacart to purchase grocery items, using a telehealth company to make a doctor’s appointment, or simply browsing Expedia to make vacation plans.

A key element of the theory of this complaint is that ChatGPT ingests all of this data as a means of improving its ability to mimic humans, and that once this ingestion happens, it is irreversible. Thus, certain rights that are specifically granted by the California Consumer Privacy Act (CCPA), including the right to be forgotten and the right to correct wrong information, are irretrievably lost. The complaint analyzes the privacy policy of ChatGPT in detail and dismisses it as window dressing in light of the nature of ChatGPT.

In describing the risks posed by ChatGPT, the complaint summarizes the plight of a U.S. law professor, Jonathan Turley, whom ChatGPT falsely accused of sexually harassing one of his students, even providing a “source” for the purported offense in the form of a news article that it invented. The complaint says that “Defendants call this ‘hallucination,’ but the world knows it as defamation.”

Darker problems arise. Dall-E is an image generation product that is being incorporated into ChatGPT. It can generate realistic digital images from natural language prompts. According to the complaint, it was trained on billions of images taken from photo sites and personal blogs without notice to or consent of those who are pictured. Many images show children. The complaint alleges that Dall-E has become a favorite tool for pedophiles and that it (or products like it) is being used to build fake school-age personae via fabricated selfies, which are incorporated into plots to lure and groom child targets.

The complaint alleges that ChatGPT will soon gather audio data with yet another AI product, Vall-E. Vall-E can process a mere three (3) seconds of a human voice and then speak convincingly in that voice.

The complaint proposes a number of subclasses for purposes of its class action lawsuit, varying by the members’ state of residence (California, Illinois or New York), whether they are direct users of ChatGPT, and whether they are minors.

Some of the legal theories presented by this complaint seem to stretch the scope of the statutes cited beyond their apparent coverage. But the theories based upon violations of California and Illinois privacy laws have considerable bite. There will be a long legal struggle in this litigation, but it may result in meaningful relief before any new legislation is enacted that will rein in the explosive growth of ChatGPT and other LLM products.

The defendants will likely maintain that all user data is de-identified in the course of training ChatGPT. This may not matter to the courts because the statutes involved don’t let data aggregators off the hook so easily. For example, the Illinois statute forbids the collection or capture of a person’s biometric information (which includes a scan of a face or a voiceprint) without informing the subject in advance and obtaining a release from that person. If the plaintiffs’ description of how ChatGPT is trained is correct, it will be very difficult to square OpenAI’s practices with that statute.

Litigation is not the only legal challenge facing AI. Federal Trade Commissioner Alvaro Bedoya, the keynote speaker at a 2023 Global Privacy Summit, said that AI can be regulated by the FTC under Section 5 of the FTC Act if the owner makes deceptive claims about it. In the past, the FTC has based enforcement actions on discrepancies between a company’s privacy policy and its actual practices. In that regard, the complaint in P.M. lays out a detailed blueprint for the FTC.

Finally, federal legislation is not out of the question. OpenAI CEO Sam Altman (the same man who predicted that AI developed by “great companies” would lead to a catastrophic end) recently urged a Senate panel to regulate AI. In response, Senators Michael Bennet (D-Colo.) and Peter Welch (D-Vt.) introduced the proposed Digital Platform Commission Act of 2023, which would establish a federal commission with a mandate to develop and enforce rules for the AI sector.

AI has great potential for good and for bad. With luck, the worst can be averted through careful controls, while the potential for good is realized.