Sarah Silverman is suing OpenAI. On Friday, the comedian and author, alongside novelists Christopher Golden and Richard Kadrey, filed a pair of complaints against OpenAI and Meta (via Gizmodo). The group alleges the firms trained their large language models on copyrighted materials, including works they published, without obtaining consent.
The complaints center on the datasets OpenAI and Meta allegedly used to train ChatGPT and LLaMA. In the case of OpenAI, while its “Books1” dataset roughly matches the size of Project Gutenberg, a well-known repository of copyright-free books, lawyers for the plaintiffs argue that the “Books2” dataset is too large to have come from anywhere other than so-called “shadow libraries” of illegally available copyrighted material, such as Library Genesis and Sci-Hub. Everyday pirates can access these materials through direct downloads, but, perhaps more usefully for those training large language models, many shadow libraries also make written material available in bulk torrent packages.
One exhibit from Silverman’s lawsuit involves an exchange between the comedian’s lawyers and ChatGPT. Silverman’s legal team asked the chatbot to summarize The Bedwetter, a memoir she published in 2010. Not only was the chatbot able to outline entire parts of the book, but some of the passages it relayed appear to have been reproduced verbatim.
Silverman, Golden and Kadrey aren’t the first authors to sue OpenAI over copyright infringement. In fact, the firm faces a host of legal challenges over how it went about training ChatGPT. In June alone, the company was served with two separate complaints. One is a sweeping class-action suit that alleges OpenAI violated federal and state privacy laws by scraping data to train the large language models behind ChatGPT and DALL-E.