A major copyright lawsuit against Meta has revealed a trove of internal communications about the company's plans to develop its open-source AI model, Llama, including discussions about avoiding media coverage suggesting the company "used a dataset we know is pirated."
The messages, part of a series of exhibits unsealed by a California court, suggest Meta used copyrighted data when training its AI systems and worked to conceal it as it raced to beat rivals like OpenAI and Mistral. Portions of the messages first appeared last week.
In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta's vice president of generative AI, wrote that the company's goal "needs to be GPT-4," referring to the large language model OpenAI announced in March 2023. Meta had to "learn how to build frontier and win this race," Al-Dahle added. Those plans apparently included using the book piracy site Library Genesis (LibGen) to train its AI systems.
An undated email from Meta director of product Sony Theakanath, sent to VP of AI research Joëlle Pineau, weighed whether to use LibGen internally only, for benchmarks included in a blog post, or to create a model trained on the site. In the email, Theakanath wrote that "GenAI has been approved to use LibGen for Llama3... with a number of agreed upon mitigations" after escalating the decision to "MZ" — presumably Meta CEO Mark Zuckerberg. As outlined in the email, Theakanath believed "Libgen is essential to meet SOTA [state-of-the-art] numbers," adding "it is known that OpenAI and Mistral are using the library for their models (through word of mouth)." Mistral and OpenAI haven't said whether they use LibGen. (The Verge reached out to both for more information.)
The court documents stem from a class action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted material to train its AI models in violation of intellectual property laws. Meta, like other AI companies, has argued that the use of copyrighted material in training data should be considered legal fair use. The Verge contacted Meta with a request for comment but didn't immediately receive a response.
Some of the "mitigations" for using LibGen included stipulations that Meta must remove data from the site clearly marked as pirated/stolen, and that it refrain from externally citing "the use of any training data" from LibGen. Theakanath's email also said the company would need to "red team" its models for bioweapons and CBRNE [Chemical, Biological, Radiological, Nuclear, and Explosives] risks.
The email also laid out some of the "policy risks" posed by using LibGen, including how regulators might respond to media coverage suggesting Meta's use of pirated content. "This may undermine our negotiating position with regulators on these issues," the email said. A separate April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu showed Bashlykov admitting he was "not sure we can use Meta's IPs to load through torrents [of] pirate content."
Other internal documents show the steps Meta took to obscure copyright information in LibGen's training data. One document, titled "observations on LibGen-SciMag," lists comments left by employees on how to improve the dataset. One suggestion is to "remove more copyright headers and document identifiers," including any lines containing "ISBN," "Copyright," "All rights reserved," or the copyright symbol. Other notes mention taking out more metadata "to avoid potential legal complications," as well as considering whether to remove a paper's list of authors "to reduce liability."
Last June, The New York Times reported on the frantic scramble inside Meta after ChatGPT's debut, revealing the company had hit a major roadblock: it had used up almost every available English-language book, article, and poem it could find online. Desperate for more data, executives reportedly discussed buying Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission.
In the report, some executives justified their approach by pointing to OpenAI's "market precedent" of using copyrighted works, while others argued that Google's 2015 court victory establishing its right to scan books could provide legal cover. "The only thing that's holding us back from being as good as ChatGPT is literally just data volume," one executive said in a meeting, according to The New York Times.
It has been reported that frontier labs like OpenAI and Anthropic have hit a data wall, meaning they don't have enough new data to train their large language models. Many leaders have denied this, with OpenAI CEO Sam Altman stating plainly: "There is no wall." OpenAI co-founder Ilya Sutskever, who left the company last May to start a new frontier lab, has been more candid about the possibility of a data wall. At a major AI conference last month, Sutskever said: "We've achieved peak data and there'll be no more. We have to deal with the data that we have. There's only one internet."
This data scarcity has led to a lot of weird, new ways to obtain unique data. Bloomberg reported that frontier labs like OpenAI and Google are paying digital content creators between $1 and $4 per minute for their unused video footage through third parties in order to train AI (both companies have competing AI video-generation products).
With companies like Meta and OpenAI hoping to grow their AI systems as fast as possible, things are bound to get a bit messy. Though a judge partially dismissed Kadrey and Silverman's class action lawsuit against Meta last year, the evidence outlined here could bolster parts of their case as it moves forward in court.