In addition to repositories of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers now in the public domain, and it says it is open to forming similar collaborations in the future. It is not yet certain how the books dataset will be released. The Institutional Data Initiative has asked Google to work together on public distribution, and the company has pledged its support.
However IDI's dataset is released, it will join a number of similar projects, startups, and initiatives that promise to give companies access to ample, high-quality AI training materials without the risk of running into copyright issues. Companies like Calliope Networks and Prorata have emerged to issue licenses and design compensation plans intended to get creators and rights holders paid for providing AI training data.
There are other new public-domain projects as well. Last spring, the French AI startup Pleias rolled out its own public-domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times on the open source AI platform Hugging Face this month alone. Last week, Pleias announced it was releasing its first set of large language models trained on this dataset, which Langlais told WIRED are the first models “exclusively trained on open data and tailored to the [EU] AI Act.”
Efforts are also underway to create similar image datasets. The AI startup Spawning released its own this summer, called Source.Plus, which includes public-domain images from Wikimedia Commons as well as various museums and archives. Many important cultural institutions, like the Metropolitan Museum of Art, have long made their own archives accessible to the public as standalone projects.
Ed Newton-Rex, a former Stability AI executive who now runs a nonprofit that certifies ethically trained AI tools, says the growth of these datasets shows that there is no need to steal copyrighted material to build high-performance, high-quality AI models. OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create a product like ChatGPT without using copyrighted works. “Such large public domain datasets further collapse the 'necessity defense' that some AI companies use to justify using copyrighted work to train their models,” says Newton-Rex.
But he is still skeptical about whether IDI and projects like it will really change the training status quo. “These datasets will only have a positive impact if they are used, possibly in conjunction with licensing other data, to replace the scraped copyrighted work. If they are just added to the mix, as one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they will bring huge benefits to AI companies,” he says.