December 13, 2024 - TechCrunch reported on December 12 that Harvard University and Google announced the joint release of 1 million public domain books as an AI training dataset.
The data needed for AI training is costly, which favors well-funded tech companies. In response, Harvard plans to release a dataset of about 1 million public domain books covering a wide range of genres, languages, and authors. It includes classic authors such as Dickens, Dante, and Shakespeare, whose works are no longer protected because their copyrights have expired.
The new dataset is not yet public, and it is not clear exactly how or when it will be released. It is drawn from Google's long-running Google Books program, so Google will take part in the broader release of this "valuable asset."
According to 1AI, back in March of this year, Harvard University revealed its Institutional Data Initiative (IDI) and said that the program was designed to provide AI with "trusted access to legitimate data". It was not until after the official launch that the program confirmed it is funded by Microsoft and OpenAI.
Greg Leppert, IDI's executive director, says the goal of the dataset is to "level the playing field" by opening this huge dataset to a variety of organizations, including research institutes, AI startups, and the University of California at Berkeley, to help them train large language models.