According to Wired, some technology giants, including Apple, used the subtitle files of YouTube videos to train artificial intelligence models without the consent of the video creators.
The creators affected include well-known YouTubers MKBHD (Marques Brownlee), MrBeast, and PewDiePie, as well as talk show hosts Stephen Colbert, John Oliver, and Jimmy Kimmel. The subtitle files used to train AI are essentially text transcripts of the videos.
Investigative journalists have revealed that some of the world's richest tech companies have been using material from thousands of YouTube videos to train AI, in violation of YouTube's rules against scraping content from the platform without permission. Subtitle files from more than 173,000 YouTube videos across 48,000 channels were used to train AI models. The companies involved include Apple, Nvidia, Salesforce, and other Silicon Valley giants.
According to reports, the subtitle files were downloaded by a non-profit organization called EleutherAI, which says its purpose is to help developers train AI models. Although EleutherAI's original intention may have been to provide training materials for small developers and academic researchers, the dataset has also been used by technology giants such as Apple.
According to a research paper published by EleutherAI, this dataset is part of a larger collection the organization released called "The Pile." Most of the datasets in "The Pile" are public and can be accessed by anyone with enough storage space and computing power. Academics and independent developers have used the collection, but so have companies with market values in the tens or even hundreds of billions of dollars: Apple, Nvidia, and Salesforce have all described in their research papers and posts how they used it to train AI models.
Documents show that Apple used "The Pile" to train its much-anticipated OpenELM model a few weeks before releasing it in April. The release of OpenELM coincided with Apple's announcement that it would add new AI features to iPhones and MacBooks.
It should be noted that Apple did not download the data itself; EleutherAI did. So technically, it was EleutherAI that violated YouTube's terms of use.
While Apple and other companies may have used publicly available datasets, the incident highlights the legal risks of scraping data from the web to train AI systems. There have been cases of AI systems plagiarizing entire paragraphs of text when answering niche questions, and when companies use datasets compiled by third parties, it only increases the risk of using material without permission.