A YouTube creator filed a class action lawsuit in the U.S. District Court for the Northern District of California last Friday, alleging that OpenAI scraped transcripts of millions of YouTube videos, without notifying or compensating the video owners, to train its generative AI models.
The creator, David Millette of Massachusetts, USA, accuses OpenAI of scraping his videos and those of other creators to train AI models. The products involved include ChatGPT and Sora, among others.
The class action lawsuit alleges that OpenAI reaped “generous rewards” from the data it collected, and that this practice violated copyright law and YouTube’s terms of service.
Millette has retained the law firm Bursor & Fisher to pursue the class action. The plaintiff requests a jury trial and seeks more than $5 million (currently approximately RMB 35.683 million) in damages on behalf of all YouTube users and creators whose data may have been used in OpenAI's training.
Generative AI models are not truly intelligent. They learn statistical patterns and likelihoods by processing large numbers of data samples (such as movies, recordings, and papers). The training data for many models comes from public websites and datasets on the Internet. Although companies claim that their data scraping falls under the principle of "fair use", many copyright holders disagree and have filed lawsuits to stop the practice.
Video transcripts have become an important source of training data, especially as other data sources are exhausted. According to Originality.AI, more than 35% of the world's top websites have blocked OpenAI's web crawlers. In addition, research from MIT's Data Provenance Initiative shows that about 25% of high-quality data sources have been restricted, making training data for AI models scarcer.
It is worth noting that OpenAI's Whisper model was used specifically to transcribe audio from videos in order to gather more training data. According to The New York Times, the OpenAI team transcribed more than one million hours of YouTube videos and used the resulting text to train its GPT-4 model. This prompted internal discussions about whether doing so might violate YouTube's rules.