Report: OpenAI collected over 1 million hours of YouTube videos to train GPT-4

Recently, the Wall Street Journal reported that artificial intelligence companies are struggling to collect high-quality training data. The New York Times then detailed some of the ways companies are dealing with the problem, several of which fall into the murky gray area of AI copyright law.

The story begins with OpenAI. The company, desperate for training data, reportedly developed its Whisper audio transcription model and used it to transcribe more than 1 million hours of YouTube videos to train GPT-4, its most advanced large language model. The New York Times reported that OpenAI knew this was legally questionable but believed it was fair use. OpenAI President Greg Brockman was personally involved in collecting the videos used.

OpenAI spokesperson Lindsay Held told The Verge that the company curates “unique” datasets for each model and uses “numerous sources, both public data and partners with non-public data.” Held also said the company is considering generating its own synthetic data.

Google also collects transcripts from YouTube, according to The New York Times' sources. Matt Bryant, a Google spokesman, said the company "trains models on some YouTube content in accordance with our agreements with YouTube creators."

Meta has similarly run into limits on the availability of good training data. In its efforts to catch up to OpenAI, the company has reportedly discussed using copyrighted works without permission, and has also considered paying for book licenses or outright acquiring a large publisher.

These companies are grappling with a rapidly shrinking supply of model training data. The Wall Street Journal wrote this week that companies could outpace the supply of new content by 2028. Possible solutions include training models on "synthetic" data that the models themselves create, or taking a "curriculum learning" approach. But another option for these companies is to use whatever they can find, whether they have permission or not, which could raise copyright concerns.
