OpenAI recently launched Sora, its much-discussed text-to-video generation model. However, in an interview with the Wall Street Journal, the company's chief technology officer (CTO), Mira Murati, was unable to give a coherent account of where Sora's training data came from.
When the reporter asked Murati directly about the source of Sora's training data, she fell back on a vague official line: "We use publicly available data and licensed data."
When pressed on whether specific sources included YouTube videos, Murati said, "I'm actually not sure about that," and declined to answer questions about whether Instagram or Facebook videos were part of the training set. She suggested that if the videos were publicly available and usable, they might have been used, but said she herself was not sure.
Murati also dodged a question about OpenAI's data partnership with the stock photo giant Shutterstock, refusing to say whether videos from the platform were used to train Sora. Ultimately, she cut the discussion short, insisting that the data was "definitely publicly available or licensed" without giving any specifics.
Murati's evasive answers put OpenAI in an awkward position. The company has previously drawn widespread controversy over its data-scraping practices and has faced a number of copyright lawsuits, including one from the New York Times. Now even the CTO cannot say where the training data for one of its most popular models came from, raising questions about how seriously OpenAI executives are taking the issue.
After the interview, Murati reportedly acknowledged privately that OpenAI did use Shutterstock videos to train Sora. Even so, the Shutterstock material is likely to be only a small portion of Sora's training data compared with the vast amount of video content available on the web.
Murati's reticence has sparked considerable debate online. Many commenters felt she lacked candor and questioned how well she knew her own product. Some said outright that it was hard to believe a CTO would be uninformed about such a critical issue.
However, some defend Murati, arguing that once content has been posted to the web, AI companies should be allowed to use it. In their view, users who choose to make their content public accept the risk that it will be used.
It remains to be seen whether Murati's evasiveness was an attempt to head off further copyright disputes or whether she was genuinely unaware of the source of the data. What is certain is that the public has a right to ask where this "publicly available and licensed" AI training data came from. Going forward, vague official statements are unlikely to quell people's doubts.