In an interview, OpenAI Chief Executive Officer Sam Altman emphasized the importance of high-quality data for training artificial intelligence models, whether that data is human-generated or synthetic.
Speaking at the AI for Good Global Summit, Altman said that AI systems need high-quality data and that low-quality data, whether produced by humans or generated synthetically, is a problem. "I think what you need is high-quality data," Altman said. "There is low-quality synthetic data, and there is low-quality human data."
OpenAI already has enough data to train the next generation of models after GPT-4, Altman said. The company has been experimenting with generating large amounts of synthetic data to explore different training approaches, but Altman said the key question is how AI systems can learn more from less data, not simply how to produce ever-larger volumes of synthetic data for training.
Altman believes it would be “very weird” if the best way to train a model was to “generate something like a quadrillion labeled synthetic data points and feed it back.” For Altman, learning from data efficiently is key, and he describes the core question as “how do you learn more with less data?” He cautions that OpenAI and other companies still need to find the data and methods that are best suited to training increasingly powerful AI systems.
Research on model training supports Altman’s point that better data leads to better AI performance. It also fits with OpenAI’s recent strategy of spending hundreds of millions of dollars to license training data from major publishers. In this rapidly evolving field, there is still considerable scientific progress to be made in finding the best data and techniques for training AI systems.