OpenAI recently announced the Data Partnerships program, which aims to work with third-party organizations to create public and private datasets for AI model training. The initiative seeks to address problems with existing training datasets, which can contain toxic language and bias.
OpenAI's goal is to produce AI that is safer and more beneficial to all of humanity. To achieve this, it plans to collect large-scale datasets that reflect human society, especially data that is currently difficult to obtain online. The data will span a wide range of formats, including images, audio, and video, but the focus will be on data that expresses human intent, such as long-form writing or conversations, across different languages, topics, and formats.
OpenAI has also committed to working with partner organizations to digitize training data using optical character recognition (OCR) and automatic speech recognition (ASR) tools where needed, and to remove sensitive or personal information. Initially, it plans to create two types of datasets: a publicly available dataset that anyone can use for AI model training, and private datasets for training proprietary AI models on behalf of organizations that wish to protect data privacy.
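OpenAI has not published details of its redaction tooling, but the idea of stripping personal information from raw text can be sketched with a toy pattern-based pass. The patterns, labels, and `redact` function below are illustrative assumptions, not OpenAI's actual pipeline; production systems would rely on far more robust techniques such as named-entity recognition.

```python
import re

# Hypothetical patterns for two common personal identifiers.
# A real pipeline would detect many more categories and edge cases.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched personal identifiers with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Jane at jane.doe@example.com or 555-123-4567."))
# → Contact Jane at [EMAIL] or [PHONE].
```

Placeholder tokens (rather than outright deletion) preserve sentence structure, which matters when the redacted text is still meant to be useful for model training.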
Despite OpenAI's ambitious goals, some have questioned its business motives. Critics argue that the initiative, designed to improve the performance of OpenAI's own models, may harm the interests of other organizations and may not provide fair compensation to data owners. This has prompted discussions about transparency and data-usage rights.
OpenAI's Data Partnerships program aims to advance AI models, but its implementation and impact remain to be seen. Whether OpenAI can overcome challenges such as dataset bias has yet to be proven.