A recent investigation revealed that a number of tech giants, including Apple, used YouTube video subtitles to train AI models. The data covers more than 170,000 videos, including content from well-known creators such as MKBHD and MrBeast. Apple used this data to train its open source model OpenELM, which was released in April of this year.
In response, Apple recently clarified that OpenELM is not used in any of its AI or machine learning capabilities, including Apple Intelligence, and emphasized that OpenELM was developed to contribute to the research community and to advance open source large language models. Apple researchers previously described OpenELM as "the most advanced open language model".
Apple says OpenELM is for research purposes only and does not support any Apple Intelligence features. The model is released as open source and is available on Apple's machine learning research site. This means that the "YouTube Subtitles" dataset is not being used to support Apple Intelligence, which Apple has previously stated is "trained on licensed data, including data selected for specific features and publicly available data collected by web crawlers."
It's worth noting that Apple has no plans to develop a new version of OpenELM. Wired magazine reports that, in addition to Apple, companies such as Anthropic and NVIDIA have also used the "YouTube Subtitles" dataset to train their AI models. The dataset is part of "The Pile", a large-scale dataset from the non-profit organization EleutherAI.
This incident has sparked a discussion about the source of AI training data and its impact on privacy and copyright. Despite Apple's clarification of OpenELM's use, the practice of tech companies using publicly available data to train AI models remains a concern.