High-quality datasets are running out across the entire internet! It has been reported that AI companies such as OpenAI and Anthropic are struggling to find enough data to train the next generation of AI models. The growing data shortage has become a critical obstacle to training the next generation of powerful models. Faced with this challenge, AI startups and internet giants alike are beginning to look for new ways around the compute and data bottleneck.
Source note: image generated by AI; image licensed through Midjourney.
Reportedly, developing powerful systems such as GPT-5 requires massive amounts of data as training material, yet high-quality public data on the internet has become scarce.
Pablo Villalobos, a researcher at the research institute Epoch, estimates that GPT-4 was trained on as many as 12 trillion tokens. He went on to say that, based on Chinchilla's scaling law, an AI system like GPT-5 would require 60 to 100 trillion tokens of data if it continued to follow this scaling trajectory. In other words, even after exhausting all of the available high-quality language and image data, training GPT-5 would still fall 20 trillion tokens short.
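For intuition, here is a back-of-the-envelope sketch of that estimate. The roughly 20-tokens-per-parameter ratio is the compute-optimal heuristic from the Chinchilla paper (Hoffmann et al., 2022); the model sizes and the available-data figure below are illustrative assumptions chosen to bracket the numbers quoted above, not figures from the article.

```python
# Back-of-the-envelope Chinchilla-style token estimate.
# Assumption: the ~20 tokens-per-parameter compute-optimal heuristic
# from the Chinchilla paper; model sizes below are hypothetical.

TOKENS_PER_PARAM = 20

def optimal_tokens(num_params: float) -> float:
    """Compute-optimal training tokens for a model of the given size."""
    return TOKENS_PER_PARAM * num_params

# Hypothetical model sizes chosen to bracket the 60-100T token range.
for params in (3e12, 5e12):  # 3T and 5T parameters, assumed
    need = optimal_tokens(params)
    print(f"{params / 1e12:.0f}T params -> {need / 1e12:.0f}T tokens")

# Assumed upper bound on usable high-quality data vs. the upper end of
# the quoted GPT-5 requirement; the difference is the cited ~20T-token gap.
available = 80e12
required = 100e12
print(f"shortfall: {(required - available) / 1e12:.0f}T tokens")
```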
Some data owners, such as Reddit, have also instituted policies restricting AI companies' access to their data, exacerbating the shortage. To work around it, some companies are trying to train models on synthetic data, but this approach can run into problems such as 'model autophagy disorder', where models degrade from repeatedly consuming their own output.
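A toy simulation conveys the failure mode. Here a simple Gaussian stands in for a generative model: each generation is refit only on samples drawn from the previous one, and its spread tends to collapse. This is purely illustrative; real generative models degrade in analogous but richer ways.

```python
# Toy illustration of 'model autophagy disorder': refit a Gaussian
# on finite samples drawn from the previous generation's fit.

import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution

for gen in range(1, 31):
    # Each generation trains only on the previous model's own output.
    samples = rng.normal(mu, sigma, size=20)
    mu, sigma = samples.mean(), samples.std()  # refit on synthetic data
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")

# sigma tends to shrink with each generation: diversity collapses as
# the model keeps consuming its own samples.
```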
AI researchers and companies are looking for solutions to the data-scarcity problem. Ari Morcos, whose startup DatologyAI is working on better data-selection tools to reduce the cost of training AI models, notes that the data shortage is a frontier research problem. Meanwhile, OpenAI is discussing the creation of a 'data marketplace' that could help alleviate the shortage by determining how much individual data points contribute to model training.
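The article does not describe how such a marketplace would price contributions, but one standard way to score a data point is leave-one-out valuation: measure how validation loss changes when that point is removed from training. The tiny least-squares model below is a stand-in for illustration; production systems would use far cheaper approximations such as influence functions or Shapley-value estimates.

```python
# Minimal sketch of leave-one-out (LOO) data valuation on a toy
# least-squares problem. All data and model choices here are assumed.

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

# Toy training and validation sets.
X = rng.normal(size=(20, 3))
y = X @ true_w + rng.normal(scale=0.1, size=20)
X_val = rng.normal(size=(50, 3))
y_val = X_val @ true_w

def val_loss(train_idx: np.ndarray) -> float:
    """Fit least squares on a subset of the data; return validation MSE."""
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    return float(np.mean((X_val @ w - y_val) ** 2))

base = val_loss(np.arange(len(X)))
for i in range(5):  # value the first few training points
    loo = val_loss(np.delete(np.arange(len(X)), i))
    # Positive score: validation loss rises without the point, so it helped.
    print(f"point {i}: leave-one-out value = {loo - base:+.6f}")
```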
Data shortages pose a major challenge to AI development, and companies are exploring different ways to address the problem. From synthesizing data to creating data marketplaces, the AI field is constantly looking for breakthroughs to secure the data resources needed to train the next generation of powerful AI models.