MIT Technology Review published an article on its official website noting that as ChatGPT and other large models continue to gain popularity, the demand for training data keeps growing. Large models are like a "black hole" on the network, constantly absorbing data, and eventually there will not be enough data left to train them.
The well-known AI research institute Epochai has published a paper on precisely this training-data problem, pointing out that by 2026 large models will have exhausted all high-quality data, and by 2030-2050 they will have exhausted all low-quality data;
By 2030-2060, all image training data will be consumed as well. (The data here refers to raw data that has not been labeled or polluted.)
Paper address: https://arxiv.org/pdf/2211.04325.pdf
In fact, the training data problem has already surfaced. OpenAI has said that the shortage of high-quality training data will be one of the major challenges in developing GPT-5. It is like a person going to school: once your knowledge has reached the doctoral level, showing you junior-high material will not help you learn anything new.
Therefore, to strengthen GPT-5's learning, reasoning, and general (AGI) capabilities, OpenAI has set up a "data alliance", hoping to collect private, long-form text, video, audio, and other data at scale so that its models can deeply simulate and learn how humans think and work.
Currently, partners such as the Government of Iceland and the Free Law Project have joined the alliance, providing OpenAI with a variety of data to help accelerate its model development.
In addition, as AI-generated content from models such as ChatGPT, Midjourney, and Gen-2 floods onto the public internet, it will seriously pollute the pool of public data that humans have built up, making the data more homogeneous and one-dimensional in its logic and accelerating the consumption of high-quality data.
High-quality training data is crucial for large model development
From a technical perspective, a large language model can be viewed as a "language prediction machine": it learns association patterns between words from large amounts of text data and then uses those patterns to predict the next word or sentence.
The Transformer is one of the best-known and most widely used architectures for this, and ChatGPT and similar models are built on it.
In simple terms, large language models are "copycats": they say what humans have already said. That is why, when you use a model such as ChatGPT to generate text, the narrative patterns of the output often feel like something you have seen before.
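To make the "prediction machine" idea concrete, here is a minimal, self-contained sketch that learns which word tends to follow which from a tiny corpus and then predicts the next word. It is only an illustration of next-word prediction; real models such as ChatGPT learn these associations with a Transformer neural network, not a word-count table.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "large amounts of text data".
corpus = "the model predicts the next word . the model learns patterns from text ."

# Learn association patterns: count which word follows which (a bigram table).
follow_counts = defaultdict(Counter)
tokens = corpus.split()
for prev, nxt in zip(tokens, tokens[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = follow_counts[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))   # ('model', 0.67): "model" follows "the" most often here
```

The same principle scales up: with more (and better) text, the learned associations become richer, which is exactly why the quality and quantity of training data matter so much.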
Therefore, the quality of the training data directly determines whether the patterns the large model learns are accurate. If the data is full of grammatical errors, poor word choices, incorrect punctuation, false content, and so on, the text the model predicts will naturally carry the same problems.
For example, if a translation model is trained on fabricated, low-quality content, the translations it produces will naturally be poor.
This is why we often see models with very small parameter counts outperform much larger models in capability and output quality: one of the main reasons is that they were trained on high-quality data.
In the era of big models, data is king
Because data matters this much, high-quality training data has become a strategic resource that companies such as OpenAI, Baidu, Anthropic, and Cohere must compete for; it is the "oil" of the large-model era.
As early as March this year, while domestic companies were still scrambling to develop large models, Baidu had already taken the lead in launching Wenxin Yiyan, a generative AI product benchmarked against ChatGPT.
In addition to its strong R&D capabilities, the vast Chinese corpus Baidu has accumulated over 20 years of running its search engine has been a great help, playing an important role across multiple iterations of Wenxin Yiyan and keeping it ahead of other domestic vendors.
High-quality data typically includes published books, literary works, academic papers, school textbooks, news reports from authoritative media, Wikipedia, Baidu Encyclopedia, and similar sources: text, video, and audio that have been vetted by humans over time.
However, research institutions have found that this kind of high-quality data grows very slowly. Publishing a single book, for example, involves market research, a first draft, editing, and further review, a process that takes months or even years. This pace of data production lags far behind the growth in demand for large-model training data.
Judging from the development of large language models over the past four years, the amount of training data they use has grown by more than 50% a year; at that pace, the data needed to train these models roughly doubles every one to two years in order to keep improving performance and capability.
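Taking the article's 50% figure at face value, a couple of lines of arithmetic show how quickly that kind of compounding adds up:

```python
import math

annual_growth = 0.50   # annual growth in training-data demand, as cited above
doubling_time = math.log(2) / math.log(1 + annual_growth)
print(f"Demand doubles roughly every {doubling_time:.1f} years")       # ~1.7 years
print(f"10-year demand multiplier: {(1 + annual_growth) ** 10:.0f}x")  # ~58x
```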
This is why many countries and companies strictly protect data and have introduced regulations to that effect: on the one hand, to keep user privacy from being collected by third-party organizations and to prevent data from being stolen or abused;
on the other hand, to prevent important data from being monopolized and hoarded by a handful of institutions, leaving no data available for technological research and development.
By 2026, high-quality training data may run out
To study the problem of training-data consumption, Epochai's researchers simulated the language and image data the world will generate each year from 2022 to 2100 and calculated the total amount of that data.
They also simulated the rate at which large models such as ChatGPT consume data, then compared the data growth rate with the data consumption rate and reached the following conclusions:
If large models keep developing at the current pace, all low-quality data will be consumed between 2030 and 2050, and high-quality data will most likely be exhausted by 2026.
All image training data will be consumed between 2030 and 2060, and by around 2040 the functional iteration of large models may show signs of slowing down for lack of training data.
The researchers used two models in their calculations. The first model looks at the growth trends of the datasets actually used for large language models and image models, extrapolates the historical statistics, and predicts when dataset sizes will peak and what average consumption will be.
The second model predicts how much new data the world will generate each year in the future. It is based on three variables: global population, internet penetration rate, and the average amount of data each internet user produces per year.
The researchers used United Nations data to fit a population growth curve, used an S-shaped (logistic) function to fit the internet penetration rate, and made the simple assumption that annual data output per person stays roughly constant. Multiplying the three together gives an estimate of the new data the world produces each year.
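A rough sketch of that second model might look like the following. The curves and constants below are illustrative placeholders, not the values Epochai actually fitted:

```python
import math

def population(year):
    """Illustrative population curve (billions); Epochai fit UN projections instead."""
    return 8.0 + 0.05 * (year - 2022)          # placeholder near-linear growth

def internet_penetration(year):
    """S-shaped (logistic) curve for the share of people online; parameters are made up."""
    return 1.0 / (1.0 + math.exp(-0.08 * (year - 2010)))

DATA_PER_USER_PER_YEAR = 50.0   # assumed constant output per internet user (arbitrary units)

def new_data(year):
    """New data produced worldwide in a year = population x penetration x per-user output."""
    return population(year) * 1e9 * internet_penetration(year) * DATA_PER_USER_PER_YEAR

for year in (2022, 2030, 2050):
    print(year, f"{new_data(year):.3e}")
```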
This model accurately predicted the monthly output of Reddit (a well-known forum), so its accuracy is considered fairly high.
Finally, the researchers combined the two models to reach the above conclusions.
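Combining the two models amounts to extrapolating the demand curve and the supply curve and looking for the year they cross. A stylized version, with made-up starting values and growth rates rather than the paper's fitted numbers, looks like this:

```python
# Stylized intersection of data demand and data stock (all numbers are illustrative).
stock = 1.0e14          # assumed current stock of usable text data (tokens)
stock_growth = 0.07     # assumed annual growth of the stock (new human-generated data)
demand = 1.0e12         # assumed tokens consumed by frontier training runs this year
demand_growth = 0.50    # annual growth in data demand, as cited in the article

year = 2022
while demand < stock and year < 2100:
    stock *= 1 + stock_growth
    demand *= 1 + demand_growth
    year += 1

print("Under these toy assumptions, demand overtakes the available stock around", year)
```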
The researchers say that although these figures are simulations and estimates with real uncertainty, they sound an alarm for the large-model community: training data may soon become a major bottleneck constraining the scaling and application of AI models.
AI vendors need to plan effective methods for regenerating and synthesizing data in advance, to avoid running into a cliff-edge data shortage while developing large models.
The source material for this article is from the MIT Technology Review official website and Epochai's paper. If there is any infringement, please contact us to delete it.