Recently, the shortage of training data for large AI models has once again become a focus of media attention. A recent article in The Economist, "AI companies will soon use up most of the Internet's data", has sparked widespread discussion in the industry. The article points out that as high-quality Internet data is depleted, the AI field is running into a "data wall".
Research firm Epoch AI predicts that all high-quality text data on the Internet will be exhausted by 2028, and that machine learning datasets may run out of "high-quality language data" as early as 2026. This "data wall" has become a major problem for AI companies and may slow the pace of model training.
The industry has long warned about this problem. In July 2023, UC Berkeley professor Stuart Russell warned that AI-driven bots such as ChatGPT may soon "exhaust the text in the universe." Not everyone agrees, however. In May 2024, Stanford University professor Fei-Fei Li said that a large amount of differentiated data is still waiting to be mined and could be used to build more customized models.
To cope with data shortages, synthetic data has emerged as a potential solution. However, a recent paper published in the journal Nature pointed out that training future generations of machine learning models on AI-generated datasets may lead to "model collapse", causing the models to misperceive reality. The research team recommends retaining some original data in the training data, using diversified data sources, and studying more robust training algorithms.
How to break through the "data wall" and ensure a continuous supply of high-quality training data has become an urgent issue for the AI industry. This requires not only technological innovation but also the joint efforts of governments, enterprises, and research institutions. As AI technology becomes increasingly integrated into all walks of life, solving the data shortage will have a profound impact on the continued healthy development of AI.