AI Language Models Facing Data Shortage by 2032
A new study by research group Epoch AI projects that artificial intelligence systems like ChatGPT could soon run short of the vast amounts of text data that have been crucial in making them smarter. The study suggests that tech companies will exhaust the supply of publicly available training data for AI language models around the turn of the decade, sometime between 2026 and 2032.
Describing the situation as a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, highlighted the challenges the AI field might encounter in maintaining its current pace of progress once the reserves of human-generated writing run out.
In response to this impending data shortage, tech companies like OpenAI and Google are racing to secure high-quality data sources to train their AI models. However, the study warns that there may not be enough new content, such as blogs, news articles, and social media commentary, to sustain the current trajectory of AI development.
The study, which will be presented at the International Conference on Machine Learning in Vienna, Austria, also raises concerns about the reliance on synthetic data and the potential risks of overtraining AI models on limited sources.
As the AI industry grapples with the data bottleneck, experts emphasize the importance of finding ways to improve AI systems that do not rely solely on ever-larger amounts of text data. The study underscores that sustaining progress in AI will require new approaches once fresh data becomes scarce.