AI Model Collapse
In a recent Forbes article, Bernard Marr walks us through a looming problem: model collapse. As detailed in a Nature article, model collapse occurs when AI models are trained on data that includes content generated by earlier models. The authors argue that such models drift further and further from the original data distribution until they no longer accurately represent the world. At that point, AI begins to make mistakes that compound, leading to distorted and unreliable outputs. Marr points out that model collapse could have profound implications for businesses, technology, and our entire digital ecosystem.
Most AI models, including GPT-4, are trained on data from the internet, which is initially generated by humans and reflects the diversity and complexity of human language, behavior, and culture. The challenge begins when the next generation of models is trained on a mix of human-generated data and data created by earlier AI models. AI then begins to “learn” from its own imperfect outputs, and the model’s understanding of the world gradually degrades.
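This feedback loop can be sketched with a toy statistical model (my own illustration, not from the article or the Nature paper): fit a Gaussian to a finite sample, then "train" each new generation only on samples drawn from the previous generation's fit. Sampling error compounds from one generation to the next, and the fitted distribution narrows until it no longer resembles the original data.

```python
# Toy sketch of recursive training on model output (illustrative only).
# Generation 0 is the "human" data distribution; every later generation
# is fit solely to samples drawn from the generation before it.
import random
import statistics

random.seed(42)

mu, sigma = 0.0, 1.0   # the original "human" data distribution
n = 10                 # small samples per generation exaggerate the effect

history = [sigma]
for generation in range(500):
    # "Generate content" with the current model...
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    # ...then "train" the next model only on that content.
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    history.append(sigma)

# Sampling error compounds across generations: the fitted spread shrinks
# toward zero, and the model "forgets" the variety in the original data.
print(f"initial sigma: {history[0]:.3f}, final sigma: {sigma:.3g}")
```

The collapse of `sigma` toward zero is the one-dimensional analogue of a model losing the tails and diversity of its original training distribution.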
The implications are far-reaching. If AI models continue to train on AI-generated data, the quality of everything from automated customer service to online content and even financial forecasting may decline.
Preventing model collapse requires ensuring that AI continues to be trained on high-quality, human-generated data. The paradox is that AI needs human data to function effectively, but the internet is becoming flooded with AI-generated content.
Using human data involves significant ethical and legal challenges, including who owns the data. Do individuals have rights over the content they create, and can they object to its use in training AI?
Initial models trained on purely human-generated data are likely to be the most accurate and reliable, creating an opportunity for early adopters of the technology. As more and more AI-generated content floods the internet, future models will be at greater risk of collapse, and the advantages of using AI will diminish.
Marr lays out the solution: ensuring that AI models continue to learn from diverse, authentic human experiences is essential to preserving their accuracy and relevance. He also explains that greater transparency and collaboration within the AI community are needed. By sharing data sources, training methodologies, and the origins of content, AI developers can help prevent the inadvertent recycling of AI-generated data. Businesses and AI developers should also consider integrating periodic “resets” into the training process. Reintroducing models to fresh, human-generated data counteracts the gradual drift that leads to model collapse.
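The effect of such a reset can be shown with the same kind of toy model (again my own sketch, not a method from the article, and the 50% mixing ratio is an assumption chosen for illustration): instead of training each generation purely on the previous model's output, mix in fresh samples from the original "human" distribution every generation.

```python
# Toy counterpart to the "reset" idea (illustrative only): each
# generation's training set blends fresh "human" data with
# model-generated data, rather than using model output alone.
import random
import statistics

random.seed(42)

HUMAN_MU, HUMAN_SIGMA = 0.0, 1.0   # the fixed human data distribution
mu, sigma = HUMAN_MU, HUMAN_SIGMA
n = 10                 # samples per generation
human_fraction = 0.5   # assumed mixing ratio, chosen for illustration

history = []
for generation in range(500):
    n_human = int(n * human_fraction)
    samples = (
        [random.gauss(HUMAN_MU, HUMAN_SIGMA) for _ in range(n_human)]  # fresh human data
        + [random.gauss(mu, sigma) for _ in range(n - n_human)]        # model-generated data
    )
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    history.append(sigma)

# Anchoring every generation to real data keeps the fitted spread from
# collapsing toward zero, unlike the purely self-referential loop.
print(f"final sigma: {sigma:.3f}")
```

In this sketch the fitted spread fluctuates but stays anchored near the human distribution, which is the intuition behind periodically reintroducing fresh, human-generated data.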