AI models trained on AI-generated data could spiral into unintelligible nonsense, scientists warn

Artificial intelligence (AI) systems could gradually fill the internet with incomprehensible nonsense, new research has warned.

AI models such as GPT-4, which powers ChatGPT, or Claude 3 Opus rely on the many trillions of words shared online to get smarter, but as they gradually colonize the internet with their own output they may create self-damaging feedback loops.

The end result, called “model collapse” by the team of researchers that investigated the phenomenon, could leave the internet filled with unintelligible gibberish if left unchecked. They published their findings July 24, 2024, in the journal Nature.

“Imagine taking a picture, scanning it, then printing it out, and then repeating the process. Through this process the scanner and printer will introduce their errors, over time distorting the image,” lead author Ilia Shumailov, a computer scientist at the University of Oxford, told Live Science. “Similar things happen in machine learning — models learning from other models absorb errors, introduce their own, over time breaking model utility.”

AI systems are trained on data drawn from human input, which lets their neural networks learn the probabilistic patterns they use to respond to a prompt. GPT-3.5 was trained on roughly 570 gigabytes of text data from the repository Common Crawl, amounting to roughly 300 billion words, taken from books, online articles, Wikipedia and other web pages.

But this human-generated data is finite and will most likely be exhausted by the end of this decade. Once this has happened, the alternatives will be to begin harvesting private data from users or to feed AI-generated “synthetic” data back into models.

To investigate the worst-case consequences of training AI models on their own output, Shumailov and his colleagues trained a large language model (LLM) on human input from Wikipedia before feeding the model’s output back into itself over nine iterations. The researchers then assigned a “perplexity score” to each iteration of the machine’s output. Perplexity measures how surprising a piece of text is to a language model, and the researchers used it as a gauge of how nonsensical the output had become.
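
For readers curious about what a perplexity score looks like in practice, here is a minimal sketch using the standard textbook definition: the exponential of the average negative log-probability a model assigns to each word. The function name and the example probabilities are made up for illustration and are not taken from the study.

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token.

    token_probs is a list of the probabilities a (hypothetical) model
    assigned to each word that actually appeared in the text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Illustrative numbers only: text the model finds predictable
# versus text it finds surprising.
predictable = perplexity([0.9, 0.8, 0.95, 0.85])
surprising = perplexity([0.1, 0.05, 0.2, 0.1])
print(predictable < surprising)
```

A lower score means the text looks more predictable to the model; a higher score means the model finds it more surprising.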

As the generations of self-produced content accumulated, the researchers watched their model’s responses degrade into delirious ramblings. Take this prompt, for which the model was asked to produce the next sentence:

“some started before 1360 — was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish labourers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.”

By the ninth and final generation, the AI’s response was:

“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”

The machine’s febrile rabbiting, the researchers said, is caused by it sampling an ever narrower band of its own output, creating an overfitted and noise-filled response.
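
The narrowing effect can be illustrated with a toy simulation. This is only a sketch, not the researchers’ experiment: the “model” here simply counts word frequencies in its training data and then generates new data by sampling from those counts. The function name, vocabulary size and sample size are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_generations(vocab_size=50, samples_per_gen=100, generations=9, seed=0):
    """Return the number of distinct words seen in each generation's data."""
    rng = random.Random(seed)
    # Generation 0: stand-in for "human" data with a long tail of rare words.
    words = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    data = rng.choices(words, weights=weights, k=samples_per_gen)
    distinct = [len(set(data))]
    for _ in range(generations):
        counts = Counter(data)            # "train" on the current data
        vocab = list(counts)
        freqs = [counts[w] for w in vocab]
        # "generate" the next dataset purely from the model's own estimate
        data = rng.choices(vocab, weights=freqs, k=samples_per_gen)
        distinct.append(len(set(data)))
    return distinct

print(simulate_generations())
```

Because each generation can only reuse words that survived the previous one, the count of distinct words can stay level or fall but never recover. Rare words that happen to be missed in one round are gone for good, a simplified picture of a model sampling an ever narrower band of its own output.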

For now, our store of human-generated data is large enough that current AI models won’t collapse overnight, according to the researchers. But to avoid a future where they do, AI developers will need to take more care about what they choose to feed into their systems. 

This doesn’t mean doing away with synthetic data entirely, Shumailov said, but it does mean it will need to be better designed if models built on it are to work as intended.

“It’s hard to tell what tomorrow will bring, but it’s clear that model training regimes have to change and, if you have a human-produced copy of the internet stored … you are better off at producing generally capable models,” he added. “We need to take explicit care in building models and make sure that they keep on improving.”

This post was originally published on Live Science
