Why AI Is Killing the Internet: Model Collapse and the Knowledge Commons

The open web ran on a fragile premise: that people would share what they know, for free, in public. For about two decades that premise held. Developers posted answers on Stack Overflow. Students argued on Reddit. Journalists broke stories that Google indexed. The result was a vast, searchable knowledge commons. AI did not just consume that commons. It’s now wrecking the conditions that built it.
This isn’t a wild claim or a Luddite gripe. It’s an economic collapse, on the record, playing out in real time, with hard knock-on effects for AI model quality. The story is worth knowing whether you write code, publish content, do research, or just use the web to learn.
The Death of the Knowledge Commons
The numbers are concrete, and they arrived fast. Stack Overflow, once the world’s largest coding knowledge base, saw question volume drop by about 78% between 2022 and 2025. The site that helped a generation of developers debug code and share fixes went from a thriving community to a platform doing mass layoffs inside three years. That isn’t a gradual decline. It’s a cliff edge.

Stack Overflow isn’t alone. Chegg, the student Q&A and tutoring service, saw its stock drop by about 99% in the same span. Its business model, charging for access to human know-how, broke the moment students could get instant answers from ChatGPT for free. Publishers across the web lost about a third of their search traffic to AI-intercepted queries. Google’s AI Overviews and similar features answer the question without the user ever visiting the source.
Even Sam Altman, CEO of OpenAI, said in 2024 that real human content on the internet is getting harder to find, pushed aside by synthetic output. The “Dead Internet Theory,” once a niche claim that the web was already mostly bot noise, has moved from fringe to something researchers and journalists now document with real data.
This goes beyond nostalgia for old websites. These platforms weren’t just useful services. They were the substrate that AI trained on. Stack Overflow, Reddit, Wikipedia, and millions of personal blogs formed the corpus that made large language models work in the first place. As those platforms hollow out, the conditions for the next round of high-quality training data hollow out with them.
The Mechanics of Destruction: Severing the Incentive Loop
The collapse follows a clear economic logic. It’s worth walking through, because that logic is the reason this is a structural problem, not a short-term blip.
People share knowledge in public for reasons that aren’t purely altruistic. Stack Overflow posters earned reputation scores that signaled hireable skill to employers. Bloggers ran ads and affiliate links, both of which depended on search traffic. Reddit mods built communities and social clout. Wikipedia editors joined a shared project with real cultural weight. These rewards weren’t side effects. They were the engine.
AI training pipelines harvested roughly 15 years of that work without giving anything back to the sources. Common Crawl, the dataset behind most large language models, scraped billions of web pages, capturing the accumulated expertise humans had freely shared online. None of the platforms or individuals who made that content got payment, credit, or even a heads-up.
That extraction would be fine if AI had kept the traffic flowing to the platforms it scraped. Instead, it cut it off. When a developer searches for how to handle a Python exception and gets an AI-generated answer right in the search results, Stack Overflow gets no page view and no ad revenue. When a student asks ChatGPT to explain calculus, Chegg loses a subscriber. The platforms that produced the high-quality training data are now starved of the money they needed to run, moderate, and grow.
The incentive gap then compounds the loss. If a developer knows their Stack Overflow answer won’t be seen, because AI now owns discovery, they have less reason to write a careful, well-sourced reply. Why spend 45 minutes on a thorough answer for reputation points on a dying platform? The result is a feedback loop. Less traffic, less contribution, lower quality, more decline.
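To see why this compounds rather than levels off, here is a toy simulation of the loop. Every constant in it is an illustrative assumption, not a measurement; it only shows how a steady per-quarter interception rate turns into an accelerating decline.

```python
# Toy model of the incentive feedback loop.
# All numbers are illustrative assumptions, not measurements.
traffic = 1.0   # search traffic reaching the platform, normalized
quality = 1.0   # average quality of new contributions, normalized

for quarter in range(1, 13):
    traffic *= 0.85                    # assume AI answers intercept 15% of queries per quarter
    contribution = traffic             # fewer expected readers, less careful writing
    quality = 0.7 * quality + 0.3 * contribution  # corpus quality drifts toward contribution level
    print(f"Q{quarter:2d}: traffic={traffic:.2f}  quality={quality:.2f}")
```

The exact constants don’t matter. Any fixed interception rate produces the same shape: a compounding decline, not a linear one.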
Model Collapse: The Science of Self-Consumption
Model collapse is what happens when this loop hits the training pipeline. In 2024, Oxford researchers published a paper in Nature showing that AI models trained on AI-made data decay in clear, predictable ways. The decay isn’t subtle or slow. It piles up across generations.
Rice University researchers coined a sharper name for it: Model Autophagy Disorder, or MAD. The analogy is a photocopier copying a photocopy. The first copy is almost identical to the original. The second is a bit worse. By the tenth round, the text is blurry and the details are gone. With AI models, the “blur” shows up as output that grows more bland, more repetitive, and less able to handle odd or edge-case problems.

What gets lost first tells the real story. Human writing has long tails: odd phrasings, niche know-how, creative leaps, off-script ways to solve problems. A clever fix for a hard engineering bug isn’t in the middle of the bell curve. It’s in the tail. When models train on synthetic data, those tails get cut off. Each new model gets better at confident, fluent, middle-of-the-curve output, and worse at the novel, edge-case reasoning that actually moves the field.
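A toy simulation makes the tail-loss mechanism concrete. Here the “model” is just a single Gaussian fit to the previous generation’s output, far simpler than anything in the Nature experiments, but the same failure shows up: the rare second mode is smeared away after one generation, and the distribution drifts from there.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Generation 0": human data with a rare but distinct second mode,
# standing in for the long tail of niche expertise.
data = np.concatenate([
    rng.normal(0.0, 1.0, 900),   # common, middle-of-the-curve content
    rng.normal(5.0, 0.5, 100),   # rare edge-case knowledge
])

for gen in range(1, 11):
    # Each "model" is a single Gaussian fit to the previous generation's output...
    mu, sigma = data.mean(), data.std()
    # ...and the next generation trains only on samples from that model.
    data = rng.normal(mu, sigma, data.size)
    tail = (data > 4.0).mean()   # how much of the rare mode survives
    print(f"gen {gen:2d}: mean={mu:.2f}  std={sigma:.2f}  share>4.0={tail:.3f}")
```

The rare mode never comes back. Once the model family has averaged it away, later generations can only sample from what survived.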
The upshot is an AI stack that grows more skilled at sounding sure of itself while losing its grip on real new insight. The output feels polished. The reasoning sounds clean. But the skill of drawing fresh links across varied human work shrinks with each training cycle. Production teams already live with these symptoms, and fixing LLM hallucinations in deployed systems is now a design problem, not just a prompting one.
The Data Wall: Exhausting the Internet
Model collapse is a quality problem. There’s also a quantity problem.
Epoch AI, a research group that tracks AI compute and data trends, has estimated that the stock of high-quality human-written text usable for training large language models could be used up by 2028. This isn’t a claim that the internet runs out of words. There will always be more text. It’s a claim that the pool of carefully written, factually grounded, useful human text is finite, and AI training is burning through it faster than humans can produce it.

The contamination problem then makes the squeeze worse. The fast growth of synthetic content makes it harder to filter “clean” human data for future training runs. Today’s AI-text detectors work fairly well against older, weaker models. As models improve, their output blends into human writing under automated checks. Future training sets will contain large shares of synthetic text that can’t be reliably spotted or stripped out.
So the data wall isn’t just a ceiling. It’s a speed-up. The more AI content floods the web, the harder it gets to find clean human signal, the more future models must lean on synthetic data, and the faster model collapse runs. It’s a compounding curve, not a straight line.
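A back-of-the-envelope projection shows the compounding. All the constants below are assumptions chosen for illustration; the point is the shape of the curve, not the specific values.

```python
# Toy projection of synthetic-text share in future training corpora.
# Every number is an illustrative assumption, not a measurement.
corpus_human = 10.0     # starting stock: ~10 "years" of clean human text
corpus_synth = 0.0

human_per_year = 1.0    # new clean human text per year (assumed flat)
synth_per_year = 1.0    # synthetic text published per year (assumed)
detector_recall = 0.7   # share of synthetic text filters actually catch (assumed)

for year in range(1, 9):
    corpus_human += human_per_year
    corpus_synth += synth_per_year * (1 - detector_recall)  # what slips past filters
    synth_per_year *= 1.5   # synthetic output itself keeps growing (assumed)
    share = corpus_synth / (corpus_human + corpus_synth)
    print(f"year {year}: synthetic share of corpus = {share:.1%}")
```

With these assumptions the synthetic share starts in the low single digits and climbs toward half the corpus within a decade, and weaker detectors or faster synthetic growth only steepen the curve.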
Epistemic Stagnation: A Future Without New Knowledge
The deepest worry here isn’t about AI model quality. It’s about the future of human knowledge itself.
When a developer hits an odd bug, say a rare clash between library versions, OS quirks, and hardware, and solves it with ChatGPT in a private chat, that fix never reaches anyone else. The bug is patched. The problem goes away. But the knowledge is never posted, never indexed, never available to the next developer who hits the same edge case. Under the old model, that developer would have posted a Stack Overflow question, gotten feedback, and added a searchable record to the commons. The knowledge would stick around and stay findable.
This is the broken knowledge loop. AI answers feel like a faster way to solve problems. In the short term, for the user, they often are. But every private AI chat that takes the place of a public forum post is knowledge that never lands in the public record. Across millions of developers and thousands of daily chats, the web’s stock of new fixes stalls. Security audits of AI-generated applications have started to put numbers on that quality gap: thousands of bugs in code that passed a quick review because the model sounded sure of itself.
There’s also a deeper issue about recombination versus discovery. Today’s AI systems recombine known knowledge with strong fluency, finding patterns across text and stitching coherent output. They can’t find new facts about the world that aren’t already implied by their training data. When the training data stops getting fresh human discovery, AI works on a fixed pool. AI systems train on AI output that itself trained on earlier AI output, all the way back to a human corpus that’s no longer being meaningfully updated. The model keeps producing confident answers. The knowledge base behind them sets like concrete.
The web starts to feel like a hall of mirrors. Authoritative-sounding answers point back to other authoritative-sounding answers, with the whole structure sitting on a foundation that stopped updating years ago.
Cautious Optimism and the Road Forward
None of this means the picture is settled. There are real technical responses worth tracking, even if none of them fully fix the problem.
Self-play, where a model learns from its own reasoning, is the most likely near-term fix. Models like DeepSeek R1 showed that a model can generate its own training signal by reasoning through problems step by step and checking outputs against ground truth in fields where checking is feasible: math, coding, formal logic. If this scales, it could partly decouple model quality from the human text pool, at least in those checkable fields. The limit is real. Verification needs hard ground truth, and most human writing, on history, ethics, politics, and creative work, has no such ground truth.
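Here is a minimal sketch of that verify-and-keep pattern, with toy arithmetic standing in for the model and the verifier. Both functions are hypothetical stand-ins, not DeepSeek’s actual pipeline; the structure is what matters: generate candidates, check them against ground truth, keep only the survivors as training data.

```python
import random

random.seed(0)

def toy_model(a: int, b: int) -> int:
    """Hypothetical stand-in for an LLM solving a + b, sometimes wrongly."""
    return a + b + random.choice([-1, 0, 0, 0, 1])

def verify(a: int, b: int, answer: int) -> bool:
    """Ground truth exists here because arithmetic is mechanically checkable."""
    return answer == a + b

# Sample candidate solutions, keep only the verified ones, and treat the
# survivors as new training data. No human text is consumed in this loop.
training_set = []
for _ in range(1000):
    a, b = random.randint(0, 99), random.randint(0, 99)
    for candidate in (toy_model(a, b) for _ in range(4)):
        if verify(a, b, candidate):
            training_set.append((a, b, candidate))

print(f"kept {len(training_set)} verified examples out of 4000 samples")
```

The whole loop works only because verify exists. For history, ethics, or prose style there is no equivalent function to call, which is exactly the limit described above.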
Some history is worth holding here, though it shouldn’t be used to wave the current problem away. The internet did push aside encyclopedias, but it also gave us Wikipedia and GitHub. The printing press wrecked the scribal trade, and the result was mass literacy. Technologies that kill old knowledge formats sometimes build new ones nobody saw coming. Whether AI follows that pattern is an open question, and anyone who claims to know the answer with confidence is guessing.
Stack Overflow and similar platforms are already trying to adapt, folding AI tools into their workflows instead of pretending they can fight free AI chat on equal terms. Whether that works depends on whether they can rebuild the incentive stack (reputation, community, discovery) that made contribution worthwhile when search traffic still flowed.
The timing issue is harder to wave away. The Industrial Revolution caused severe disruption, but it played out over generations. Communities had time to adapt, move, retrain. The current shift in how knowledge is made and consumed is measured in quarters, not decades. The knowledge commons that took 20 years to build could erode badly inside five. That isn’t a comfortable timeline for adaptation, for the platforms trying to pivot, or for AI models that need fresh human-written data to dodge the compounding effects of self-consumption.
What’s being lost is important well beyond the platforms themselves. The value of publicly shared human expertise was never just about having answers on tap. It was about a constantly updated, collectively kept record of how humans are actually solving problems right now. When that record stops updating, AI models trained on it will slowly produce answers about a world that no longer works the way they describe.