Blogger Matt Webb points out that nations have begun to need a strategic fact reserve, now that LLMs and other AI models are starting to consume and re-process the slop they themselves have produced.
The future needs trusted, uncontaminated, complete training data.
From the point of view of national interests, each country (or each trading bloc) will need its own training data, as a reserve and as a hedge against the interests of others.
Probably the best way to start is to take a snapshot of the internet and keep it somewhere really safe. We can sift through it later; the world's data will never be more available or less contaminated than it is today. Like when GitHub stored all public code in an Arctic vault (02/02/2020): a very-long-term archival facility 250 meters deep in the permafrost of an Arctic mountain. Or the Svalbard Global Seed Vault.
But actually I think this is a job for librarians and archivists.
What we need is a long-term national programme to slowly, carefully accept digital data into a read-only archive. We need the expertise of librarians, archivists and museums in the careful and deliberate process of acquisition and accessioning (PDF).
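To make the accessioning idea concrete, here is a minimal sketch in Python of what accepting a file into a read-only, content-addressed archive might look like. Everything in it (the archive path, the manifest fields) is illustrative, not any real archival standard:

    # Sketch: accession a file into a read-only, content-addressed archive.
    # All paths and metadata fields are hypothetical/illustrative.
    import hashlib
    import json
    import shutil
    import stat
    from datetime import datetime, timezone
    from pathlib import Path

    ARCHIVE_ROOT = Path("/srv/fact-reserve")  # hypothetical archive location

    def accession(source_file: Path, provenance: str) -> Path:
        """Copy a file into the archive under its SHA-256 hash, record
        provenance metadata, and mark both files read-only."""
        digest = hashlib.sha256(source_file.read_bytes()).hexdigest()
        dest_dir = ARCHIVE_ROOT / digest[:2] / digest  # shard by hash prefix
        dest_dir.mkdir(parents=True, exist_ok=False)   # refuse to overwrite

        payload = dest_dir / source_file.name
        shutil.copy2(source_file, payload)

        record = {
            "sha256": digest,
            "original_name": source_file.name,
            "provenance": provenance,
            "accessioned_at": datetime.now(timezone.utc).isoformat(),
        }
        manifest = dest_dir / "accession.json"
        manifest.write_text(json.dumps(record, indent=2))

        # Read-only from the moment of accession: the archive never mutates.
        for f in (payload, manifest):
            f.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        return dest_dir

The point of the hash-named directories and the read-only permissions is that an accession, once made, can be verified later but never silently altered.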
(Look, if this is an excuse for governments to funnel money to the cultural sector, then so much the better.)
It should start today.
Already, AI slop is filling the WWW and starting to drown out legitimate, authoritative sources through sheer volume.
Previously
(2025) Meta's AI Profiles Are Already Polluting Instagram and Facebook With Slop
(2024) Thousands Turned Out For Nonexistent Halloween Parade Promoted By AI Listing
(2024) Annoyed Redditors Tanking Google Search Results Illustrates Perils of AI Scrapers
(Score: 4, Insightful) by DadaDoofy on Tuesday January 21 2025, @02:26PM (3 children)
"trusted, uncontaminated, complete training data."
And just how would this be accomplished? Anything done by humans is inherently biased. Why would the curation of AI training data be any different?
AI is being sold as some kind of impartial arbiter of knowledge, but that couldn't be further from the truth. The sooner people catch on, the sooner this bubble will pop and we can move on.
https://www.discovermagazine.com/technology/ai-systems-reflect-the-ideology-of-their-creators-say-scientists [discovermagazine.com]
(Score: 0, Flamebait) by Freeman on Tuesday January 21 2025, @03:02PM
That's okay, they'll just train the AI that holds the keys to the Nuclear Launch Codes on the likes of "A Modest Proposal" and "Mein Kampf".
Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
(Score: 3, Disagree) by VLM on Tuesday January 21 2025, @06:43PM (1 child)
It's the classic echo-chamber subculture problem, where a weird small group (perhaps an LLM) feeds upon its own output until it's space-alien time, unable to interact with the rest of reality.
For a human-ish example, look how weird Reddit posters are, or at least can be, compared to functioning humans IRL. The TDS salt there has been hilarious in recent days LOL.
With real human-generated data costing much more than nothing, and AI slop output costing near nothing, the usual people will try to cut corners by feeding the output back into the input, and it's not going to turn out very well.
At some point a disconnected groupthink subculture becomes useless to the larger culture that's theoretically funding it on the idea that it'll be useful. For example, could an LLM trained on itself until it's indistinguishable from 4chan, complete with n-word stacks and rule 34 memeposting, write useful text for ... anything else? Probably not.
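This feedback degradation has been studied under the name "model collapse". A toy sketch in Python (not how LLMs actually train, just the same feedback loop in miniature): repeatedly refit a Gaussian to samples drawn from the previous generation's fit, with no fresh real data. With a small sample per generation, the fitted spread on average shrinks a little every round:

    # Toy model-collapse demo: each generation trains only on the
    # previous generation's output. Finite-sample error compounds,
    # and the fitted spread tends to wither toward zero.
    import random
    import statistics

    random.seed(1)
    N = 20                       # samples per generation (deliberately small)
    mu, sigma = 0.0, 1.0         # generation 0: the "real" distribution

    for gen in range(1, 201):
        samples = [random.gauss(mu, sigma) for _ in range(N)]
        mu = statistics.fmean(samples)       # refit on our own output...
        sigma = statistics.stdev(samples)    # ...and only our own output
        if gen % 50 == 0:
            print(f"gen {gen:3d}: mu={mu:+.4f} sigma={sigma:.4f}")

The drift is one-way on average: sampling noise occasionally widens the fit, but each generation tends to lose a little of the tails, and there's no fresh data to put them back.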
Then too there are other philosophical arguments, like the claim that social communities always go right wing without massive centralized censorship and control, so an LLM feeding back on itself will inevitably go full Mein Kampf. Now whether that's actually bad is open for debate, if it's the natural, inevitable evolution of thought, but some folks, mostly from the left, don't like that factual interpretation of reality.