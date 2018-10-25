from the a-copy-of-a-copy-of-a-copy dept.
Artificial intelligence has seen a meteoric rise, but it's facing a shortage of training data:
"We've already run out of data," Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, said on the bank's "Exchanges" podcast published on Tuesday.
Raphael said that this shortage may already be influencing how new AI systems are built.
He pointed to China's DeepSeek as an example, saying one hypothesis is that its purported development costs came from training on the outputs of existing models rather than entirely new data.
[...] With the web tapped out, developers are turning to synthetic data — machine-generated text, images, and code. That approach offers limitless supply, but also risks overwhelming models with low-quality output or AI slop.
However, Raphael said he doesn't think the lack of fresh data will be a massive constraint, in part because companies are sitting on untapped reserves of information.
Rick Beato talked about [15:29 --JE] how he broke ChatGPT with a simple question and exposed the gaps in AI's "knowledge" that are filled with synthetic data.
Like you, I'm sick to the back teeth of talking about AI. Like you, I keep getting dragged into discussions of AI. Unlike you, I spent the summer writing a book about why I'm sick of writing about AI, which Farrar, Straus and Giroux will publish in 2026.
A week ago, I turned that book into a speech, which I delivered as the annual Nordlander Memorial Lecture at Cornell, where I'm an AD White Professor-at-Large. This was my first-ever speech about AI and I wasn't sure how it would go over, but thankfully, it went great and sparked a lively Q&A. One of those questions came from a young man who said something like "So, you're saying a third of the stock market is tied up in seven AI companies that have no way to become profitable and that this is a bubble that's going to burst and take the whole economy with it?"
I said, "Yes, that's right."
He said, "OK, but what can we do about that?"
So I re-iterated the book's thesis: that the AI bubble is driven by monopolists who've conquered their markets and have no more growth potential, who are desperate to convince investors that they can continue to grow by moving into some other sector, e.g. "pivot to video," crypto, blockchain, NFTs, AI, and now "super-intelligence." Further: the topline growth that AI companies are selling comes from replacing most workers with AI, and re-tasking the surviving workers as AI babysitters ("humans in the loop"), which won't work. Finally: AI cannot do your job, but an AI salesman can 100% convince your boss to fire you and replace you with an AI that can't do your job, and when the bubble bursts, the money-hemorrhaging "foundation models" will be shut off and we'll lose the AI that can't do your job, and you will be long gone, retrained or retired or "discouraged" and out of the labor market, and no one will do your job. AI is the asbestos we are shoveling into the walls of our society and our descendants will be digging it out for generations:
The only thing (I said) that we can do about this is to puncture the AI bubble as soon as possible, to halt this before it progresses any further and to head off the accumulation of social and economic debt. To do that, we have to take aim at the material basis for the AI bubble (creating a growth story by claiming that defective AI can do your job).
(Score: 2) by mrpg on Sunday October 19, @02:40PM (4 children)
A summary of the 15-minute video, plis? 52!, ok, and then? Broke how? I only saw 3 minutes.
(Score: 3, Informative) by jelizondo on Sunday October 19, @03:56PM (3 children)
The video is quite interesting, much more than TFA.
There are two threads: 1) AI can't say a large number (52!); it can calculate it but just can't speak it out. 2) The guy asks about music (recording, mixing, arranging) and AI replies that it knows all about it because it has studied the greats in the business. The guy says AI can't know what it is talking about, because there are no interviews or other materials on the Internet on how great sound engineers, composers, etc. do their work, so AI can't have studied it.
(Score: 5, Insightful) by mcgrew on Sunday October 19, @08:41PM (2 children)
The video is quite interesting
I'm not a good video learner. People never speak correctly, have hard-to-understand accents, misspeak, and are basically shitty vocal communicators, while the written word is usually at least clear. Videos are for demonstrating motion. I neither need nor want talking heads, I'm literate and can read ten times as fast as you can talk unless you're an auctioneer, and who can understand THEM?
(Score: 2) by jelizondo on Sunday October 19, @10:04PM (1 child)
I agree with you. When I want to learn something, I want to read it. Video just doesn't have that power: you can stop reading and think about what you just read, you can go back and re-read it, etc. without having to do anything other than move your eyes.
What I meant by "interesting" is that the guy gives actual examples of AI's inability, and of its lies about subjects about which it can't really know anything.
(Score: 0) by Anonymous Coward on Sunday October 19, @11:15PM
Words just don't do it as well.
As for reading, for youtube videos sometimes the transcript part is good enough to figure stuff out quickly, despite the transcription errors.
(Score: 5, Insightful) by Mojibake Tengu on Sunday October 19, @02:52PM (3 children)
Obviously, if AIs keep reproducing themselves by a trivial copy-merge method like humans do, there is a high risk of generational degradation similar to that seen in the progeny of close human relatives, now forbidden in human cultures because of the socially observed collapse of intellectual capacity.
Problem is not in data, but in method of its combination, which naturally amplifies divergences.
I'd suggest a humble return to logical artificial intellects like Logic Programming and keep generative LLMs only for toys.
Of course that means bringing up next generations of programmers who understand logic. Rationalism is not possible with current human population.
(Score: 3, Touché) by mcgrew on Sunday October 19, @08:43PM
Problem is not in data, but in method of its combination
The problem is that AI cannot make anything out of nothing like humans can.
(Score: 2, Interesting) by Anonymous Coward on Monday October 20, @12:34AM
Inbreeding can strengthen traits:
https://en.wikipedia.org/wiki/Ashkenazi_Jews [wikipedia.org]
https://en.wikipedia.org/wiki/John_von_Neumann#Mathematical_quickness [wikipedia.org]
Some traits might be unwanted of course:
https://en.wikipedia.org/wiki/Medical_genetics_of_Jews [wikipedia.org]
(Score: 2) by driverless on Monday October 20, @04:18AM
This has been known for a while; it's called model collapse. We've already seen signs of it for a year or so now, just patched over or around by rerunning the query or tweaking the model.
(Score: 4, Informative) by looorg on Sunday October 19, @03:06PM (1 child)
So AI is now going to eat its own output? Great. I hope they choke on it. This will be funtastic. Cause copies-of-copies-of-copies have never gone wrong before ... It's like animals that mate with siblings. Output not great. It should be easier than ever to spot them cause eventually they won't even sound, or write, like humans anymore. They copy themselves and won't be able to tell the difference.
(Score: 3, Funny) by Gaaark on Sunday October 19, @03:36PM
Yup...just look at Prince Charles! (HE'S NOT MUH KING!)... and Prince Andrew
(Score: 5, Insightful) by pTamok on Sunday October 19, @03:45PM (1 child)
There's a concept in statistics: Regression to(wards) the mean [wikipedia.org].
Given that the output of LLMs is the statistically most likely following text from the prompt, feeding back results as input will reinforce the statistically most likely text - in other words, relatively rare text will become rarer; so what you will get is not necessarily the correct answers to questions, but the most frequently occurring answers, reinforced by repetition, which could well be wrong.
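The feedback loop described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the "model" is just a table of token probabilities, "training" means refitting that table on samples drawn from the previous generation, and the names `next_generation` and `simulate`, the Zipf-like starting distribution, and all the sizes are hypothetical choices made for illustration. It is not how any real LLM is trained. It does show the mechanism, though: once a rare token fails to appear in one generation's output, no later generation can ever produce it again.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Draw samples from the current model, then refit on those samples alone."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    samples = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(samples)
    # Tokens that were never sampled get no entry, i.e. probability zero forever.
    return {t: c / sample_size for t, c in counts.items()}


def simulate(vocab_size=50, sample_size=200, generations=30, seed=0):
    """Return the number of distinct surviving tokens after each generation."""
    rng = random.Random(seed)
    # Hypothetical Zipf-like start: token i has weight 1 / (i + 1).
    raw = [1.0 / (i + 1) for i in range(vocab_size)]
    total = sum(raw)
    probs = {i: w / total for i, w in enumerate(raw)}
    support = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support.append(len(probs))
    return support


if __name__ == "__main__":
    # The list can only stay flat or shrink from one generation to the next.
    print(simulate())
```

The count of surviving tokens never goes back up, because each generation can only emit what the previous one emitted. Which tokens vanish first is down to chance, but the rare ones are the most exposed.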
(Score: 2, Insightful) by Anonymous Coward on Sunday October 19, @06:09PM
Sounds just like humans, tell a lie often and forcefully enough, and they will believe, and kill anybody that doesn't
(Score: 4, Touché) by Thexalon on Sunday October 19, @04:42PM (8 children)
The AI bots have read the entire Internet's worth of data, and still aren't super-geniuses, so now we're going to have some AI slop generators invent data to be read by AI readers to make more AI generative slop, which can then be read by AI readers, which can then be used to generate more slop, ... And around and around and around we go, and at no point has anybody ever proven that this will accomplish anything.
That seems like a really useful purpose for $160 billion in the US alone.
(Score: 2) by looorg on Sunday October 19, @04:49PM (3 children)
Sounds about right. If they are running out of training data. Does that not mean that they have then consumed all of human knowledge? All the accessible human knowledge written down. And this is the best we are going to get? It's not really that impressive in that regard, or we are not that impressive to put things into perspective. Which is why I thought that this will be funtastic or utterly horrible as we will now see what happens when AI/LLMs basically decide to become the human centipede (don't look it up, once seen it can't be unseen) and eat their own output. It's going to be slop all the way down from here ...
(Score: 3, Insightful) by corey on Sunday October 19, @08:42PM
Well, all the human knowledge that’s written on the internet. That’s not much. All the good stuff is written in physical books. I suppose that some LLMs have had access to ebooks, given the lawsuits, too.
(Score: 0) by Anonymous Coward on Monday October 20, @12:41AM
If they are running out of training data it means they're on the wrong path. If you need zillions of data to learn stuff, you're not learning stuff, you're just memorizing possible answers and using statistics and heuristics to guess the right ones.
I bet you do not need a million samples of data to teach a smart[1] dog or even a crow the difference between a car and a bus. And when a new bus appears, if it's still buslike enough, it would be considered a bus.
[1] Some dogs are stupid, so let's exclude those for this example...
(Score: 2) by VLM on Monday October 20, @02:09PM
In all fairness, you look at something like a new uni grad, even a "Great Books Curriculum" higher ed grad, who has trained for years on the best books even with some human help along the way, and on average they're not that impressive. No surprise that's what we get from LLMs.
All human progress comes from very few people indeed. Groups can take credit for individual progress but all progress is individual.
(Score: 5, Informative) by Rosco P. Coltrane on Sunday October 19, @06:29PM
Humans are much more energy-efficient at being dumbasses: they only need to read a teeny tiny fraction of the shittiest parts of the internet.
(Score: 0) by Anonymous Coward on Sunday October 19, @10:45PM (2 children)
> The AI bots have read the entire Internet's worth of data ....
If the LLM developers really think that "the internet" is somehow equivalent to "human knowledge", then they deserve to fail, and fail hard. It's not just books (as someone else mentioned), it's huge amounts of common sense in experienced people, and a vast amount of trade secrets that most companies guard carefully.
The useful future of the current round of "AI" technology isn't in giant LLMs that are currently getting all the funding and press. The future is in domain-specific problem-solving tools that work alongside people. A recent presentation I saw used AlphaFold as an example: it has greatly sped up protein folding research, leading to all sorts of potentially useful and profitable results. A small company I just visited has been making their own little "AI" models, to help solve recurring problems that come up in their business.
(Score: 3, Interesting) by Thexalon on Monday October 20, @01:15AM
Machine learning has been around for a long time for all kinds of specialized tasks, and you're right that applied correctly as a technique it can be a really good one.
That's not what's being marketed as "AI" though. And that's not the promises being spouted by the likes of Sam Altman.
(Score: 1) by pTamok on Monday October 20, @08:31AM
There was a project, some time ago, decades before the AI hype of LLMs, where someone was trying to build a repository of 'common sense' by getting people to document such common sense in 'bite-size' chunks of text.
Ah, here we are: Doug Lenat's Cyc [wikipedia.org], started in 1984.
It's another approach to artificial thinking aids, and required heavy curation of its input - 'hand axiom writing'. It hasn't had 'overnight success', yet, and may never do. But it illustrates the difficulty of encoding 'common sense'. Furthermore, it still misses the point by a country mile: expecting knowledge to be completely accessible via a text-based medium is likely an insufficiently rich approach. Many people contrast 'book-learning' with 'learning by doing' - you can read all you like about how to ride a bicycle, but you need real world experience and practice to actually be able to do so. The same applies in many areas of knowledge, which are currently completely inaccessible to LLMs. Knowing how to ride a bicycle is a trivial existence proof of knowledge inaccessible to text-based approaches - but the point is that real world physical process-based knowledge is both important, and hard. My robot launderer is very good, but refuses to check the pockets for things that should not be washed, and will happily wash a non-colourfast item with my whites. It doesn't learn from experience.
There's more to intelligence than a comprehensive knowledge-base, and we have collectively spent a great deal of money and resources exhaustively demonstrating that.
(Score: 5, Informative) by mcgrew on Sunday October 19, @08:36PM
With the web tapped out, developers are turning to synthetic data
In non-marketspeak: With the web tapped out, developers are turning to FICTION. Just what you need in your next sales report, or medical diagnosis, or CIA data...
(Score: 2) by sonamchauhan on Monday October 20, @03:28AM
There is data there; they'd have to work at getting the artifacts:
- OCR'ed notebooks of schoolkids, from the Victorian era to now
- CIA, FBI and police casefiles
- hospital records
- personal diaries
- Alexa logs
- ICQ, AOL and BBS chat logs
- all proprietary software ever created (perhaps with source code decompiled from binary, but binary works unless encrypted)
- VLSI design files
- pyrolysed manuscripts from excavations at Pompeii. Decoded via MRI scans
- Nuclear and mitochondrial DNA scans of bacteria (human DNA ingested already)
- viral RNA
- protein structure databases
- historical star charts and astronomical observations
- ongoing ramblings at Soylentnews and similae (those too poorly formatted and whimsical to be generated by AI, such as mine)
- chess and Go game databases.
(Score: 2) by VLM on Monday October 20, @02:06PM
Hire people to generate data.
That'll be expensive compared to "steal all the historical data for free"
Here's an interesting analogy similar to some "why are there no space aliens here on earth yet?"
It's a natural ecological event for more creative content to be easily accessible online, until there's enough that LLMs generate; then, because it's expensive to generate content, there's a population crash where infinite LLM-generated, temporarily profitable spam floods the online until even the LLMs die off.
Kind of like you can take a barrel of water, and slowly add sugar, maybe for a VERY long time, but when yeast is either created or invented or evolved or introduced, BAM in a matter of days all that sugar is permanently gone and you get ethanol (later acetic acid vinegar if you also allow some oxygen). Then when the barrel is full of "rot" (assuming you don't like grain alcohol) the sugar stops being added because it's wasted and unusable.
That's the future of the internet.
The future being very unevenly distributed you can kind of see that now on legacy networks like Usenet. Visiting social spaces on the internet in 2030s and 2040s will probably feel much like trying to visit Usenet this century to socialize...