posted by jelizondo on Sunday October 19, @02:22PM   Printer-friendly
from the a-copy-of-a-copy-of-a-copy dept.

Artificial intelligence has seen a meteoric rise, but it's facing a shortage of training data:

"We've already run out of data," Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, said on the bank's "Exchanges" podcast published on Tuesday.

Raphael said that this shortage may already be influencing how new AI systems are built.

He pointed to China's DeepSeek as an example, saying one hypothesis for its purported development costs came from training on the outputs of existing models rather than entirely new data.

[...] With the web tapped out, developers are turning to synthetic data — machine-generated text, images, and code. That approach offers limitless supply, but also risks overwhelming models with low-quality output or AI slop.

However, Raphael said he doesn't think the lack of fresh data will be a massive constraint, in part because companies are sitting on untapped reserves of information.

Rick Beato talked about [15:29 --JE] how he broke ChatGPT with a simple question and exposed the gaps in AI's "knowledge" that are filled with synthetic data.

Related: The Real (Economic) AI Apocalypse is Nigh


Original Submission

Related Stories

The Real (Economic) AI Apocalypse is Nigh 54 comments

From Cory Doctorow's blog:

Like you, I'm sick to the back teeth of talking about AI. Like you, I keep getting dragged into discussions of AI. Unlike you, I spent the summer writing a book about why I'm sick of writing about AI, which Farrar, Straus and Giroux will publish in 2026.

A week ago, I turned that book into a speech, which I delivered as the annual Nordlander Memorial Lecture at Cornell, where I'm an AD White Professor-at-Large. This was my first-ever speech about AI and I wasn't sure how it would go over, but thankfully, it went great and sparked a lively Q&A. One of those questions came from a young man who said something like "So, you're saying a third of the stock market is tied up in seven AI companies that have no way to become profitable and that this is a bubble that's going to burst and take the whole economy with it?"

I said, "Yes, that's right."

He said, "OK, but what can we do about that?"

So I re-iterated the book's thesis: that the AI bubble is driven by monopolists who've conquered their markets and have no more growth potential, who are desperate to convince investors that they can continue to grow by moving into some other sector, e.g. "pivot to video," crypto, blockchain, NFTs, AI, and now "super-intelligence." Further: the topline growth that AI companies are selling comes from replacing most workers with AI, and re-tasking the surviving workers as AI babysitters ("humans in the loop"), which won't work. Finally: AI cannot do your job, but an AI salesman can 100% convince your boss to fire you and replace you with an AI that can't do your job, and when the bubble bursts, the money-hemorrhaging "foundation models" will be shut off and we'll lose the AI that can't do your job, and you will be long gone, retrained or retired or "discouraged" and out of the labor market, and no one will do your job. AI is the asbestos we are shoveling into the walls of our society and our descendants will be digging it out for generations:

The only thing (I said) that we can do about this is to puncture the AI bubble as soon as possible, to halt this before it progresses any further and to head off the accumulation of social and economic debt. To do that, we have to take aim at the material basis for the AI bubble (creating a growth story by claiming that defective AI can do your job).

This discussion was created by jelizondo (653) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 2) by mrpg on Sunday October 19, @02:40PM (4 children)

    by mrpg (5708) <mrpgNO@SPAMsoylentnews.org> on Sunday October 19, @02:40PM (#1421290) Homepage

A summary of the 15-minute video, please? 52!, OK, and then? Broke how? I only saw 3 minutes.

    • (Score: 3, Informative) by jelizondo on Sunday October 19, @03:56PM (3 children)

      by jelizondo (653) Subscriber Badge on Sunday October 19, @03:56PM (#1421300) Journal

      The video is quite interesting, much more than TFA.

There are two threads: 1) AI can't say a large number (52!); it can calculate it but just can't speak it out. 2) The guy asks about music (recording, mixing, arranging) and AI replies that it knows all about it because it has studied the greats in the business. The guy says it can't know what it is talking about, because there are no interviews or other materials on the Internet about how great sound engineers, composers, etc. do their work, so AI can't have studied it.
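      As an aside (my own, not from the video): 52! is just the number of orderings of a standard card deck, and any language with big integers computes it exactly; it's only reciting the verbatim digits that trips up an LLM.

      ```python
      # 52! is trivial for ordinary software, since Python integers
      # have arbitrary precision.
      import math

      n = math.factorial(52)
      print(n)            # a 68-digit integer starting 80658175...
      print(len(str(n)))  # 68
      ```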

      • (Score: 5, Insightful) by mcgrew on Sunday October 19, @08:41PM (2 children)

        by mcgrew (701) <publish@mcgrewbooks.com> on Sunday October 19, @08:41PM (#1421323) Homepage Journal

        The video is quite interesting

        I'm not a good video learner. People never speak correctly, have hard-to-understand accents, misspeak, and are basically shitty vocal communicators, while the written word is usually at least clear. Videos are for demonstrating motion. I neither need nor want talking heads; I'm literate and can read ten times as fast as you can talk unless you're an auctioneer, and who can understand THEM?

        --
        Mad at your neighbors? Join ICE, $50,000 signing bonus and a LICENSE TO MURDER!
        • (Score: 2) by jelizondo on Sunday October 19, @10:04PM (1 child)

          by jelizondo (653) Subscriber Badge on Sunday October 19, @10:04PM (#1421335) Journal

          I agree with you. When I want to learn something, I want to read it. Video just doesn't have that power: with text you can stop and think about what you just read, go back and re-read it, etc., without having to do anything other than move your eyes.

          What I meant by "interesting" is that the guy gives actual examples of AI's inability, and of its lies about subjects it can't really know anything about.

          • (Score: 0) by Anonymous Coward on Sunday October 19, @11:15PM

            by Anonymous Coward on Sunday October 19, @11:15PM (#1421339)
            For me video is better for stuff like learning how to open (and close) a particular model of laptop.

            Words just don't do it as well.

            As for reading, for youtube videos sometimes the transcript part is good enough to figure stuff out quickly, despite the transcription errors.
  • (Score: 5, Insightful) by Mojibake Tengu on Sunday October 19, @02:52PM (3 children)

    by Mojibake Tengu (8598) on Sunday October 19, @02:52PM (#1421292) Journal

    Obviously, if AIs keep reproducing themselves by a trivial copy-merge method like humans do, there is a high risk of generational degradation similar to that seen in the progeny of closely related humans, now forbidden in human cultures because of the socially observed collapse of intellectual capacity.

    The problem is not in the data, but in the method of its combination, which naturally amplifies divergences.

    I'd suggest a humble return to logical artificial intellects like Logic Programming, and keep generative LLMs only as toys.
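    As a rough illustration (my own sketch, not the poster's) of what the logic-programming style buys you, here's a tiny forward-chaining rule in Python: every conclusion is derived from explicit facts and rules, so the answer is traceable rather than statistical.

    ```python
    # Facts are triples; the rule derives new facts from existing ones,
    # Prolog-style: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

    def grandparent_rule(facts):
        """Derive ("grandparent", X, Z) from ("parent", X, Y) and ("parent", Y, Z)."""
        derived = set()
        for (p1, x, y1) in facts:
            for (p2, y2, z) in facts:
                if p1 == p2 == "parent" and y1 == y2:
                    derived.add(("grandparent", x, z))
        return derived

    print(grandparent_rule(facts))  # {('grandparent', 'alice', 'carol')}
    ```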

    Of course that means bringing up a next generation of programmers who understand logic. Rationalism is not possible with the current human population.

    --
    Rust programming language offends both my Intelligence and my Spirit.
    • (Score: 3, Touché) by mcgrew on Sunday October 19, @08:43PM

      by mcgrew (701) <publish@mcgrewbooks.com> on Sunday October 19, @08:43PM (#1421326) Homepage Journal

      The problem is not in the data, but in the method of its combination

      The problem is that AI cannot make anything out of nothing like humans can.

      --
      Mad at your neighbors? Join ICE, $50,000 signing bonus and a LICENSE TO MURDER!
    • (Score: 2, Interesting) by Anonymous Coward on Monday October 20, @12:34AM

      by Anonymous Coward on Monday October 20, @12:34AM (#1421354)

      there is a high risk of generational degradation similar to that seen in the progeny of closely related humans, now forbidden in human cultures because of the socially observed collapse of intellectual capacity.

      Inbreeding can strengthen traits:
      https://en.wikipedia.org/wiki/Ashkenazi_Jews [wikipedia.org]

      Ashkenazim distinctiveness as found in the Bray and co-authors study may come from their ethnic endogamy (ethnic inbreeding), which allowed them to "mine" their ancestral gene pool in the context of relative reproductive isolation from European neighbors, and not from clan endogamy (clan inbreeding).

      Though Ashkenazi Jews have never exceeded 3% of the American population, Jews account for 37% of the winners of the U.S. National Medal of Science, 25% of the American Nobel Prize winners in literature, and 40% of the American Nobel Prize winners in science and economics

      https://en.wikipedia.org/wiki/John_von_Neumann#Mathematical_quickness [wikipedia.org]

      Von Neumann's mathematical fluency, calculation speed, and general problem-solving ability were widely noted by his peers. Paul Halmos called his speed "awe-inspiring."[384] Lothar Wolfgang Nordheim described him as the "fastest mind I ever met".[385] Enrico Fermi told physicist Herbert L. Anderson: "You know, Herb, Johnny can do calculations in his head ten times as fast as I can! And I can do them ten times as fast as you can, Herb, so you can see how impressive Johnny is!"

      Nobel Laureate Hans Bethe said "I have sometimes wondered whether a brain like von Neumann's does not indicate a species superior to that of man".[29] Edward Teller observed "von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us."

      Some traits might be unwanted of course:
      https://en.wikipedia.org/wiki/Medical_genetics_of_Jews [wikipedia.org]

      There are several autosomal recessive genetic disorders that are more common than average in ethnically Jewish populations, particularly Ashkenazi Jews, because of relatively recent population bottlenecks and because of consanguineous marriage (marriage of second cousins or closer).[1] These two phenomena reduce genetic diversity and raise the chance that two parents will carry a mutation in the same gene and pass on both mutations to a child.

    • (Score: 2) by driverless on Monday October 20, @04:18AM

      by driverless (4770) on Monday October 20, @04:18AM (#1421387)

      This has been known for a while; it's called model collapse. We've already seen signs of it for a year or so now, just patched over or around by rerunning the query or tweaking the model.

  • (Score: 4, Informative) by looorg on Sunday October 19, @03:06PM (2 children)

    by looorg (578) on Sunday October 19, @03:06PM (#1421293)

    So AI is now going to eat its own output? Great. I hope they choke on it. This will be funtastic. 'Cause copies-of-copies-of-copies have never gone wrong before ... It's like animals that mate with siblings. Output not great. It should be easier than ever to spot them, because eventually they won't even sound, or write, like humans anymore. They copy themselves and won't be able to tell the difference.

    • (Score: 3, Funny) by Gaaark on Sunday October 19, @03:36PM

      by Gaaark (41) on Sunday October 19, @03:36PM (#1421298) Journal

      Output not great

      Yup...just look at Prince Charles! (HE'S NOT MUH KING!)... and Prince Andrew

      --
      --- Please remind me if I haven't been civil to you: I'm channeling MDC. I have always been here. ---Gaaark 2.0 --
    • (Score: 2) by hendrikboom on Tuesday October 21, @02:38PM

      by hendrikboom (1125) on Tuesday October 21, @02:38PM (#1421607) Homepage Journal

      So AI is now going to eat its own output?

      Isn't that just the generic advice to eat your own dog food?

      I imagine that it will cause them to do some real quality control,
      or else realise that the whole project has reached its limits and give up.

  • (Score: 5, Insightful) by pTamok on Sunday October 19, @03:45PM (2 children)

    by pTamok (3042) on Sunday October 19, @03:45PM (#1421299)

    There's a concept in statistics: Regression to(wards) the mean [wikipedia.org].

    Given that the output of LLMs is the statistically most likely following text from the prompt, feeding back results as input will reinforce the statistically most likely text - in other words, relatively rare text will become rarer; so what you will get is not necessarily the correct answers to questions, but the most frequently occurring answers, reinforced by repetition, which could well be wrong.
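    A toy simulation (my own sketch, assuming a crude "train on your own sharpened output" model, not anything from TFA) shows the effect: the most likely token swallows the distribution within a few generations.

    ```python
    # Each "generation" samples a corpus from the previous distribution,
    # mildly favoring already-likely tokens (squaring the weights stands
    # in for low-temperature decoding), then refits by counting.
    import random
    from collections import Counter

    random.seed(42)
    # Generation 0: one common token, several rare ones.
    dist = {"common": 0.6, "rare1": 0.1, "rare2": 0.1, "rare3": 0.1, "rare4": 0.1}

    for gen in range(5):
        tokens, weights = zip(*dist.items())
        sharpened = [w * w for w in weights]
        corpus = random.choices(tokens, weights=sharpened, k=1000)
        counts = Counter(corpus)
        dist = {t: counts[t] / 1000 for t in tokens if counts[t] > 0}
        print(gen, dict(sorted(dist.items())))
    # The rare tokens' mass shrinks every round and soon drops to zero.
    ```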

    • (Score: 2, Insightful) by Anonymous Coward on Sunday October 19, @06:09PM (1 child)

      by Anonymous Coward on Sunday October 19, @06:09PM (#1421313)

      Sounds just like humans: tell a lie often and forcefully enough, and they will believe it, and kill anybody who doesn't.

      • (Score: 2) by darkfeline on Friday October 24, @08:42AM

        by darkfeline (1030) on Friday October 24, @08:42AM (#1422008) Homepage

        I find it fascinating that as LLMs progress, all of the problems people point out are the same problems that humans exhibit.

        --
        Join the SDF Public Access UNIX System today!
  • (Score: 4, Touché) by Thexalon on Sunday October 19, @04:42PM (9 children)

    by Thexalon (636) on Sunday October 19, @04:42PM (#1421304)

    The AI bots have read the entire Internet's worth of data, and still aren't super-geniuses, so now we're going to have some AI slop generators invent data to be read by AI readers to make more AI generative slop, which can then be read by AI readers, which can then be used to generate more slop, ... And around and around and around we go, and at no point has anybody ever proven that this will accomplish anything.

    That seems like a really useful purpose for $160 billion in the US alone.

    --
    "Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
    • (Score: 2) by looorg on Sunday October 19, @04:49PM (3 children)

      by looorg (578) on Sunday October 19, @04:49PM (#1421305)

      Sounds about right. If they are running out of training data, does that not mean that they have consumed all of human knowledge? All the accessible human knowledge written down, anyway. And this is the best we are going to get? It's not really that impressive in that regard; or, to put things into perspective, we are not that impressive. Which is why I thought that this will be funtastic, or utterly horrible, as we will now see what happens when AI/LLMs basically decide to become the human centipede (don't look it up; once seen, it can't be unseen) and eat their own output. It's going to be slop all the way down from here ...

      • (Score: 3, Insightful) by corey on Sunday October 19, @08:42PM

        by corey (2202) on Sunday October 19, @08:42PM (#1421325)

        Well, all the human knowledge that’s written on the internet. That’s not much. All the good stuff is written in physical books. I suppose that some LLMs have had access to ebooks, given the lawsuits, too.

      • (Score: 0) by Anonymous Coward on Monday October 20, @12:41AM

        by Anonymous Coward on Monday October 20, @12:41AM (#1421355)

        If they are running out of training data. Does that not mean that they have then consumed all of human knowledge?

        If they are running out of training data, it means they're on the wrong path. If you need zillions of data points to learn stuff, you're not learning stuff; you're just memorizing possible answers and using statistics and heuristics to guess the right ones.

        I bet you do not need a million samples of data to teach a smart[1] dog or even a crow the difference between a car and a bus. And when a new bus appears, if it's still buslike enough, it would be considered a bus.

        [1] Some dogs are stupid, so let's exclude those for this example...

      • (Score: 2) by VLM on Monday October 20, @02:09PM

        by VLM (445) Subscriber Badge on Monday October 20, @02:09PM (#1421483)

        And this is the best we are going to get?

        In all fairness, look at something like a new uni grad, even a "Great Books Curriculum" higher-ed grad, who has trained for years on the best books, even with some human help along the way; on average they're not that impressive. No surprise that's what we get from LLMs.

        All human progress comes from very few people indeed. Groups can take credit for individual progress but all progress is individual.

    • (Score: 5, Informative) by Rosco P. Coltrane on Sunday October 19, @06:29PM

      by Rosco P. Coltrane (4757) on Sunday October 19, @06:29PM (#1421316)

      The AI bots have read the entire Internet's worth of data, and still aren't super-geniuses

      Humans are much more energy-efficient at being dumbasses: they only need to read a teeny tiny fraction of the shittiest parts of the internet.

    • (Score: 1, Interesting) by Anonymous Coward on Sunday October 19, @10:45PM (3 children)

      by Anonymous Coward on Sunday October 19, @10:45PM (#1421337)

      > The AI bots have read the entire Internet's worth of data ....

      If the LLM developers really think that "the internet" is somehow equivalent to "human knowledge", then they deserve to fail, and fail hard. It's not just books (as someone else mentioned), it's huge amounts of common sense in experienced people, and a vast amount of trade secrets that most companies guard carefully.

      The useful future of the current round of "AI" technology isn't in the giant LLMs that are currently getting all the funding and press. The future is in domain-specific problem-solving tools that work alongside people. A recent presentation I saw used AlphaFold as an example: it has greatly sped up protein folding research, leading to all sorts of potentially useful and profitable results. A small company I just visited has been making their own little "AI" models to help solve recurring problems that come up in their business.

      • (Score: 3, Interesting) by Thexalon on Monday October 20, @01:15AM

        by Thexalon (636) on Monday October 20, @01:15AM (#1421363)

        Machine learning has been around for a long time for all kinds of specialized tasks, and you're right that applied correctly as a technique it can be a really good one.

        That's not what's being marketed as "AI" though. And that's not the promises being spouted by the likes of Sam Altman.

        --
        "Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
      • (Score: 1) by pTamok on Monday October 20, @08:31AM (1 child)

        by pTamok (3042) on Monday October 20, @08:31AM (#1421425)

        There was a project, some time ago, decades before the AI hype of LLMs, where someone was trying to build a repository of 'common sense' by getting people to document such common sense in 'bite-size' chunks of text.

        Ah, here we are: Doug Lenat's Cyc [wikipedia.org], started in 1984.

        Cyc's ontology grew to about 100,000 terms in 1994, and as of 2017, it contained about 1,500,000 terms. The Cyc knowledge base involving ontological terms was largely created by hand axiom-writing; it was at about 1 million in 1994, and as of 2017, it was at about 24.5 million.

        Gary Marcus, a cognitive scientist and the cofounder of an AI company called Geometric Intelligence, said in 2016 that "it [Cyc] represents an approach that is very different from all the deep-learning stuff that has been in the news." This is consistent with Doug Lenat's position that "Sometimes the veneer of intelligence is not enough".

        It's another approach to artificial thinking aids, and required heavy curation of its input - 'hand axiom-writing'. It hasn't had 'overnight success' yet, and may never do so. But it illustrates the difficulty of encoding 'common sense'. Furthermore, it still misses the point by a country mile: expecting knowledge to be completely accessible via a text-based medium is likely an insufficiently rich approach. Many people contrast 'book-learning' with 'learning by doing' - you can read all you like about how to ride a bicycle, but you need real-world experience and practice to actually be able to do so. The same applies in many areas of knowledge, which are currently completely inaccessible to LLMs. Knowing how to ride a bicycle is a trivial existence proof of knowledge inaccessible to text-based approaches - but the point is that real-world physical, process-based knowledge is both important and hard. My robot launderer is very good, but refuses to check the pockets for things that should not be washed, and will happily wash a non-colourfast item with my whites. It doesn't learn from experience.

        There's more to intelligence than a comprehensive knowledge-base, and we have collectively spent a great deal of money and resources exhaustively demonstrating that.

        • (Score: 2) by hendrikboom on Tuesday October 21, @02:53PM

          by hendrikboom (1125) on Tuesday October 21, @02:53PM (#1421613) Homepage Journal

          There's an open version of Cyc, called OpenCyc.
          I've casually looked at it, as well as Wikidata. It looks as if it should be useful, but I find it difficult to know what to do with it.

  • (Score: 5, Informative) by mcgrew on Sunday October 19, @08:36PM

    by mcgrew (701) <publish@mcgrewbooks.com> on Sunday October 19, @08:36PM (#1421322) Homepage Journal

    With the web tapped out, developers are turning to synthetic data

    In non-marketspeak: with the web tapped out, developers are turning to FICTION. Just what you need in your next sales report, or medical diagnosis, or CIA data...

    --
    Mad at your neighbors? Join ICE, $50,000 signing bonus and a LICENSE TO MURDER!
  • (Score: 3, Interesting) by sonamchauhan on Monday October 20, @03:28AM

    by sonamchauhan (6546) on Monday October 20, @03:28AM (#1421381)

    There is data there; they'd have to work at getting the artifacts:

    - OCR'ed notebooks of schoolkids, from the Victorian era to now
    - CIA, FBI and police casefiles
    - hospital records
    - personal diaries
    - Alexa logs
    - ICQ, AOL and BBS chat logs
    - all proprietary software ever created (perhaps with source code decompiled from binary, but binary works unless encrypted)
    - VLSI design files
    - pyrolysed manuscripts from excavations at Pompeii. Decoded via MRI scans
    - Nuclear and mitochondrial DNA scans of bacteria (human DNA ingested already)
    - viral RNA
    - protein structure databases
    - historical star charts and astronomical observations
    - ongoing ramblings at SoylentNews and similar (those too poorly formatted and whimsical to be generated by AI, such as mine)
    - chess and Go game databases.

  • (Score: 2) by VLM on Monday October 20, @02:06PM

    by VLM (445) Subscriber Badge on Monday October 20, @02:06PM (#1421482)

    Hire people to generate data.

    That'll be expensive compared to "steal all the historical data for free"

    Here's an interesting analogy, similar to some answers to "why are there no space aliens here on earth yet?":

    It's a natural ecological event for creative content to be easily accessible online, until there's enough of it that LLMs generate; then, because it's expensive to generate content, there's a population crash where infinite LLM-generated, temporarily profitable spam floods the online world until even the LLMs die off.

    Kind of like how you can take a barrel of water and slowly add sugar, maybe for a VERY long time, but when yeast is created or introduced, BAM, in a matter of days all that sugar is permanently gone and you get ethanol (later acetic acid vinegar if you also allow some oxygen). Then, when the barrel is full of "rot" (assuming you don't like grain alcohol), the sugar stops being added because it's wasted and unusable.

    That's the future of the internet.

    The future being very unevenly distributed, you can kind of see that now on legacy networks like Usenet. Visiting social spaces on the internet in the 2030s and 2040s will probably feel much like trying to visit Usenet this century to socialize...
