
posted by hubie on Wednesday September 06 2023, @07:08AM   Printer-friendly

With hopes and fears about this technology running wild, it's time to agree on what it can and can't do:

When Taylor Webb played around with GPT-3 in early 2022, he was blown away by what OpenAI's large language model appeared to be able to do. Here was a neural network trained only to predict the next word in a block of text—a jumped-up autocomplete. And yet it gave correct answers to many of the abstract problems that Webb set for it—the kind of thing you'd find in an IQ test. "I was really shocked by its ability to solve these problems," he says. "It completely upended everything I would have predicted."

[...] Last month Webb and his colleagues published an article in Nature, in which they describe GPT-3's ability to pass a variety of tests devised to assess the use of analogy to solve problems (known as analogical reasoning). On some of those tests GPT-3 scored better than a group of undergrads. "Analogy is central to human reasoning," says Webb. "We think of it as being one of the major things that any kind of machine intelligence would need to demonstrate."

What Webb's research highlights is only the latest in a long string of remarkable tricks pulled off by large language models. [...]

And multiple researchers claim to have shown that large language models can pass tests designed to identify certain cognitive abilities in humans, from chain-of-thought reasoning (working through a problem step by step) to theory of mind (guessing what other people are thinking).

These kinds of results are feeding a hype machine predicting that these machines will soon come for white-collar jobs, replacing teachers, doctors, journalists, and lawyers. Geoffrey Hinton has called out GPT-4's apparent ability to string together thoughts as one reason he is now scared of the technology he helped create.

But there's a problem: there is little agreement on what those results really mean. Some people are dazzled by what they see as glimmers of human-like intelligence; others aren't convinced one bit.

"There are several critical issues with current evaluation techniques for large language models," says Natalie Shapira, a computer scientist at Bar-Ilan University in Ramat Gan, Israel. "It creates the illusion that they have greater capabilities than what truly exists."

That's why a growing number of researchers—computer scientists, cognitive scientists, neuroscientists, linguists—want to overhaul the way they are assessed, calling for more rigorous and exhaustive evaluation. Some think that the practice of scoring machines on human tests is wrongheaded, period, and should be ditched.

"People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI," says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. "The issue throughout has been what it means when you test a machine like this. It doesn't mean the same thing that it means for a human."

[...] "There is a long history of developing methods to test the human mind," says Laura Weidinger, a senior research scientist at Google DeepMind. "With large language models producing text that seems so human-like, it is tempting to assume that human psychology tests will be useful for evaluating them. But that's not true: human psychology tests rely on many assumptions that may not hold for large language models."

Webb is aware of the issues he waded into. "I share the sense that these are difficult questions," he says. He notes that despite scoring better than undergrads on certain tests, GPT-3 produced absurd results on others. For example, it failed a version of an analogical reasoning test about physical objects that developmental psychologists sometimes give to kids.

[...] A lot of these tests—questions and answers—are online, says Webb: "Many of them are almost certainly in GPT-3's and GPT-4's training data, so I think we really can't conclude much of anything."

[...] The performance of large language models is brittle. Among people, it is safe to assume that someone who scores well on a test would also do well on a similar test. That's not the case with large language models: a small tweak to a test can drop an A grade to an F.

"In general, AI evaluation has not been done in such a way as to allow us to actually understand what capabilities these models have," says Lucy Cheke, a psychologist at the University of Cambridge, UK. "It's perfectly reasonable to test how well a system does at a particular task, but it's not useful to take that task and make claims about general abilities."

[...] "The assumption that cognitive or academic tests designed for humans serve as accurate measures of LLM capability stems from a tendency to anthropomorphize models and align their evaluation with human standards," says Shapira. "This assumption is misguided."

[...] The trouble is that nobody knows exactly how large language models work. Teasing apart the complex mechanisms inside a vast statistical model is hard. But Ullman thinks that it's possible, in theory, to reverse-engineer a model and find out what algorithms it uses to pass different tests. "I could more easily see myself being convinced if someone developed a technique for figuring out what these things have actually learned," he says.

"I think that the fundamental problem is that we keep focusing on test results rather than how you pass the tests."

Original Submission

This discussion was created by hubie (1068) for logged-in users only, but has now been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Insightful) by pTamok on Wednesday September 06 2023, @11:22AM (5 children)

    by pTamok (3042) on Wednesday September 06 2023, @11:22AM (#1323367)

    In my experience so far, LLMs cannot hold a conversation in any depth on technical topics that I have knowledge of.

    By that I mean a conversation where you go increasingly in depth on a topic. There's a point at which they run out of expertise and either (at best) repeat themselves or start contradicting themselves. The outputs of single prompts ("Write a poem on the subject of Pre-Raphaelite painters"; "Summarise the arguments for and against quantitative easing"; "Write a story for children with a purple dragon and a teddy-bear") are very impressive. But going into detail and maintaining conceptual permanence over the course of a conversation seems beyond them.

    So far, I think they are interesting toys, brilliant (and certainly better than me) at generating certain forms of text, but absolutely not intelligent.

  • (Score: 4, Interesting) by VLM on Wednesday September 06 2023, @11:44AM (2 children)

    by VLM (445) on Wednesday September 06 2023, @11:44AM (#1323370)

    I would agree with that and suggest an experiment I've done: go into advanced math and computer science and talk "about" the topics, and it'll produce word salad with the best of them, chopping up and remixing the definitions. But it clearly doesn't know what any of it means.

    "Write a country western song explaining a bubble sort" - It'll do pretty well

    "Write a bubble sort in Python" - It's seen a million of them online, so it'll do pretty well; see also fizzbuzz

    But ask it to do anything complicated, where it can't more or less plagiarize something it has read or chop and remix what it has already seen, and it's done.
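    For reference, the bubble-sort case above is exactly the kind of exercise that appears countless times in online tutorials, which is why an LLM reproduces it so reliably. A minimal sketch of what such a canonical answer typically looks like (this is an illustrative version, not any particular model's output):

```python
def bubble_sort(items):
    """Sort a list in place by repeatedly swapping adjacent out-of-order pairs."""
    n = len(items)
    for i in range(n):
        swapped = False
        # After each pass, the largest remaining element has bubbled to the end,
        # so the last i positions are already in their final place.
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # no swaps means the list is already sorted; stop early
            break
    return items

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```

    The point of the thread stands: reproducing a pattern this well-worn demonstrates recall of training data, not understanding of why the algorithm works.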

    The other thing I've seen is that every piece of information technology that "can" do stuff is almost never run at even a fraction of its capacity by normal users. A graphic artist from 1960 could use MS Word to lay out documents, but it doesn't replace a graphic artist from 2023, because the average user has no idea what to do. Giving normies tools doesn't work; imagine tossing a tone-deaf person into a music store with a credit card and expecting an orchestra to walk out. Not going to happen. ALL information tech, including "AI", is like that.

    • (Score: 3, Interesting) by Freeman on Wednesday September 06 2023, @02:18PM (1 child)

      by Freeman (732) on Wednesday September 06 2023, @02:18PM (#1323429) Journal

      I would take it a step further. You can't trust LLMs to be right and you can't trust that LLMs will be wrong. Thus, you can't trust them at all. You can still make use of them, but you can't trust that anything they spout will be accurate.

      • (Score: 2) by VLM on Wednesday September 06 2023, @06:41PM

        by VLM (445) on Wednesday September 06 2023, @06:41PM (#1323484)

        Which makes the breathless financial estimates about AI very questionable. Best case scenario, MAYBE in some jobs the human is now the lead of a couple of front-line knowledge workers who are now AI. But the AI will fail often enough that you still mostly need human capacity and capability to do the job, so it's not going to take over main-line processes, but more likely optimization processes. And honestly, most jobs aren't open to optimization (I don't mean 'most' as in limited to IT, but 'most' as in hourly W2 employment).

        ... if McDonald's could replace their order takers with Alexa, they would have years ago ... So all the breathless claims about replacing generic office workers seem a bit optimistic.

  • (Score: 4, Interesting) by ikanreed on Wednesday September 06 2023, @02:51PM (1 child)

    by ikanreed (3164) Subscriber Badge on Wednesday September 06 2023, @02:51PM (#1323434) Journal

    LLMs understand exactly one thing: the relationships between words.

    This is how we encode a lot of our understanding of the world, but not all. When you really understand a concept, it often involves inferences that come from applying conceptual rules to break a complex question down into a set of simpler ones.

    So it can understand from prior reading that you need a space suit to breathe in space, and that a fish needs water to breathe, but "My fish died from suffocation in space, even though I put it in a space suit, why?" has a good chance of tripping it up, because it requires an inference of the actual mechanics of something in a context it hasn't seen.

    • (Score: 2) by ikanreed on Thursday September 07 2023, @04:00AM

      by ikanreed (3164) Subscriber Badge on Thursday September 07 2023, @04:00AM (#1323532) Journal

      Alright, I've checked my test case and the LLMs do okay with it. Better examples appear to be more... mathy, since math doesn't translate to language so well.