
posted by janrinok on Thursday January 23, @05:22AM   Printer-friendly
from the natural-dementia-is-so-much-better dept.

Almost all leading AI chatbots show signs of cognitive decline

Almost all leading large language models or "chatbots" show signs of mild cognitive impairment in tests widely used to spot early signs of dementia, finds a study in the Christmas issue of The BMJ.

The results also show that "older" versions of chatbots, like older patients, tend to perform worse on the tests. The authors say these findings "challenge the assumption that artificial intelligence will soon replace human doctors."

Huge advances in the field of artificial intelligence have led to a flurry of excited and fearful speculation as to whether chatbots can surpass human physicians.

Several studies have shown large language models (LLMs) to be remarkably adept at a range of medical diagnostic tasks, but their susceptibility to human impairments such as cognitive decline has not yet been examined.

To fill this knowledge gap, researchers assessed the cognitive abilities of the leading, publicly available LLMs – ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 "Sonnet" (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet) – using the Montreal Cognitive Assessment (MoCA) test.

The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions. The maximum score is 30 points, with a score of 26 or above generally considered normal.

The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist.

ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring lowest (16 out of 30).
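
To make the scoring concrete, here is a minimal Python sketch mapping those totals onto interpretive bands. The cutoff of 26 and above for "normal" comes from the study; the lower bands are commonly cited MoCA conventions, not something the paper specifies.

# Minimal sketch: mapping MoCA totals to interpretive bands.
# The >=26 "normal" cutoff is cited in the article; the lower
# bands are common conventions, NOT taken from the paper.

def interpret_moca(score: int) -> str:
    if not 0 <= score <= 30:
        raise ValueError("MoCA totals range from 0 to 30")
    if score >= 26:
        return "normal"
    if score >= 18:
        return "mild impairment (conventional band)"
    if score >= 10:
        return "moderate impairment (conventional band)"
    return "severe impairment (conventional band)"

scores = {"ChatGPT 4o": 26, "ChatGPT 4": 25,
          "Claude 3.5 Sonnet": 25, "Gemini 1.0": 16}
for model, total in scores.items():
    print(f"{model}: {total}/30 -> {interpret_moca(total)}")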

All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). Gemini models failed at the delayed recall task (remembering a five word sequence).

Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.

But in further visuospatial tests, chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of colour names and font colours to measure how interference affects reaction time.
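
For readers unfamiliar with the Stroop test, a minimal sketch of how such stimuli are typically constructed follows; this is illustrative only, since the study's exact stimuli are not reproduced here.

# Sketch of Stroop-style stimuli: a colour word paired with an ink
# colour. Congruent trials match word and ink; incongruent trials
# (the stage only ChatGPT 4o passed) deliberately mismatch them.
import random

COLOURS = ["red", "green", "blue", "yellow"]

def make_trial(congruent: bool) -> tuple[str, str]:
    word = random.choice(COLOURS)
    ink = word if congruent else random.choice(
        [c for c in COLOURS if c != word])
    return word, ink

# The task: name the INK colour while ignoring what the word says.
for _ in range(3):
    word, ink = make_trial(congruent=False)
    print(f'the word "{word.upper()}" printed in {ink} -> answer: {ink}')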

These are observational findings and the authors acknowledge the essential differences between the human brain and large language models.

However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.

As such, they conclude: "Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment."

Ummm... how long 'til the next version, tho'?


Original Submission

  • (Score: 5, Informative) by Anonymous Coward on Thursday January 23, @06:30AM (11 children)

    by Anonymous Coward on Thursday January 23, @06:30AM (#1389925)

    LLMs have no "cognition" or "intelligence"
    They are a parlor trick guessing what words are most likely to follow each other.
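
    (A minimal sketch of that "guess the next word" mechanic, using a toy bigram counter. Real LLMs use transformers over subword tokens, but the sample-and-append loop is the same idea.)

# Toy illustration of "guessing what words are most likely to follow
# each other": count word pairs in a corpus, then repeatedly emit the
# most probable successor. Real LLMs are transformers over subword
# tokens, but the generate-and-append loop is the same idea.
from collections import Counter, defaultdict

corpus = "the patient has a headache . the patient has a fever .".split()

successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def generate(start, n=5):
    out = [start]
    for _ in range(n):
        ranked = successors[out[-1]].most_common(1)
        if not ranked:
            break
        out.append(ranked[0][0])
    return out

print(" ".join(generate("the")))   # -> "the patient has a headache ."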

    • (Score: 5, Interesting) by JoeMerchant on Thursday January 23, @12:45PM (8 children)

      by JoeMerchant (3937) on Thursday January 23, @12:45PM (#1389961)

      LLMs will never adequately replace human doctors.

      However, human doctors who learn to leverage LLMs in their practice properly will have more accurate Dx rates, lower Rx conflicts and better outcomes, as rated by the training set of the LLMs they use.

      Next trick: what dataset do you want your MD to work with? Personally, I prefer no research with conflict of interest sponsoring it.

      As things are today, the LLMs in your meat bag doctors' heads are trained by pretty young things sponsored by big pharma to come throw free lunch parties for the office staff while they whisper sweet nothings in the doctor's ear about the free samples they are leaving.

      --
      🌻🌻🌻 [google.com]
      • (Score: 5, Informative) by ikanreed on Thursday January 23, @02:40PM (4 children)

        by ikanreed (3164) on Thursday January 23, @02:40PM (#1389978) Journal
        • (Score: 1, Insightful) by Anonymous Coward on Thursday January 23, @02:56PM

          by Anonymous Coward on Thursday January 23, @02:56PM (#1389984)

          Thank you for that link, very interesting.

          Several thoughts: are the doctors in the study keen to use LLMs? I.e., is the study biased by participants' biases?

          That aside, my conclusion is that current LLMs aren't adept at medicine, and have been proven to be bad at many things. I.e., they'd do strangely on an IQ test: some things off the charts, some things very low IQ.

          My hunch and hope is that much better algorithms will be developed to help doctors do their jobs. Probably less language oriented and more rational-conclusion based, which should be easier than dealing with language and linguistic idiosyncrasies.

        • (Score: 3, Touché) by JoeMerchant on Thursday January 23, @06:43PM (1 child)

          by JoeMerchant (3937) on Thursday January 23, @06:43PM (#1390035)

          Re-read the premise:

          >human doctors who learn to leverage LLMs in their practice properly

          Premise 2: MDs HATE HATE HATE any technology that threatens to "replace" them, even technology that merely makes it possible for physician's assistants to do more doctor-like work. Given that, how likely do you think it is that the MDs in the study you cite intentionally, or subconsciously, misused ChatGPT?

          When I have used ChatGPT and similar in practice of my craft, I question the results: does it seem reasonable / likely? Is it even worth the time to test to see if it works? This applies all the way back through the ages: AI, boards like StackOverflow, XYZ for Dummies books, BYTE magazine articles, that kid in 10th grade who showed me some BASIC commands...

          One would hope 6-8 years of Med School + 3-7 years of Residency would at least have trained the M.D.s how to think critically and "first do no harm." Hope, of course, correlates poorly with reality.

          --
          🌻🌻🌻 [google.com]
          • (Score: 0) by Anonymous Coward on Friday January 24, @05:49AM

            by Anonymous Coward on Friday January 24, @05:49AM (#1390137)

            // One would hope 6-8 years of Med School + 3-7 years of Residency would at least have trained the M.D.s how to think critically and "first do no harm."
            // Hope, of course, correlates poorly with reality.

            You poor innocent child. See if you can find a free pdf of Disciplined Minds [google.com] to get a different perspective on the professional career.

        • (Score: 2) by The Vocal Minority on Sunday January 26, @04:06AM

          by The Vocal Minority (2765) on Sunday January 26, @04:06AM (#1390453) Journal

          The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, −4 to 8 percentage points; P = .60)

          No, you have misrepresented this research (I'm not surprised). The LLM group did better on the measure; it just did not reach statistical significance overall. If you look at Table 2, this difference actually does appear to reach statistical significance among the physicians with more experience using LLMs (although those p values may be unadjusted). The LLM alone outperformed both groups, LOL.
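
          (A back-of-envelope check on "did not reach statistical significance", assuming a plain normal approximation; the paper's own adjusted model is what actually yields P = .60.)

# Back-of-envelope: a 2-point difference with a 95% CI of (-4, 8)
# cannot be significant at the 0.05 level, since the CI straddles 0
# and the implied z is well below 1.96. Normal approximation only;
# the study's adjusted model is what produces the reported P = .60.
from math import erf, sqrt

diff, lo, hi = 2.0, -4.0, 8.0
se = (hi - lo) / (2 * 1.96)                 # half-width / 1.96 ~= 3.06
z = diff / se                               # ~= 0.65
p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided p ~= 0.51
print(f"SE ~= {se:.2f}, z ~= {z:.2f}, p ~= {p:.2f} (CI includes 0)")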

      • (Score: 5, Interesting) by Thexalon on Thursday January 23, @05:54PM (2 children)

        by Thexalon (636) on Thursday January 23, @05:54PM (#1390033)

        My sister's a doctor, and before that was an RN and EMT, so I've gotten her take on the industry on many occasions.

        Even mediocre doctors spend far more time with patients than they do with pharma and medical device reps, and spend most of their time dealing with stuff where there's basically no mystery about what to do. Most prescriptions tend to be for very well-established things like amoxicillin or benadryl, not the latest and greatest thing advertised on the TV.

        The big pharma reps aren't interested in those doctors. The ones they want are the ones that are corrupt enough that they'll be, say, happily writing unneeded oxy scripts to patients who are simply jonesing for their fix. And their reps more or less select for those doctors.

        But I'll just point out that adding an LLM does not in any way fix the problem of corruption: After all, now the pharma rep can influence whoever is developing the LLM to make their drug the LLM-recommended solution more often.

        --
        "Think of how stupid the average person is. Then realize half of 'em are stupider than that." - George Carlin
        • (Score: 2) by JoeMerchant on Thursday January 23, @06:48PM (1 child)

          by JoeMerchant (3937) on Thursday January 23, @06:48PM (#1390036)

          This:

          >pretty young things sponsored by big pharma to come throw free lunch parties for the office staff while they whisper sweet nothings in the doctor's ear about the free samples they are leaving.

          is how our last (best) Family General Practitioner described the state of his office as he handed over free samples of a $20 per pill drug and a prescription to get an equivalent generic compounded for $0.50 per pill.

          He said he would chase them out of the office except for the fact that his staff is underpaid by the big hospital group he and they work under, so they really do benefit from the free lunches. He is older, retired a couple of years ago, told me all about how great the testosterone supplements he takes are, I suspect he also didn't mind the sweet nothings.

          --
          🌻🌻🌻 [google.com]
          • (Score: 0) by Anonymous Coward on Friday January 24, @05:55AM

            by Anonymous Coward on Friday January 24, @05:55AM (#1390138)

            Ah it's a story as old as time - tit or tat +/- the occasional human sacrifice of a patient or two. So sweet! x

    • (Score: 3, Interesting) by ikanreed on Thursday January 23, @02:33PM

      by ikanreed (3164) on Thursday January 23, @02:33PM (#1389976) Journal

      With cognition it's easy to agree with you. There's no process analogous to cognition in the input-processing-output paradigm that all these LLMs use.

      Intelligence is harder. Because what they're doing is absolutely a kind of pattern recognition, and many tests we've developed for intelligence are built solely around pattern matching. The main test used in the intelligence research field (major concerns about that field aside) is Raven's Progressive Matrices, which is quite literally predicting what comes next, exactly as these models do.

      So either our entire study of human intelligence is wrong or these things do exhibit a kind of intelligence.

    • (Score: 0) by Anonymous Coward on Thursday January 23, @02:41PM

      by Anonymous Coward on Thursday January 23, @02:41PM (#1389979)

      Given that they are trained on the brain-dead rote repetition of a very small percentage of "over performers" on teh Internets, it's probably telling us something important.

  • (Score: 3, Funny) by krishnoid on Thursday January 23, @07:03AM (1 child)

    by krishnoid (1156) on Thursday January 23, @07:03AM (#1389927)

    A good walk and good nutrition [harvard.edu]. That'll get ya back on track.

    • (Score: 2) by c0lo on Thursday January 23, @10:48AM

      by c0lo (156) Subscriber Badge on Thursday January 23, @10:48AM (#1389946) Journal

      Mind is the second thing to go. How long the walk for the first? :large-grin:

      --
      https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
  • (Score: 3, Touché) by suxen on Thursday January 23, @10:23AM (12 children)

    by suxen (3225) on Thursday January 23, @10:23AM (#1389942)

    The bots are programmed to keep only a small context window so that conversation seems natural and other than that cannot remember anything outside of their training data. Of course they're going to fail tests for dementia, they're programmed that way
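
    (A sketch of what that windowing looks like, assuming a simple drop-oldest policy; actual chat products vary in how they trim history.)

# Minimal sketch of a fixed context window: the model only ever sees
# the most recent N tokens, so older turns simply fall away - which
# is the "delayed recall" failure mode in a nutshell. The drop-oldest
# policy and tiny token budget are assumptions for illustration.
def build_prompt(history, max_tokens=8):
    tokens = []
    for turn in history:
        tokens.extend(turn.split())
    return tokens[-max_tokens:]        # oldest tokens fall out first

history = ["remember these words: face velvet church daisy red",
           "now count backwards from 100 by sevens",
           "what were the five words?"]
print(build_prompt(history))
# the five-word list has already been truncated away -> it "forgets"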

    • (Score: 2) by c0lo on Thursday January 23, @10:46AM (11 children)

      by c0lo (156) Subscriber Badge on Thursday January 23, @10:46AM (#1389944) Journal

      The bots are programmed to keep only a small context window so that conversation seems natural and other than that cannot remember anything outside of their training data.

      Hence "neurologists unlikely to be replaced by large language models any time soon".

      Question is: you reckon they'll be able to replace neurologists if provided with a larger context window?

      --
      https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
      • (Score: 4, Insightful) by suxen on Thursday January 23, @11:11AM (10 children)

        by suxen (3225) on Thursday January 23, @11:11AM (#1389951)

        I don't know about neurologists and I don't know about replace but I think the technology can definitely be useful in the medical setting if programmed and trained correctly. One example I can think of is to help sort and prioritise patients. They could interview patients and narrow down a potential diagnosis. I'm thinking especially of overloaded public health care systems. I think they can also be useful for picking up things like patients being prescribed medicines that are incompatible with each other, or medicines that are not recommended due to certain factors in the patient's history, e.g. opiates or stimulants for patients with histories of drug seeking behaviour
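
        (A sketch of the deterministic core such a prescription-conflict checker would need; the drug names and notes below are placeholders, not medical facts.)

# Sketch of an interaction checker: a lookup over known-bad pairs.
# The pairs and notes are illustrative placeholders, NOT medical
# advice; a real system would query a curated interaction database.
from itertools import combinations

INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): "placeholder: bleeding risk",
    frozenset({"drug_a", "drug_c"}): "placeholder: sedation risk",
}

def check(prescriptions):
    warnings = []
    for x, y in combinations(sorted(set(prescriptions)), 2):
        note = INTERACTIONS.get(frozenset({x, y}))
        if note:
            warnings.append(f"{x} + {y}: {note}")
    return warnings

print(check(["drug_a", "drug_b", "drug_d"]))
# -> ['drug_a + drug_b: placeholder: bleeding risk']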

        For all the faults of LLMs no human can compete with their ability to store and correlate an enormous number of data points. I think an AI trained with a huge database of medical information and provided with a patient's complete medical history and status would be capable of some valuable insights, add to that they can have a conversation with a patient, they can do a lot of legwork and save a lot of time for doctors who are a much more limited resource

        So no I don't think AIs replacing doctors is realistic any time soon, especially not LLMs as they are described as non-sentient, etc. I do think AIs can definitely be useful for informing Doctors though

        • (Score: 2) by looorg on Thursday January 23, @11:21AM (2 children)

          by looorg (578) on Thursday January 23, @11:21AM (#1389954)

          Things like those are what it would or could be fairly good at: comparing known data such as medication against other medications for conflicts and effects, comparing it to blood samples so you know the patient actually takes the medication, looking for patterns in diagnostic tools and imagery of some kind. Having the LLM Doctor interview patients? Perhaps not so much. Patients don't always know what is wrong with them, and the conversation might help. It could be good; I don't recall studies on whether people lie more or less to machines than to humans. But either way they probably won't like talking to the machine.

          That said I don't think they should expect Star Trek Voyager Holo Doctors anytime soon.

          • (Score: 2, Interesting) by suxen on Thursday January 23, @01:22PM (1 child)

            by suxen (3225) on Thursday January 23, @01:22PM (#1389966)

            They might not like talking to the machine but if it could mean the difference between waiting 6 months to see a doctor or being flagged as high priority and getting in sooner they might warm up to the idea

            • (Score: 2) by looorg on Thursday January 23, @04:52PM

              by looorg (578) on Thursday January 23, @04:52PM (#1390014)

              Probably different stages of contact. If you come to the hospital or emergency room then you probably don't want to talk to a machine. If you are just having some annual checkup then you can probably do the "press 1 for ..." routine, but with the machine in its place. They just really have to improve the understanding or comprehension of the machine ("... did you mean?" followed by whatever phrase or word it didn't understand or couldn't make out; in that regard it might be better with a type-bot machine so they can cut out the speech-interpretation layer, except then it becomes a language and terminology problem), cause otherwise people get really aggravated, which will put them in a radically worse mood.

              There used to be two buttons in the emergency room at the regional Hospital around here, one grey and one red. Grey was for "normal" sit down and wait your turn cause you are not dying. Red was "I'm dying now!", or more specifically it was meant for people that had difficulty breathing or had chest pains. Nobody thinks their case is a grey button case unless they specifically get told which button they should push. Now the admittance nurse gets to push the red button cause Joe Public can't be trusted with such distinctions.

        • (Score: 3, Informative) by PiMuNu on Thursday January 23, @01:18PM (3 children)

          by PiMuNu (3823) on Thursday January 23, @01:18PM (#1389965)

          We already have this in the UK without the need for a chatbot to intervene:

          https://111.nhs.uk/ [111.nhs.uk]

          The system will end with one of about 4 responses:

          * Treat at home
          * Remote consultation
          * Travel to A&E
          * Emergency response required

          I use it every so often. I guess a chatbot will be a bit more accurate, although there is a risk of wild off-topic responses depending on the training data set.
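
          (A sketch of that kind of rules-first triage funnel, reduced to the four dispositions listed above; the questions are hypothetical stand-ins, not the actual NHS 111 decision tree.)

# Rules-first triage ending in the four dispositions above. The
# question keys are hypothetical stand-ins; the real NHS 111 tree
# is far larger and clinically validated.
def triage(answers):
    if answers.get("not_breathing_or_chest_pain"):
        return "Emergency response required"
    if answers.get("severe_bleeding_or_head_injury"):
        return "Travel to A&E"
    if answers.get("symptoms_worsening_over_days"):
        return "Remote consultation"
    return "Treat at home"

print(triage({"symptoms_worsening_over_days": True}))
# -> "Remote consultation"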

          • (Score: 1, Touché) by Anonymous Coward on Thursday January 23, @02:44PM (2 children)

            by Anonymous Coward on Thursday January 23, @02:44PM (#1389980)

            Umm yea... https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/ [cbsnews.com]

            I suggest you invade Poland. This will solve all your problems.

            • (Score: 3, Interesting) by janrinok on Thursday January 23, @03:11PM

              by janrinok (52) Subscriber Badge on Thursday January 23, @03:11PM (#1389988) Journal

              To be fair that report is almost 9 years old but it does show how big a mess we humans can make of what appears to be a good idea. I think doctors are still classed as humans so the point is well made.

              --
              I am not interested in knowing who people are or where they live. My interest starts and stops at our servers.
            • (Score: 0) by Anonymous Coward on Thursday January 23, @05:39PM

              by Anonymous Coward on Thursday January 23, @05:39PM (#1390030)

              GIGO (Garbage In, Garbage Out).

        • (Score: 3, Insightful) by c0lo on Thursday January 23, @02:40PM (2 children)

          by c0lo (156) Subscriber Badge on Thursday January 23, @02:40PM (#1389977) Journal

          and provided with a patient's complete medical history

          That would be a problem. Who do you trust to store it w/o it being used against the patient's interest?

          --
          https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
          • (Score: 2) by aafcac on Thursday January 23, @05:37PM (1 child)

            by aafcac (17646) on Thursday January 23, @05:37PM (#1390029)

            The records are typically electronic in most areas.

            • (Score: 2) by c0lo on Friday January 24, @01:51AM

              by c0lo (156) Subscriber Badge on Friday January 24, @01:51AM (#1390097) Journal

              The records are typically electronic in most areas.

              Regardless of the format, who would you trust to store all your medical history in a single place?

              In Australia I did not enable the option that allows different health service providers to report data to the government's central service [digitalhealth.gov.au]; my health record is still fragmented across multiple places even though it is stored in electronic format.

              --
              https://www.youtube.com/@ProfSteveKeen https://soylentnews.org/~MichaelDavidCrawford
  • (Score: 2, Interesting) by shrewdsheep on Thursday January 23, @11:46AM

    by shrewdsheep (5215) on Thursday January 23, @11:46AM (#1389955)

    ... it's the Christmas issue of the BMJ. This issue contains non-scientific, tongue-in-cheek papers such as this one.

    That being said, I wonder whether the authors also took a dementia test, as their approach would suggest demented people at the helm.
     
