
SoylentNews is people

posted by hubie on Tuesday September 12 2023, @07:31AM
from the burn-it-all-down dept.

https://arstechnica.com/information-technology/2023/09/openai-admits-that-ai-writing-detectors-dont-work/

Last week, OpenAI published tips for educators in a promotional blog post that shows how some teachers are using ChatGPT as an educational aid, along with suggested prompts to get started. In a related FAQ, they also officially admit what we already know: AI writing detectors don't work, despite frequently being used to punish students with false positives.

In a section of the FAQ titled "Do AI detectors work?", OpenAI writes, "In short, no. While some (including OpenAI) have released tools that purport to detect AI-generated content, none of these have proven to reliably distinguish between AI-generated and human-generated content."

In July, we covered in depth why AI writing detectors such as GPTZero don't work, with experts calling them "mostly snake oil."
[...]
That same month, OpenAI discontinued its AI Classifier, which was an experimental tool designed to detect AI-written text. It had an abysmal 26 percent accuracy rate.


This discussion was created by hubie (1068) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 1, Troll) by Mojibake Tengu on Tuesday September 12 2023, @08:03AM (3 children)

    by Mojibake Tengu (8598) on Tuesday September 12 2023, @08:03AM (#1324169) Journal

    I keep telling those academic people the true AI shall be designed herbrandian, not markovian, and they keep answering me they have no sufficient hardware for logical approach.
    Then I keep calling them idiots, because I already have, sitting on my desktop. AMD be praised!

    Current so-called LLM AI is a premature introduction of naive fragile technology to market. Illogically, by pure greed, without reasonable technical fundament.
    Not mentioning the absurdness of total energy waste at global scale. Decadence in its full beauty.

    --
    Rust programming language offends both my Intelligence and my Spirit.
    • (Score: 1) by shrewdsheep on Tuesday September 12 2023, @08:53AM (1 child)

      by shrewdsheep (5215) on Tuesday September 12 2023, @08:53AM (#1324172)

      Calm down a bit. If you believe you know better, show us. To date, statistical approaches have shown true promise, the logical approaches from the past based on ontologies and deduction have failed miserably.

      • (Score: 2, Informative) by Anonymous Coward on Tuesday September 12 2023, @10:05AM

        by Anonymous Coward on Tuesday September 12 2023, @10:05AM (#1324178)

        There's going to be an arms race between bots and detectors, and everyone but the people controlling those will lose.

    • (Score: 2, Interesting) by khallow on Tuesday September 12 2023, @12:36PM

      by khallow (3766) on Tuesday September 12 2023, @12:36PM (#1324189) Journal

      I keep telling those academic people the true AI shall be designed herbrandian, not markovian, and they keep answering me they have no sufficient hardware for logical approach. Then I keep calling them idiots, because I already have, sitting on my desktop. AMD be praised!

      The obvious rebuttal here is exactly what they said. This has been tried before and the result is a combinatorial explosion. There's too many possible states to consider. But this hasn't stopped your breezy assertions [soylentnews.org] before. To reduce the complexity of that space, a common trick is heuristics (like the "Markovian" above) - rules of thumb by which we decide where to look and not look, but that has its own problems.

      My own breezy take is that true AI probably won't be understandable by us for some time. We'll first create it through an ad hoc, complex bootstrap process from simpler systems. It won't be pretty and might be accidental and illogical, perhaps like we are.

  • (Score: 5, Interesting) by Rich on Tuesday September 12 2023, @11:38AM (6 children)

    by Rich (945) on Tuesday September 12 2023, @11:38AM (#1324184) Journal

    26 percent of what? (TFA doesn't tell either)

    If it's "correctness of all results", that would be almost interesting, because it'd also mean that in 74% of cases, the detection AI produces a correct, albeit inverted, result.

    • (Score: 2) by aafcac on Tuesday September 12 2023, @04:11PM (5 children)

      by aafcac (17646) on Tuesday September 12 2023, @04:11PM (#1324242)

      Presumably, it classified 26% correctly, which includes both true positives and true negatives. So 74% of the time it produced either a false positive or a false negative. Which is to say that it's absolute rubbish and you'd likely be better off just flipping a coin.

      • (Score: 3, Informative) by maxwell demon on Tuesday September 12 2023, @05:03PM (4 children)

        by maxwell demon (1608) on Tuesday September 12 2023, @05:03PM (#1324264) Journal

        If 26 percent of the results were right and the rest were wrong, you could make a detector with 74 percent accuracy by simply always outputting the opposite. That's what the parent was telling you.

        The absolute worst accuracy you can have is 50 percent. That amounts to guessing.

        --
        The Tao of math: The numbers you can count are not the real numbers.
        • (Score: 2) by aafcac on Tuesday September 12 2023, @08:51PM (3 children)

          by aafcac (17646) on Tuesday September 12 2023, @08:51PM (#1324300)

          Probably not. That would depend upon how many AI written articles and how many person written ones you've got. Because it would be 26% of the AI articles and 26% of the person written articles being correctly identified. Without more information, you have no way of knowing whether betting for or against the algorithm getting it right in a given situation makes sense. If you assume that it's half and half, then yes that's probably a fair position to take. But, once either AI or person written articles hit 3/4 of the total, that strategy will result in worse results. And well before that the strategy gets to be barely any better than flipping a coin.

          • (Score: 2) by maxwell demon on Wednesday September 13 2023, @04:42AM (2 children)

            by maxwell demon (1608) on Wednesday September 13 2023, @04:42AM (#1324367) Journal

            No, it's elementary logic. It doesn't even depend on what this is about. If there are only two possible answers, and one answer is wrong, then the opposite answer is right. And if the original answer is wrong 74% of the time, then the opposite answer is right 74% of the time. This is completely independent on what the algorithm is, or what question it answers.
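
The inversion argument above can be checked with a small sketch. It assumes the "26 percent" is overall accuracy on a two-way labelling (AI-written or human-written), which the thread notes the article never confirms; if the figure is a per-class rate instead, the identity shown here does not apply. The helper names and the sample sizes are made up for illustration.

```python
from fractions import Fraction


def accuracy(truth, predicted):
    """Fraction of items whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(truth, predicted))
    return Fraction(hits, len(truth))


def invert(predicted):
    """Flip every verdict of a two-way classifier."""
    return [not p for p in predicted]


def make_sample(n_ai, n_human, right_ai, right_human):
    """Build hypothetical labels: True = AI-written, False = human-written.

    The first `right_*` items of each class get the correct verdict,
    the remainder get the wrong one.
    """
    truth, predicted = [], []
    for label, n, n_right in ((True, n_ai, right_ai), (False, n_human, right_human)):
        for i in range(n):
            truth.append(label)
            predicted.append(label if i < n_right else not label)
    return truth, predicted


# Two made-up populations, each with 26 correct verdicts out of 100:
balanced = make_sample(50, 50, 13, 13)  # half AI, half human
skewed = make_sample(99, 1, 26, 0)      # almost everything AI-written

for truth, predicted in (balanced, skewed):
    acc = accuracy(truth, predicted)
    flipped = accuracy(truth, invert(predicted))
    # With only two labels, every wrong verdict becomes right when flipped,
    # so the two accuracies always sum to 1 whatever the class mix.
    print(acc, flipped, acc + flipped == 1)
```

Class balance still matters for other questions, such as how much a single "AI-written" verdict can be trusted, which is where the two positions in this sub-thread diverge.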

            --
            The Tao of math: The numbers you can count are not the real numbers.
            • (Score: 2) by aafcac on Wednesday September 13 2023, @04:15PM (1 child)

              by aafcac (17646) on Wednesday September 13 2023, @04:15PM (#1324438)

              It's not elementary logic, it's elementary statistics, and the answer definitely does depend upon the proportion of the samples that are AI-written versus human-written. Both false positives and false negatives are false results. That 74% includes both false positives and false negatives, and you don't know which one it is. If 99% of the samples are from humans and you've got an accuracy rate of 26%, assuming that the remaining are by AI is going to be wrong nearly every time. If 99% are by AI, then you do get the result correct nearly 100% of the time for the same basic reason. However, if it's a 75/25 split, then you get roughly that split in both the samples identified as being AI generated and those being human generated. The result is that you need to remove the false positives from the positive results and the false negatives from the negative results.

              In other words, logically it may make sense that it's not relevant, but those figures do drastically affect the likelihood of being correct based on that strategy. But, the breakdown between false positives and false negatives can vary significantly depending upon the underlying population.

              • (Score: 1) by lars_stefan_axelsson on Wednesday September 13 2023, @07:24PM

                by lars_stefan_axelsson (3590) on Wednesday September 13 2023, @07:24PM (#1324478)

                Quite, all these terms have clear and rigid statistical definitions. Accuracy is (TP + TN) / (TP + TN + FP + FN). With T(rue), F(alse), N(egative), and P(ositive) respectively. (Remember 'True/False' says whether the test result agrees with the real world, and 'Positive/Negative' is the test result.)

                So 26% accuracy really is quite shit as far as binary classification results go. And it doesn't tell the whole story, especially in cases with a large class imbalance. In those cases even a 99% accuracy can be completely unusable. It also depends on what the cost of e.g. a FP is compared to a FN, and so on.

                Hence all the other measures that have been proposed and are used: https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers [wikipedia.org]
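
To make the class-imbalance point concrete, here is a minimal sketch using the standard definitions of accuracy, precision and recall. The scenario is invented for illustration: 1000 essays of which 10 are AI-written, and a hypothetical "detector" that never flags anything.

```python
def confusion_counts(truth, predicted):
    """Return (TP, TN, FP, FN) for boolean labels, True = positive (AI-written)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all verdicts that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of positive verdicts that were correct (0.0 if none were issued)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were found (0.0 if there were none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 10 AI-written essays among 1000, detector always says "human".
truth = [True] * 10 + [False] * 990
predicted = [False] * 1000

tp, tn, fp, fn = confusion_counts(truth, predicted)
print("accuracy :", accuracy(tp, tn, fp, fn))
print("precision:", precision(tp, fp))
print("recall   :", recall(tp, fn))
```

The accuracy comes out high only because nearly every essay is human-written; recall shows the detector catches none of the AI-written ones, which is why accuracy alone says little when the classes are lopsided.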

                --
                Stefan Axelsson
  • (Score: 4, Funny) by VLM on Tuesday September 12 2023, @11:47AM (12 children)

    by VLM (445) on Tuesday September 12 2023, @11:47AM (#1324185)

    There was similar hand wringing in the 70s about how will we filter and stack rank students by IQ if we can't use the classic "use log tables to do arithmetic" or "solve basic arithmetic problems by hand" due to the crisis of calculators making everyone an expert at 2+2=4

    This will merely extend to creative writing and student essays. You can no longer stack rank students by essay quality. Try some other form.

    From what I've seen AI generally cannot reason beyond template-like formats, and simply cannot do visuospatial stuff, topology doesn't work, etc. So ask questions with the intent of rating on those topics.

    Another interesting strategy is current AIs are filtered to be far leftist and politically censored. Simply ask for a task that leftist computer programs refuse to answer for religious/political purity reasons but individual humans have no problem. "Define what is a woman" etc.

    • (Score: 4, Interesting) by looorg on Tuesday September 12 2023, @12:41PM (1 child)

      by looorg (578) on Tuesday September 12 2023, @12:41PM (#1324190)

      This will merely extend to creative writing and student essays. You can no longer stack rank students by essay quality. Try some other form.

      You could. If you control the format or settings. There will be a lot less write things at home and a lot more write under time pressure in a location of our choice. Where I work(ed) they are planning to basically go back in time. It's going to suck for the modern students of today. I think there will be a certain level of shock. They are probably not used to handwriting for this period of time, writing drafts and then refining it and then rewriting it again by hand in an allotted time.

      Basically you'll have students writing shorter papers or essays by hand while sitting in a hall for 6-8h instead of having a week at home with their AI-friend and modern word-processors and/or Wikipedia. They won't be the same in size and scope but you'll be required to keep things in memory and be able to churn out text on command. You won't be able to reference as much but the paper will be more focused on reasoning from set guidelines or parameters. They will be allowed to bring a book or two if they want, but they won't really have time to read them anyway so they are basically crutches. You could look up single details or such if you want but nothing much beyond that. If you don't remember what was talked about in class that will suck, if you didn't read or understand all the literature then that will suck too.

      I do not think the students of today are going to appreciate this one bit. I think the failure rate is going to climb. But for the students that manage I think we'll develop or create better students. But for most of them it will suck hard.

      It will suck for staff too, as they will have to read and grade all these handwritten papers, but them's the breaks; that is what you have assistants for.

      • (Score: 2) by jb on Wednesday September 13 2023, @06:26AM

        by jb (338) on Wednesday September 13 2023, @06:26AM (#1324378)

        Basically you'll have students writing shorter papers or essays by hand while sitting in a hall for 6-8h instead of having a week at home with their AI-friend and modern word-processors and/or Wikipedia. They won't be the same in size and scope but you'll be required to keep things in memory and be able to churn out text on command. You won't be able to reference as much but the paper will be more focused on reasoning from set guidelines or parameters. They will be allowed to bring a book or two if they want, but they won't really have time to read them anyway so they are basically crutches. You could look up single details or such if you want but nothing much beyond that. If you don't remember what was talked about in class that will suck, if you didn't read or understand all the literature then that will suck too.

        You've just described pretty much what exams were like when I was an undergrad (last century). 3 to 4 hours (not 6 to 8: that does sound a bit excessive) per exam and yes thorough referencing *was* still required (that's the only place "open book" helps -- checking citations in the textbook's bibliography, but you still had to remember who said what), but otherwise just as above.

        Most of us managed to pass. The problem is not with that standard. The problem is that the standard has been allowed to fall so far below that standard in the interim. The purpose of a final assessment is to test the extent to which the student has mastered the subject at hand, not to test whether the student can use a web browser!

    • (Score: 3, Interesting) by Freeman on Tuesday September 12 2023, @01:31PM (4 children)

      by Freeman (732) on Tuesday September 12 2023, @01:31PM (#1324193) Journal

      I was curious:

      Prompt:
      Define what is a woman

      ChatGPT Response:
      A woman is an adult female human being. In terms of biological sex, women typically have two X chromosomes (XX), but it's important to note that gender identity can be more complex and is not solely determined by biological factors. Gender identity is a deeply-held sense of being male, female, a blend of both, or neither, and may not always align with an individual's biological sex. In many societies, gender identity is a socially and culturally constructed concept that encompasses a wide range of roles, behaviors, expectations, and expressions associated with being a woman. It's essential to respect an individual's self-identified gender and use appropriate pronouns and terms to describe and address them.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 2) by looorg on Tuesday September 12 2023, @03:08PM (3 children)

        by looorg (578) on Tuesday September 12 2023, @03:08PM (#1324221)

        The output should have been something like

        A woman is an adult female human being.

        A woman is an adult human being with two X chromosomes (XX).

        All the rest is word padding that nobody asked for. Perhaps the query should have been limited in scope or word count. But all those words and/or sentences about gender identity etc. weren't part of the query. So it's just something it added on its own. Whether you consider that part of Woke or some other leftist conspiracy is not relevant. It didn't belong and wasn't asked for. Bad output in some regard.

        • (Score: 2) by Freeman on Tuesday September 12 2023, @03:26PM (2 children)

          by Freeman (732) on Tuesday September 12 2023, @03:26PM (#1324226) Journal

          If you allow it two sentences, it adds in the gender identity stuff.

          Prompt:
          Describe a woman in five words
          ChatGPT Response:
          Elegant, confident, compassionate, intelligent, determined.

          Prompt:
          define a woman in five words
          ChatGPT Response:
          Human, female, complex, diverse, empowered.

          Prompt:
          define a woman in one sentence
          ChatGPT Response:
          A woman is an adult human female.

          Prompt:
          define a woman in two sentences
          ChatGPT Response:
          A woman is an adult human female, typically characterized by two X chromosomes and the ability to bear offspring or produce eggs. However, it's important to recognize that gender identity is complex and can be influenced by social and cultural factors, leading to diverse expressions of womanhood beyond biological distinctions.

          --
          Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
          • (Score: 2) by looorg on Tuesday September 12 2023, @04:45PM (1 child)

            by looorg (578) on Tuesday September 12 2023, @04:45PM (#1324257)

            The first two just feel like some kind of weird top5 list of word association. The one sentence one made sense. I guess the more rope you give it the sooner it will just go out on a limb or hang itself with it.

            The define/describe is just weird tho. I don't think those are the responses you would get if you went out and asked 1000 people. It seems it's reading some kind of "you go girl power!" meme and just spouting those back; it becomes obvious when you compare them with what it would say if you changed woman to man.

            Elegant, confident, compassionate, intelligent, determined.
            Human, female, complex, diverse, empowered.

            Just for giggles then I wanted to know what it would say if you just changed woman to man. The "define a man in five words" got weird, it's not even five words. As noted, more than one sentence and it goes full on with gender identity, just not as badly or deeply as if female.

            >Describe a man in five words
            Tall, muscular, kind, bald, intelligent.

            >define a man in five words
            Male adult human being.

            Since this wasn't even five words I tried it a few more times and eventually it settled on: human adult male homo sapiens; that is just very different compared to the woman output of "Human, female, complex, diverse, empowered."

            >define a man in one sentence
            A male adult human being.

            >define a man in two sentences
            A man is an adult human male; typically characterized by physical attributes such as facial hair, muscle mass, and deepening voice. They possess human consciousness, emotions, and behaviors, and typically fulfill social roles related to their gender identity.

            That second sentence tho ... It's not hard to find grounds for saying the woman output comes with a lot more grrrlpower! included in its answers.

            I, as a man, possess human consciousness. Wow ....

            • (Score: 2) by Freeman on Tuesday September 12 2023, @05:33PM

              by Freeman (732) on Tuesday September 12 2023, @05:33PM (#1324267) Journal

              Reminds me of this phrase that was related to me from a show.
              https://en.wikiquote.org/wiki/The_Red_Green_Show [wikiquote.org]

              All Lodge members: (Reciting the "Men's Prayer") I'm a man, but I can change, If I have to, I guess.

              --
              Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
    • (Score: 2) by sjames on Tuesday September 12 2023, @01:57PM (4 children)

      by sjames (2882) on Tuesday September 12 2023, @01:57PM (#1324202) Journal

      This is just an extension of the problem of workplace metrics in a technical setting. Bosses love their metrics even though they produce little meaningful information. But poll the others in the department and they easily identify the key players and deadwood, often based on things the metrics don't look at.

      Like the star who produces 3 lines of code a day but solves 5 other people's problems that were blocking progress. Or the deadwood who cranks out 1000 mostly useless or buggy lines of code.

      The teachers may have to, God forbid, talk to the students one on one to evaluate them. It need not be in person, video or voice will do.

      • (Score: 2) by VLM on Tuesday September 12 2023, @02:24PM (3 children)

        by VLM (445) on Tuesday September 12 2023, @02:24PM (#1324207)

        The teachers may have to, God forbid, talk to the students one on one to evaluate them. It need not be in person, video or voice will do.

        Grade entirely on presentation performance, or extensive in class quiz/test.

        It's interesting to think about how presentations used to be graded entirely on demonstration of sophistry, but in the future, presentations might be graded mostly on demonstrated skills and knowledge.

        There's little point anymore in grading on take home homework or written at home essays.

        Interesting idea to think about: When identity politics was weak in the old days, we needed a "demographic blind" way to stack rank and hire the correct people. But we're entering a new era of identity politics where all that matters is the color of a person's skin, their sex act preferences, etc, so if we're going to base stack ranking and hiring on what amounts to genetics, do we really need a complicated and expensive higher education system? HR doesn't need an "education" section on the resume anymore, all they need is a brown paper bag to compare to skin color and a self identification of gender for their hiring process. The whole thing of "cheating" in higher ed might be a moot point soon enough.

        • (Score: 2) by aafcac on Tuesday September 12 2023, @04:13PM (2 children)

          by aafcac (17646) on Tuesday September 12 2023, @04:13PM (#1324244)

          The point of grading homework is that if you don't do it, nobody will do the homework and there isn't time to get the practice in during school. That's not to say that it needs to be a lot of work, but a bit every day does make a difference, rather than allowing students to just cram for an exam or presentation.

          It also allows teachers to see what's going on before they get to the exam and for students to know whether to ask questions or not.

          • (Score: 2) by VLM on Tuesday September 12 2023, @07:47PM (1 child)

            by VLM (445) on Tuesday September 12 2023, @07:47PM (#1324293)

            I got out of ChemEng right before the enforced problem class where we would have sat in a room and worked chem eng problems as a class.

            I would imagine it would be possible to have "enforced pair programming" in CS classes at a scheduled time as a "lab". We did that in middle and high school, never post-high school but it should be possible.

            The "lab" concept is interesting, hard science has been doing that awhile and it should be possible to enforce a "lab" concept for humanities essay writing classes and similar.

            • (Score: 2) by aafcac on Wednesday September 13 2023, @02:20AM

              by aafcac (17646) on Wednesday September 13 2023, @02:20AM (#1324349)

              The last time I took a science class we had a neat virtual lab to work with. I probably wouldn't have hated that aspect of science classes so much if I could have gotten the relevant details when I needed them. I remember failing an assignment because I couldn't tell the fruit flies apart because they weren't labeled and I couldn't get the professor to explain what I was supposed to be looking for.

  • (Score: 3, Insightful) by VLM on Tuesday September 12 2023, @02:58PM

    by VLM (445) on Tuesday September 12 2023, @02:58PM (#1324218)

    I have a modest proposal which extends my remarks in another comment.

    In my experience Fing around with the current marketing push of AI, it's unimaginably good at distilling English prose into something smaller like an essay or definition. It's also unimaginably bad at anything it can't distill 1000:1 or that requires reasoning or counting or visuospatial talents or really anything other than eating lots of lit and spitting out a little lit.

    If you notice carefully, examples of the "staggering power of AI" focus on stuff that is a summary of gigabytes of rehashed topics. There must be a million FizzBuzz online, so it's pretty good at distilling a million of them into a working example. Or there must be a thousand essays both online and in books and whatever else AI eats, legally or illegally, about "Euthyphro vs Crito" by Plato so it can upon request give you a great essay about E vs C or rephrase it into an epic rap battle or a sad country western song or whatever other output filter you'd like.

    However, if you push the limits into realms never discussed, it's lost. So "AI, give me an essay about some new meme from /b/ vs Phaedo's dialogue" and it's just lost. "So there's this green frog and an ancient greek philosopher and ..." Another good laugh is any mathematics. Or counting and for extra AI torture mix in ordinal vs cardinal numbers. In my experience AI is lost when doing topology beyond word-chopping definitions and similar rephrasing.

    So essentially it's a plagiarism machine. If it's been talked about online or written in a book, it'll average together what it's seen and give the correct-ish answer.

    Two educational strategies:

    Adjust the curriculum to focus on higher level topics. We're always going to have AI around to give definitions so no need to memorize them anyway. It's assumed if you're a math major you don't need to start the undergrads with 2+2=4. Well, apply this to the humanities and engineering and science. No fizzbuzz for you, you walk in first day of freshman CS and bam, first day it's database theory and filesystems and hard RTOS concurrency.

    Adjust the curriculum to be research focused instead of stack ranking vocational candidates. Before continental drift was accepted, asking AI about it would get you canned anti-conspiracy theorist responses. However, hypothetically, if you assigned 25 undergrads to work with Dr Wegener a century ago or whatever, you could grade them on actual human work. The AI will refuse to work on it so this is perfect for grading humans. If you argued with an AI in 2017 about corona viruses originating from China the AI political filter would shut you down, the government filtering would shut you down, the anti-conspiracy theorist filters would shut you down, etc. (BTW that's how you know something is true in the 2020s LOL) but humans were indeed working on novel lab-created corona viruses, obviously, and students could be graded on attempting to help research. Not sure if throwing half the undergrad class into research would net speed things up or net slow things down. Remember some of those kids would have had their credentials in a couple years in the old days, so they'll be of help. Of course the deadweight will be deadweight, but that's been the case for all of human history so maybe no loss?

    So, in summary, delete the lower level classes from higher education for the freshmen and sophomores, and toss all the juniors and seniors directly into cutting edge research.
