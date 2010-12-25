AI favors texts written by other AIs, even when they're worse than human ones:
As many of you already know, I'm a university professor. Specifically, I teach artificial intelligence at UPC.
Each semester, students must complete several projects in which they develop different AI systems to solve specific problems. Along with the code, they must submit a report explaining what they did, the decisions they made, and a critical analysis of their results.
Obviously, most of my students use ChatGPT to write their reports.
So this semester, for the first time, I decided to use a language model myself to grade their reports.
The results were catastrophic, in two ways:
- The LLM wasn't able to follow my grading criteria. It applied whatever criteria it felt like, ignoring my prompts. So it wasn't very helpful.
- The LLM loved the reports clearly written with ChatGPT, rating them higher than the higher-quality reports written by students.
In this post, I'll share my thoughts on both points. The first one is quite practical; if you're a teacher, you'll find it useful. I'll include some strategies and tricks to encourage good use of LLMs, detect misuse, and grade more accurately.
The second one... is harder to categorize and would probably require a deeper study, but I think my preliminary observations are fascinating on their own.
[...] If you're a teacher and you're thinking of using LLMs to grade assignments or exams, it's worth understanding their limitations.
We should think of a language model as a "very smart intern": fresh out of college, with plenty of knowledge, but not yet sure how to apply it in the real world to solve problems. So we must be extremely detailed in our prompts and patient in correcting its mistakes—just as we would be if we asked a real person to help us grade.
In my tests, I included the full project description, a detailed grading rubric, and several elements of my personal judgment to help it understand what I look for in an evaluation.
[...] The usual hallucinations began—the kind I thought were mostly solved in newer model versions. But apparently not: it was completely making up citations from the reports.
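One cheap guard against made-up citations is to check mechanically that every passage the model puts in quotation marks actually appears in the report. The sketch below is only an illustration, not something the original post describes: the function names `normalize` and `unsupported_quotes` and the sample texts are invented, and it assumes the feedback marks citations with straight double quotes.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()

def unsupported_quotes(feedback, report):
    """Return quoted passages from the feedback that are not found verbatim in the report."""
    # Assumption: citations are wrapped in straight double quotes.
    quotes = re.findall(r'"([^"]+)"', feedback)
    haystack = normalize(report)
    return [q for q in quotes if normalize(q) not in haystack]

# Invented sample texts, for illustration only.
report = "We chose A* search because the heuristic\nis admissible for this grid."
feedback = 'The report says "the heuristic is admissible" and "we validated on 500 maps".'

missing = unsupported_quotes(feedback, report)
print(missing)
```

Anything this returns is a quote to check by hand; an empty list only means the quotes exist in the report, not that the model interpreted them correctly.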
[...] Soon after, it started inventing its own grading criteria. I couldn't get it to follow my rubric at all. I gave up and decided to treat its feedback simply as an extra pair of eyes, to make sure I wasn't missing anything.
[...] Instead of asking the LLM to identify AI-written texts, which it doesn't do very well, I decided to compare my own quality ratings of each project with the LLM's ratings. Basically, I wanted to see how aligned our criteria were.
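A minimal sketch of what such a comparison could look like, assuming each report has a human score, an LLM score, and a flag for whether it looks AI-written. The field names (`human`, `llm`, `suspected_ai`), the helper functions, and all the numbers are made up for illustration; the post does not say how the comparison was actually computed.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores.

    Fails with a division by zero if either list is constant.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def mean_gap(reports, suspected_ai):
    """Average (LLM score - human score) over one group of reports."""
    gaps = [r["llm"] - r["human"] for r in reports
            if r["suspected_ai"] == suspected_ai]
    return mean(gaps)

# Invented scores on a 0-10 scale, for illustration only.
reports = [
    {"human": 8.0, "llm": 7.5, "suspected_ai": False},
    {"human": 4.0, "llm": 8.5, "suspected_ai": True},
    {"human": 9.0, "llm": 8.0, "suspected_ai": False},
    {"human": 5.0, "llm": 9.0, "suspected_ai": True},
]

alignment = pearson([r["human"] for r in reports],
                    [r["llm"] for r in reports])
gap_ai = mean_gap(reports, suspected_ai=True)
gap_human = mean_gap(reports, suspected_ai=False)
print(alignment, gap_ai, gap_human)
```

A correlation near 1 would mean the two graders rank the reports similarly. A large positive gap in only one group would suggest the LLM is systematically more generous with that group than the human grader is.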
And I found a fascinating pattern: the AI gives artificially high scores to reports written with AI.
The models perceive LLM-written reports as more professional and of higher quality. They prioritize form over substance.
And I'm not saying that style isn't important, because it is, in the real world. But it was giving very high marks to poorly reasoned, error-filled work simply because it was elegantly written. Too elegantly... Clearly written with ChatGPT.
When I asked the model what it based its evaluation on, it said things like: "Well, the students didn't literally write [something]... I inferred it from their abstract, which was very well written."
In other words, good writing produced by one LLM leads to a good evaluation by another LLM, even if the content is wrong.
Meanwhile, good writing by a student doesn't necessarily lead to a good evaluation by an LLM.
This phenomenon has a name: corporatism.
[...] This situation gives me chills, because we have totally normalized using LLMs to filter résumés, proposals, or reports.
I don't even want to imagine how many users are accepting these evaluations without supervision and without a hint of critical thought.
If we, as humans, abdicate our responsibility as critical evaluators, we'll end up in a world dominated by AI corporatism.
A world where machines reward laziness and punish real human effort.
[...] To make sure students haven't overused ChatGPT, professors conduct short face-to-face interviews to discuss their projects.
It's the only way to ensure they've actually learned, and also the only way to be fair. If they've used the model to write more clearly and effectively but still achieved the learning objectives and understood their work, we don't penalize them.
In general, when a report smells a lot like ChatGPT, it usually means the students didn't learn much. But there are always surprises, in both directions.
Sometimes, it's legitimate use of ChatGPT as a writing assistant, which I actually encourage in class. Other times, I find reports that seem AI-written, but the students swear up and down they weren't, even after I tell them it won't affect their grade.
Maybe it's that humans are starting to write like machines.
Of course, machines have learned to write like humans—but current models still have a rigid, recognizable, and rather bland style. You can spot the overuse of bullet-pointed infinitives packed with adjectives, endless summary paragraphs, and phrasing or structures no human would naturally use.
(Score: 2) by krishnoid on Thursday December 11, @03:10PM
Which one? Did ChatGPT like what ChatGPT wrote, or was it another one that was likely trained in the same way? For example, I would *LOVE* to see how an LLM trained on content in one human language evaluates content written in another.
(Score: 3, Interesting) by Mojibake Tengu on Thursday December 11, @03:45PM
When I see a complete and correct RISC-V/64 macroassembler written by an LLM in the Haskell programming language, I'll start to believe generative AIs are actually useful for something.
Rust programming language offends both my Intelligence and my Spirit.
(Score: 4, Informative) by ikanreed on Thursday December 11, @03:57PM
Once again, this is a problem that arises from thinking that LLMs are thinking.
They are not. They are linguistic pattern recognition and regurgitation machines.
And they have a huge library of training data in the form of graded essays. And the patterns within that training set give higher grades to the jargon-filled, ponderous academic writing that also makes up the LLM's post-training, which is there to make it sound smarter to your average idiot user.
It has much weaker correlations in its latents with specific rubric guidelines than it does with sounding like an ivory tower prat.
People have got to stop treating them like they're following instructions the way a person does.