
from the plausible-sentence-generators-applied-to-code dept.
The Association for Computing Machinery has a post by George Neville-Neil of FreeBSD fame comparing LLMs to drunken plagiarists:
Before trying to use these tools, you need to understand what they do, at least on the surface, since even their creators freely admit they do not understand how they work deep down in the bowels of all the statistics and text that have been scraped from the current Internet. The trick of an LLM is to use a little randomness and a lot of text to Gauss the next word in a sentence. Seems kind of trivial, really, and certainly not a measure of intelligence that anyone who understands the term might use. But it's a clever trick and does have some applications.
[...] While help with proper code syntax is a boon to productivity (consider IDEs that highlight syntactical errors before you find them via a compilation), it is a far cry from SEMANTIC knowledge of a piece of code. Note that it is semantic knowledge that allows you to create correct programs, where correctness means the code actually does what the developer originally intended. KV can show many examples of programs that are syntactically, but not semantically, correct. In fact, this is the root of nearly every security problem in deployed software. Semantics remains far beyond the abilities of the current AI fad, as is evidenced by the number of developers who are now turning down these technologies for their own work.
He continues by pointing out how LLMs are not only based on plagiarism, they are also unable to provide useful annotation in the comments or otherwise address the semantics of the code they swipe.
Previously:
(2024) Make Illegally Trained LLMs Public Domain as Punishment
(2024) The Open Secret Of Open Washing
(2023) A Jargon-Free Explanation of How AI Large Language Models Work
(2019) AI Training is *Very* Expensive
... and many more.
Related Stories
Tech Review reports on a study of the energy (carbon) costs of training an AI to do natural language processing and compares them to the lifetime emissions of cars:
https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
In a new paper, researchers at the University of Massachusetts, Amherst, performed a life cycle assessment for training several common large AI models. They found that the process can emit more than 626,000 pounds of carbon dioxide equivalent—nearly five times the lifetime emissions of the average American car (and that includes manufacture of the car itself).
It’s a jarring quantification of something AI researchers have suspected for a long time. “While probably many of us have thought of this in an abstract, vague level, the figures really show the magnitude of the problem,” says Carlos Gómez-Rodríguez, a computer scientist at the University of A Coruña in Spain, who was not involved in the research. “Neither I nor other researchers I’ve discussed them with thought the environmental impact was that substantial.”
In the grand scheme of things, five cars out of the millions made every year isn't a very big deal...but your faithful AC would never have guessed that it took anywhere near that much energy.
When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn't realize how powerful they had become.
Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.
Arthur T Knackerbracket has processed the following story:
If you believe Mark Zuckerberg, Meta's AI large language model (LLM) Llama 3 is open source.
It's not, despite what he says. The Open Source Initiative (OSI) spells it out in the Open Source Definition, and Llama 3's license – with clauses on litigation and branding – flunks it on several grounds.
Meta, unfortunately, is far from unique in wanting to claim that some of its software and models are open source. Indeed, the concept has its own name: open washing.
This is a deceptive practice in which companies or organizations present their products, services, or processes as "open" when they are not truly open in the spirit of transparency, access to information, participation, and knowledge sharing. This term is modeled after "greenwashing" and was coined by Michelle Thorne, an internet and climate policy scholar, in 2009.
With the rise of AI, open washing has become commonplace, as shown in a recent study. Andreas Liesenfeld and Mark Dingemanse of Radboud University's Center for Language Studies surveyed 45 text and text-to-image models that claim to be open. The pair found that while a handful of lesser-known LLMs, such as AllenAI's OLMo and BigScience Workshop + HuggingFace with BloomZ could be considered open, most are not. Would it surprise you to know that according to the study, the big-name ones from Google, Meta, and Microsoft aren't? I didn't think so.
But why do companies do this? Once upon a time, companies avoided open source like the plague. Steve Ballmer famously proclaimed in 2001 that "Linux is a cancer," because: "The way the license is written, if you use any open source software, you have to make the rest of your software open source." But that was a long time ago. Today, open source is seen as a good thing. Open washing enables companies to capitalize on the positive perception of open source and open practices without actually committing to them. This can help improve their public image and appeal to consumers who value transparency and openness.
Arthur T Knackerbracket has processed the following story:
Last year, I wrote a piece here on El Reg about being murdered by ChatGPT as an illustration of the potential harms through the misuse of large language models and other forms of AI.
Since then, I have spoken at events across the globe on the ethical development and use of artificial intelligence – while still waiting for OpenAI to respond to my legal demands in relation to what I've alleged is the unlawful processing of my personal data in the training of their GPT models.
In my earlier article, and my cease-and-desist letter to OpenAI, I stated that such models should be deleted.
Essentially, global technology corporations have decided, rightly or wrongly, the law can be ignored in their pursuit of wealth and power.
Household names and startups have been, and still are, scraping the internet and media to train their models, typically without paying for it and while arguing they are doing nothing wrong. Unsurprisingly, a number of them have been fined or are settling out of court after being accused of breaking rules covering not just copyright but also online safety, privacy, and data protection. Big Tech has brought private litigation and watchdog scrutiny upon itself, and potentially engendered new laws to fill in any regulatory gaps.
But for them, it's just a cost of business.
[...] After careful consideration over the time between my previous piece here on El Reg and now, I have come to a different opinion with regards to the deletion of these fruits, however. Not because I believe I was wrong, but because of moral and ethical considerations due to the potential environmental impact.
[...] In light of this information, I am forced to reconcile the ethical impact on the environment should such models be deleted under the "fruit of the poisonous tree" doctrine, and it is not something that can be reconciled as the environmental cost is too significant, in my view.
(Score: 5, Insightful) by DrkShadow on Tuesday January 28, @04:44AM (15 children)
> pointing out how LLMs are not only based on plagiarism
Again and again: if LLMs are "plagiarizing" or "memorizing" or anything of the sort, then we humans have a problem, because everything we say or draw is based on something we took in, and so is plagiarism, or copying, in exactly the same way the LLM does it. Our neurons have weighted links, grouped by similarity, layers on layers, just like LLM networks.
LLMs learn, like human brains learn. If we start equating learning and plagiarism -- just wait for that whole new kind of rent-seeking from capitalists: everything you ever intentionally read or heard or saw will, forevermore, subject you to royalty payments.
It's not a matter of one being more complex than the other. An LLM's adaptation of information is so complex and complete that, as lawsuits have revealed, it's *simply not possible* to recover the source material from the evolved result of the model. The brain and the LLM are the same, with respect to processing and learning information.
(Score: 5, Insightful) by canopic jug on Tuesday January 28, @05:23AM (5 children)
Large Language Models shorten; they don't (and can't) summarize. (Sorry, no link handy for the difference between shortening and summarization.) It's just their nature.
LLMs can even be built in SQL [explainextended.com], they're that simple. Throwing more data and more CPUs at them only uses more electricity without introducing any learning or comprehension. It's all about being statistical models and not intelligence at any level.
It's not accurate to equate LLMs with brains. The neural nets there, if any, juggle statistical probabilities, not inferences. For there to be comprehension and learning, there has to be intelligence, and that is just not a characteristic of Large Language Models. LLMs are merely statistics-based plausible sentence generators which produce grammatically correct sentences through the statistical probability of word sequences; that has nothing to do with semantics or facts, let alone learning. Again, see the SQL link in the previous paragraph.
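The SQL article linked above makes the same point with database queries; here is the same idea as a tiny, purely illustrative C sketch. The word counts are invented, and a real LLM uses a neural network over long token contexts rather than a bigram lookup, but the mechanism being caricatured is the same: a weighted random pick of the next word from observed word-pair frequencies, with no semantics anywhere.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical bigram statistics "scraped" from some text corpus. */
struct bigram { const char *prev, *next; int count; };
static const struct bigram table[] = {
    {"the",  "cat",   9}, {"the",  "code", 4},
    {"cat",  "sat",   7}, {"cat",  "ran",  3},
    {"code", "works", 5}, {"code", "sat",  1},
    {"sat",  "the",   2}, {"ran",  "the",  2},
    {"works", "the",  1},
};
static const int n = sizeof table / sizeof table[0];

/* Sample the next word given the previous one, proportional to its count
 * (the "little randomness and a lot of text" trick). */
static const char *next_word(const char *prev)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(table[i].prev, prev) == 0)
            total += table[i].count;
    if (total == 0)
        return NULL;                /* no known continuation */
    int r = rand() % total;
    for (int i = 0; i < n; i++) {
        if (strcmp(table[i].prev, prev) != 0)
            continue;
        if (r < table[i].count)
            return table[i].next;
        r -= table[i].count;
    }
    return NULL;                    /* not reached */
}

int main(void)
{
    srand((unsigned)time(NULL));
    const char *w = "the";          /* seed word, i.e. the "prompt" */
    for (int i = 0; i < 8 && w; i++) {
        printf("%s ", w);
        w = next_word(w);
    }
    putchar('\n');                  /* grammatical-ish output, zero understanding */
    return 0;
}

Scaling this up means more rows and longer contexts, not a different kind of machine.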
However, there is a lot of hype and overpromising. Yet LLMs have already maxed out in capabilities. Again, we've already seen that throwing more electricity and data at the problem cannot help. Furthermore, due to these inherent limitations they do not and cannot provide any useful service, unless their ability to quickly produce disinformation at scale can be considered useful.
Money is not free speech. Elections should not be auctions.
(Score: 5, Informative) by DrkShadow on Tuesday January 28, @05:44AM (2 children)
Sorry, you've probably hit a fundamental thing that I wasn't considering.
I really meant Deep Learning, and/or neural nets. I tend to use the terms interchangeably.. I'm not familiar enough with LLMs to know whether they are(n't) the former. : )
Even so, I feel that as long as LLMs are statistical inference machines, they are not copying or storing any static, singular, or whole things -- nor fragments thereof -- and so cannot be plagiarizing. There's no sequence within them, without external instruction, that would generate an original entry.
(Score: 3, Insightful) by shrewdsheep on Tuesday January 28, @08:56AM (1 child)
Wouldn't this be a prerequisite to engage in the discussion meaningfully?
Well, they are perfectly capable of "copying or storing any static, singular, or whole things - or fragments thereof".
A meaningful starting point is to define plagiarism. To me, reusing information provided by others and having been produced by mental and creative effort to produce own work without adding own mental or creative contribution without permission or attribution constitutes plagiarism. LLMs check those boxes.
(Score: 2, Flamebait) by DrkShadow on Tuesday January 28, @04:54PM
Could you.. first.. try and put that into english? Seriously though - try and read that..
Then you might need to define "mental" and "creative effort," at least some part of which it sounds like you're defining as "biological".
(Score: 4, Interesting) by sjames on Tuesday January 28, @03:24PM (1 child)
LLMs behave very much as might be expected if somehow the language areas of the brain were excised and maintained in a jar (damage and all!). They often resemble a person with fluent aphasia. It's not plagiarism, but its usefulness is very much questionable in most areas.
LLMs seem to be good at writing ad-copy (almost always disinformation anyway) and getting lawyers in trouble (also often disinformation but it's supposed to be wrapped in truth and logic).
Perhaps the real problem is that a lot of what passes for erudition in the modern world is actually just monkey chatter and LLMs are pulling the curtain back.
(Score: 2) by VLM on Tuesday January 28, @03:44PM
I would agree with your remarks and extend them: I've seen a lot of behavior at big corporates that boils down to one dept entering a project request for another dept to build a flux capacitor into a DeLorean, and the pushback being "hmm OK very interesting, write a 200 page project spec and get back to us". Now an LLM will "write" that 200 page project spec. My point being that inside a megacorp, any department that does not have an LLM will be crushed underneath the other departments, but any megacorp that does not ban the use of LLMs for internal warfare will be crushed by their external competition.
All you need is one competitor in a market not using LLMs and their productivity will be higher and they'll crush the LLM-using competition. Weirdly it's probably reversed for small companies, but small companies don't have the money to pay for LLMs.
Everyone wants to be the Broadcom/VMware of LLMs and start billing CAD/CAM-level SaaS fees for every employee at huge corporations. Ironically, I don't think anyone will be that.
(Score: 4, Insightful) by Anonymous Coward on Tuesday January 28, @09:58AM (7 children)
And "normal humans" can and do get in trouble for copyright infringement when we redistribute other people's stuff without proper permission/licensing. Even if it was already publicly available on the Internet.
In contrast, the AI bunch will get a free pass "because AI"? They're redistributing that stuff and often for profit.
If infringement is allowed just because the copies are not 100% identical to the originals then please point to the relevant laws. I'm sure some pirate sites will be happy to abuse those laws.
See also: https://en.wikipedia.org/wiki/Rogers_v._Koons [wikipedia.org]
(Score: 2) by PiMuNu on Tuesday January 28, @02:21PM (5 children)
> the AI bunch will get a free pass "because AI"?
Because in general they cover their tracks well enough that we can't go from derived work to the source.
The exception may occur if I say "reproduce a story by $AUTHOR". If the AI can reproduce a convincing work by $AUTHOR, it seems that this is evidence of plagiarism - I don't know whether the courts agree.
(Score: 2) by sjames on Tuesday January 28, @03:12PM (1 child)
If you can't get from the derived work to the source, it is NOT plagiarism.
Producing an original work in the style of another is NOT plagiarism.
(Score: 2) by PiMuNu on Tuesday January 28, @03:47PM
Unless you use the same (named) characters and setting. The minute Frodo or Rivendell turns up, it's copyrighted.
(Score: 2) by VLM on Tuesday January 28, @03:17PM (2 children)
Generally they don't. There's been a booming trade in what amounts to fanfic for decades now. Tons of "Cthulhu Mythos-Universe" stuff gets shoveled out that tries to imitate HPL's style. Same for Tolkien and others. It seems to have been tradition for decades now that if a scifi or fantasy author dies with an unfinished series in the works, someone takes up the torch and finishes it off -- usually a family member who's not going to sue himself, but not always.
It's usually not very good stuff, but it does make "some" revenue.
I think the future will look a lot like the latest LOTR movie. AFAIK it has no connection to the Tolkien family other than they got a licensing fee and the writers read his work. I saw it; it's NOT great but it's not the worst two hours of my life either. Better than a night watching TV at home. The budget was $30M and the total revenue so far is $20M, oops. Now the reason I see it as the future is you could turn that $10M loss into a near $20M profit merely by replacing the entire production process with asking an AI LLM to generate an Anime LOTR movie about the old days of Rohan. Or break even with 50:50 LLM-generated vs human-generated.
If it hadn't been released the same day as some kids' movie sequel, it probably would have done better? The characters were weak, an LLM couldn't do worse. The animation and graphics art was pretty amazing in a stylistic sense. I think the execs were trying for an artistic interpretation of "long ago and far away" and the audience didn't get "it" and felt it was just shitty production; but it was very careful shitty production, like designer jeans with factory-made rips and cuts and stains.
Anyway, the latest LOTR movie could have run a profit if they'd cheaped out and LLM'd about $10M worth of its production. They wouldn't even have to do the whole thing via LLM, just 1/3. I see a lot more slop in our future. As slop goes, it's not even bad slop; it was an enjoyable-enough 2 hours, although I'd not pay to repeat it...
(Score: 2) by PiMuNu on Tuesday January 28, @03:50PM (1 child)
> > I don't know whether the courts agree.
> Generally they don't.
> the latest LOTR movie... no connection to the Tolkien family other than they got a licensing fee and the writers read his work
On the other hand, you assert the LOTR movie did pay a license fee, so in this sense they did pay off the copyright lawyers. In a legal sense they covered their arse.
(Score: 2) by VLM on Tuesday January 28, @04:38PM
Yes they paid to use a lot of licensed IP, characters, actors, voices, images, art, right from the old movies.
I suspect if they used an LLM to produce cheaper slop they'd make more money than trying to lure in viewers via "AI write me a movie script about the adventures of plains horsemen in the old days"
(Score: 3, Insightful) by DannyB on Tuesday January 28, @02:50PM
There are various factors that determine if something is copyright infringement under law.
Just being similar may not be copyright infringement unless it damages the marketability of the original work.
It is humans, not AI, that engage in copyright infringement. AI is just a tool. Like Photoshop is just a tool. Or BitTorrent is just a tool.
Stop asking "How stupid can you be?" Some people apparently take it as a challenge.
(Score: 0) by Anonymous Coward on Tuesday January 28, @02:26PM (3 children)
These things only work if you already know the answer. They literally train on how to map input to output, which means you must have both. The internal mechanisms are impossible to figure out - a sequence of switches, perhaps thousands.
I like to compare it with linear algebra: solving Ax = b. Classically we put all our knowledge into the "system" matrix A and use a general inverse that is valid for all possible A's. The NN infers A (or a nonlinear version of it) by training on a vast number of x's and b's. This is still a very small subspace compared to the general space containing these vectors. The NN is a niche solution valid only in a tiny subspace, but it learns system non-idealities, valid in that niche space, that are too hard to model.
As long as you know the answer you want (biases and bullshit included) then you can generate a mapping that will reproduce x from a given b. It has no guarantees of anything, unlike general inverses which at least try to assert optimality under certain classes of error (mainly additive Gaussian noise).
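As a deliberately tiny, hypothetical version of that comparison, here is a C sketch where the "system" is reduced to a single coefficient and, instead of being handed A, we infer it from example (x, b) pairs. Ordinary least squares stands in for the training step (a real NN fits millions of coefficients rather than one); the data and the "true" coefficient are invented for illustration. The only point is that the learned mapping is shaped entirely by the examples it was fed, biases and all.

#include <stdio.h>

int main(void)
{
    /* Training pairs sampled from a narrow range of inputs.
     * The (unknown to the learner) true system is roughly b = 3*x + noise. */
    const double x[] = {1.0, 2.0, 3.0, 4.0};
    const double b[] = {3.1, 5.9, 9.2, 11.8};
    const int n = sizeof x / sizeof x[0];

    /* Fit a single coefficient a minimizing sum((a*x - b)^2):
     * a = sum(x*b) / sum(x*x). */
    double sxb = 0.0, sxx = 0.0;
    for (int i = 0; i < n; i++) {
        sxb += x[i] * b[i];
        sxx += x[i] * x[i];
    }
    double a = sxb / sxx;                 /* the "learned" system */

    printf("learned a = %.3f (true system used 3.0)\n", a);

    /* The fit only reflects the training pairs: it carries no guarantee
     * for inputs far outside the range it saw, and it happily absorbs
     * whatever bias those pairs contained. */
    printf("prediction at x = 100: %.1f\n", a * 100.0);
    return 0;
}

A general inverse would work for any A you hand it; the fitted coefficient only "knows" the examples it was trained on.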
(Score: 2) by DannyB on Tuesday January 28, @02:58PM (1 child)
Have you read about the DeepSeek R1 [github.com] which the Chinese just released?
This open source AI crushes everything - DeepSeek R1 [youtube.com]
Stop asking "How stupid can you be?" Some people apparently take it as a challenge.
(Score: 2) by VLM on Tuesday January 28, @03:29PM
There's some six-of-one vs. half-a-dozen going on.
If your RL signal is generated by externally matching the output to a goal (social media updoots, but for AI, more or less) and feeding it back, that's just implementing ghetto SFT with extra steps to pretend it's not SFT.
There's a philosophical difference where with SFT you match to an answer you already know and with RL you match to a response you haven't gotten yet, but from a distance it's "about the same".
As far as I know there's no reason to anticipate you'd get worse results if you sprinkle small amounts of SFT on top; I think the point of your quote is they're bragging it's not even as good as it could get because it's half-baked. Or bragging that their RL is SO much better that they kick butt over competitors' RL+SFT productions (perhaps with an implication that their superior RL plus some unreleased SFT will be REALLY REALLY good)
(Score: 2) by DrkShadow on Tuesday January 28, @05:16PM
> I like to compare with linear algebra solving Ax=b.
But this doesn't fully capture it. You would get random data -- you have more unknowns than knowns. You need to either put in your prompt,
A_1 x = b_1
Or you need to include your prompt in the X,
A x p = b
Without the prompt, there is no output at all. If you get a plagiarized work OUT, it's because you built a prompt specifically to massage a collection of words into your desired output. This isn't plagiarism on the part of the machine -- this is intentional production of output on the part of the prompter.
(Score: 3, Interesting) by VLM on Tuesday January 28, @03:03PM
I mostly agree with the guy, however there's a competency crisis, for a variety of reasons, and a noob's idea of semantics is an experienced programmer's idea of mere syntax. I'm thinking of the immense gulf between a noob who can barely comprehend the very idea of a buffer overflow vs an old timer who literally typo'd or brain-fogged a strcpy instead of a strncpy, or similar "knows better but made a typo" level of mistake.
Most software is made by inexperienced noobs, and stack on top of that the fact that most security mistakes are made by even more inexperienced noobs, and it's not looking good out there.
I find it amusing that in 2025 we're still teaching Arduino noobs that strcpy.
The irony is if you know your problem space and your attacker's abilities, any of strcpy, strncpy or snprintf work. But if you make some mistakes about zero terminators strncpy has issues, and there's issues with snprintf and the 3-arg call vs the 4-arg call (something about if you don't format-specify a %s you can theoretically stack smash if the attacker can control what's passed to a 3-arg snprintf -- from memory, probably incorrectly).
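For anyone who hasn't hit these particular bugs, here is a small, illustrative C sketch of the failure modes being described: the unbounded strcpy, the strncpy missing-terminator trap, and the 3-argument vs 4-argument snprintf distinction (attacker-influenced text used as the format string). It's a sketch, not a security guide; as the comment says, which variant is actually safe depends on knowing your problem space.

#include <stdio.h>
#include <string.h>

static void demo(const char *user_input)
{
    char buf[8];

    /* strcpy: no length check at all; overflows buf if user_input is long. */
    /* strcpy(buf, user_input); */          /* classic buffer overflow */

    /* strncpy: bounded, but if user_input fills the buffer it does NOT
     * write a terminating '\0', so later reads can run off the end. */
    strncpy(buf, user_input, sizeof buf);
    buf[sizeof buf - 1] = '\0';             /* the terminator people forget */

    /* snprintf, 3-argument style: user_input becomes the FORMAT string,
     * so any %s or %n it contains is interpreted -- a format-string bug. */
    /* snprintf(buf, sizeof buf, user_input); */

    /* snprintf, 4-argument style: bounded and always NUL-terminated. */
    snprintf(buf, sizeof buf, "%s", user_input);

    printf("%s\n", buf);
}

int main(void)
{
    demo("AAAAAAAAAAAAAAAA");               /* deliberately longer than buf */
    return 0;
}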
The final security thing that LLMs probably won't help with is when "just screwing around on -dev using strcpy" somehow magically gets promoted, due to management failure, to -prod. You could say all code you ever write should be infinitely 100% defended secure, but IRL there's too much experimenting that's only temporary-unless-it-works that gets promoted to -prod, and then everyone looks confused about how that ever got into -prod or got shipped...