Date: 2023-03-02
from the plausible-sentence-generators-applied-to-code dept.
The Association for Computing Machinery has a post by George Neville-Neil of FreeBSD fame comparing LLMs to drunken plagiarists:
Before trying to use these tools, you need to understand what they do, at least on the surface, since even their creators freely admit they do not understand how they work deep down in the bowels of all the statistics and text that have been scraped from the current Internet. The trick of an LLM is to use a little randomness and a lot of text to Gauss the next word in a sentence. Seems kind of trivial, really, and certainly not a measure of intelligence that anyone who understands the term might use. But it's a clever trick and does have some applications.
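[Ed. note: The "little randomness and a lot of text" trick can be illustrated with a toy model. The sketch below is not how a production LLM works; it swaps in a simple bigram counter, assuming a made-up one-line corpus, to show the idea of picking the next word at random in proportion to how often it followed the current word. The function names and the corpus are hypothetical.]

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def guess_next_word(counts, word, rng):
    """Pick a follower of `word` at random, weighted by observed frequency."""
    followers = counts.get(word)
    if not followers:
        return None  # this word was never seen with a follower
    candidates = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy corpus, made up for illustration only.
CORPUS = "the cat sat on the mat and the cat ate the fish"

counts = build_bigram_counts(CORPUS)
rng = random.Random(0)  # seeded so repeated runs behave the same
print(guess_next_word(counts, "the", rng))
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" is the most likely guess but not the only possible one. Nothing in the counts records what a cat or a mat is, only which words tend to sit next to each other.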
[...] While help with proper code syntax is a boon to productivity (consider IDEs that highlight syntactical errors before you find them via a compilation), it is a far cry from SEMANTIC knowledge of a piece of code. Note that it is semantic knowledge that allows you to create correct programs, where correctness means the code actually does what the developer originally intended. KV can show many examples of programs that are syntactically, but not semantically, correct. In fact, this is the root of nearly every security problem in deployed software. Semantics remains far beyond the abilities of the current AI fad, as is evidenced by the number of developers who are now turning down these technologies for their own work.
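[Ed. note: As a small illustration of "syntactically, but not semantically, correct", here is a hypothetical bounds check. It is not taken from the article. Both functions parse and run without complaint, but only the second does what its author intended.]

```python
def is_valid_index_buggy(index, length):
    """Intended: True only if `index` can safely address `length` items."""
    # Syntactically fine, but `<=` also admits index == length,
    # which is one past the end of the buffer.
    return 0 <= index <= length

def is_valid_index(index, length):
    """Matches the intent: valid indices run from 0 to length - 1."""
    return 0 <= index < length

buffer = [10, 20, 30]
print(is_valid_index_buggy(3, len(buffer)))  # True, although buffer[3] does not exist
print(is_valid_index(3, len(buffer)))        # False
```

No syntax highlighter or compiler will flag the first version, because the mistake lies in the meaning of the comparison rather than its form. Off-by-one checks of this kind are a familiar source of out-of-bounds bugs.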
He continues by pointing out that LLMs are not only based on plagiarism, but are also unable to provide useful annotation in the comments or otherwise address the semantics of the code they swipe.
Previously:
(2024) Make Illegally Trained LLMs Public Domain as Punishment
(2024) The Open Secret Of Open Washing
(2023) A Jargon-Free Explanation of How AI Large Language Models Work
(2019) AI Training is *Very* Expensive
... and many more.
Related Stories
Tech Review reports on a study of the energy (carbon) costs of training an AI to do natural language processing and compares them to the lifecycle costs of cars:
https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
In a new paper, researchers at the University of Massachusetts, Amherst, performed a life cycle assessment for training several common large AI models. They found that the process can emit more than 626,000 pounds of carbon dioxide equivalent—nearly five times the lifetime emissions of the average American car (and that includes manufacture of the car itself).
It’s a jarring quantification of something AI researchers have suspected for a long time. “While probably many of us have thought of this in an abstract, vague level, the figures really show the magnitude of the problem,” says Carlos Gómez-Rodríguez, a computer scientist at the University of A Coruña in Spain, who was not involved in the research. “Neither I nor other researchers I’ve discussed them with thought the environmental impact was that substantial.”
In the grand scheme of things, five cars out of the millions made every year isn't a very big deal...but your faithful AC would never have guessed that it took anywhere near that much energy.
https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn't realize how powerful they had become.
Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.
Arthur T Knackerbracket has processed the following story:
If you believe Mark Zuckerberg, Meta's AI large language model (LLM) Llama 3 is open source.
It's not, despite what he says. The Open Source Initiative (OSI) spells it out in the Open Source Definition, and Llama 3's license – with clauses on litigation and branding – flunks it on several grounds.
Meta, unfortunately, is far from unique in wanting to claim that some of its software and models are open source. Indeed, the concept has its own name: open washing.
This is a deceptive practice in which companies or organizations present their products, services, or processes as "open" when they are not truly open in the spirit of transparency, access to information, participation, and knowledge sharing. This term is modeled after "greenwashing" and was coined by Michelle Thorne, an internet and climate policy scholar, in 2009.
With the rise of AI, open washing has become commonplace, as shown in a recent study. Andreas Liesenfeld and Mark Dingemanse of Radboud University's Center for Language Studies surveyed 45 text and text-to-image models that claim to be open. The pair found that while a handful of lesser-known LLMs, such as AllenAI's OLMo and BigScience Workshop + HuggingFace with BloomZ could be considered open, most are not. Would it surprise you to know that according to the study, the big-name ones from Google, Meta, and Microsoft aren't? I didn't think so.
But why do companies do this? Once upon a time, companies avoided open source like the plague. Steve Ballmer famously proclaimed in 2001 that "Linux is a cancer," because: "The way the license is written, if you use any open source software, you have to make the rest of your software open source." But that was a long time ago. Today, open source is seen as a good thing. Open washing enables companies to capitalize on the positive perception of open source and open practices without actually committing to them. This can help improve their public image and appeal to consumers who value transparency and openness.
Arthur T Knackerbracket has processed the following story:
Last year, I wrote a piece here on El Reg about being murdered by ChatGPT as an illustration of the potential harms through the misuse of large language models and other forms of AI.
Since then, I have spoken at events across the globe on the ethical development and use of artificial intelligence – while still waiting for OpenAI to respond to my legal demands in relation to what I've alleged is the unlawful processing of my personal data in the training of their GPT models.
In my earlier article, and my cease-and-desist letter to OpenAI, I stated that such models should be deleted.
Essentially, global technology corporations have decided, rightly or wrongly, the law can be ignored in their pursuit of wealth and power.
Household names and startups have, and still are, scraping the internet and media to train their models, typically without paying for it and while arguing they are doing nothing wrong. Unsurprisingly, a number of them have been fined or are settling out of court after being accused of breaking rules covering not just copyright but also online safety, privacy, and data protection. Big Tech has brought private litigation and watchdog scrutiny upon it, and potentially engendered new laws to fill in any regulatory gaps.
But for them, it's just a cost of business.
[...] After careful consideration over the time between my previous piece here on El Reg and now, I have come to a different opinion with regards to the deletion of these fruits, however. Not because I believe I was wrong, but because of moral and ethical considerations due to the potential environmental impact.
[...] In light of this information, I am forced to reconcile the ethical impact on the environment should such models be deleted under the "fruit of the poisonous tree" doctrine, and it is not something that can be reconciled as the environmental cost is too significant, in my view.
(Score: 2, Disagree) by DrkShadow on Tuesday January 28, @04:44AM
> pointing out how LLMs are not only based on plagiarism
Again and again, if LLMs are "plagiarizing" or "memorizing" or anything of the sort, then we humans have a problem - for everything that we say or draw, being based on something that we took in, is plagiarism, or copying, in exactly the same way that the LLM does it. Our neurons have weighted links, grouped by similarity, layers on layers, just like LLM networks.
LLMs learn, like human brains learn. If we start equating learning and plagiarism -- just wait for that whole new kind of rent-seeking from capitalists.. everything that you intentionally read or heard or saw will, forever more, subject you to royalty payments.
It's not a matter of one being more complex than the other. LLM's adaptation of information is so complex and complete that, lawsuits have revealed, it's *simply not possible* to get the source material from the evolved result of the model. The brain and the LLM are the same, with respect to processing and learning information.