Make Illegally Trained LLMs Public Domain as Punishment

posted by hubie on Wednesday December 25, @09:47PM

Arthur T Knackerbracket has processed the following story:

Last year, I wrote a piece here on El Reg about being murdered by ChatGPT as an illustration of the potential harms arising from the misuse of large language models and other forms of AI.

Since then, I have spoken at events across the globe on the ethical development and use of artificial intelligence – while still waiting for OpenAI to respond to my legal demands in relation to what I've alleged is the unlawful processing of my personal data in the training of their GPT models.

In my earlier article, and my cease-and-desist letter to OpenAI, I stated that such models should be deleted.

Essentially, global technology corporations have decided, rightly or wrongly, that the law can be ignored in their pursuit of wealth and power.

Household names and startups alike have been, and still are, scraping the internet and media to train their models, typically without paying for it and while arguing they are doing nothing wrong. Unsurprisingly, a number of them have been fined or are settling out of court after being accused of breaking rules covering not just copyright but also online safety, privacy, and data protection. Big Tech has brought private litigation and watchdog scrutiny upon itself, and has potentially engendered new laws to fill in any regulatory gaps.

But for them, it's just a cost of business.

[...] However, after careful consideration in the time between my previous piece here on El Reg and now, I have come to a different opinion with regard to the deletion of these models, the fruits of that poisonous tree. Not because I believe I was wrong, but because of moral and ethical considerations arising from the potential environmental impact.

[...] In light of this information, I am forced to reconcile the ethical impact on the environment should such models be deleted under the "fruit of the poisonous tree" doctrine, and it is not something that can be reconciled: the environmental cost is, in my view, too significant.

So what can we do to ensure those who scrape the web for commercial gain (in this case, to train AI models) do not profit, and do not gain an economic advantage, from such controversial activities? And if disgorgement (through deletion) is not viable for the reason given above, how can we incentivize companies to treat people’s privacy and creative work with respect, and to stay within the law, when developing products and services?

After all, if there is no meaningful consequence (as stated, today's monetary penalties are merely line items for these companies, which hold more wealth than some nations, and are therefore ineffectual as a deterrent), we will continue to see this behavior repeated ad infinitum, which simply maintains the status quo and makes a mockery of the rule of law.

It seems to me the only obvious solution here is to remove these models from the control of executives and put them into the public domain. Given they were trained on our data, it makes sense that they should be a public commons; that way we all benefit from the processing of our data, and the companies, particularly those found to have broken the law, see no benefit. The balance is restored, and we have a meaningful deterrent against those who seek to ignore their obligations to society.

Under this solution, OpenAI, if found to have broken the law, would be forced to put its GPT models in the public domain and even banned from selling any services related to those models. This would impose a significant cost on OpenAI and its backers, which have spent billions developing these models and associated services. They would face a much higher risk of never recovering those costs through revenue, which in turn would force them to do more due diligence with regard to their legal obligations.

If we then extend this model to online platforms that sell their users’ data to companies such as OpenAI, banning them from providing such access under threat of disgorgement, those platforms would also think twice before handing over personal data and intellectual property.

If we remove the ability for organizations to profit from illegal behavior, while also recognizing the ethical issues of destroying the poisonous fruit, we might finally find ourselves in a situation where companies with immense power are forced to comply with their legal obligations simply as a matter of economics.


Original Submission

Related Stories

The Drunken Plagiarists: Working with Co-pilots

The Association for Computing Machinery has a post by George Neville-Neil of FreeBSD fame comparing LLMs to drunken plagiarists:

Before trying to use these tools, you need to understand what they do, at least on the surface, since even their creators freely admit they do not understand how they work deep down in the bowels of all the statistics and text that have been scraped from the current Internet. The trick of an LLM is to use a little randomness and a lot of text to Gauss the next word in a sentence. Seems kind of trivial, really, and certainly not a measure of intelligence that anyone who understands the term might use. But it's a clever trick and does have some applications.
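
As a rough illustration of that "trick", here is a minimal sketch of ours, not anything from the ACM piece: the ten-word corpus, the bigram table, and the temperature knob are all toy assumptions standing in for a real model's learned network over tokens. Boiled down this far, next-word generation is just tempered sampling from frequency counts:

    import random
    from collections import Counter, defaultdict

    # Toy stand-in for "a lot of text": ten words instead of the scraped internet.
    corpus = "the cat sat on the mat and the cat ran".split()

    # Bigram table: for each word, count what followed it. A crude ancestor
    # of the statistics a real model learns.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def next_word(prev, temperature=1.0):
        """Sample the next word in proportion to its (tempered) frequency."""
        counts = follows[prev]
        weights = [c ** (1.0 / temperature) for c in counts.values()]
        return random.choices(list(counts), weights=weights)[0]

    # "A little randomness and a lot of text": generate a short continuation.
    word, out = "the", ["the"]
    for _ in range(6):
        word = next_word(word, temperature=1.2)
        out.append(word)
        if word not in follows:  # nothing ever followed this word; stop
            break
    print(" ".join(out))  # e.g. "the cat sat on the mat and"

Statistics plus randomness, no understanding; scale that table up by twelve orders of magnitude and swap the counts for a neural network, and the principle is the same.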

[...] While help with proper code syntax is a boon to productivity (consider IDEs that highlight syntactical errors before you find them via a compilation), it is a far cry from SEMANTIC knowledge of a piece of code. Note that it is semantic knowledge that allows you to create correct programs, where correctness means the code actually does what the developer originally intended. KV can show many examples of programs that are syntactically, but not semantically, correct. In fact, this is the root of nearly every security problem in deployed software. Semantics remains far beyond the abilities of the current AI fad, as is evidenced by the number of developers who are now turning down these technologies for their own work.

He continues by pointing out that LLMs are not only based on plagiarism, they are unable to provide useful annotation in the comments or otherwise address the semantics of the code they swipe.
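
To make the syntax-versus-semantics point concrete, here is a hypothetical sketch (ours, not KV's; the function names and the bug are invented for illustration). Both functions below are syntactically correct Python and run without complaint, but only one is semantically correct; the other carries the kind of off-by-one bounds error that underlies many real security holes:

    # Both functions parse and run cleanly; the parser (and, too often,
    # an LLM) is equally happy with both.

    def in_bounds_buggy(index, buffer):
        # Intended: accept only valid indexes. But `<=` also accepts
        # len(buffer), one past the end: a classic off-by-one.
        return 0 <= index <= len(buffer)

    def in_bounds(index, buffer):
        return 0 <= index < len(buffer)

    buf = [10, 20, 30]
    assert in_bounds(3, buf) is False        # correctly rejected
    assert in_bounds_buggy(3, buf) is True   # "passes" the check; buf[3] blows up later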

Previously:
(2024) Make Illegally Trained LLMs Public Domain as Punishment
(2024) The Open Secret Of Open Washing
(2023) A Jargon-Free Explanation of How AI Large Language Models Work
(2019) AI Training is *Very* Expensive
... and many more.


Original Submission

This discussion was created by hubie (1068) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 0, Insightful) by Anonymous Coward on Wednesday December 25, @11:55PM (#1386450)

    "Opinion" pieces on El Reg are always such shit you should just keep on scrolling.

    So: "illegally trained" (means what?) LLM "contains" (lol they don't work like that) copyrighted material, and that means it has to be "open sourced". And what does *that* mean? Then, every output from this LLM is free of copyright encumbrance? So given any input, *anything* that it produces is fair game to be used?

    Great! Let's go! According to the opinion-head-piece, maybe we can call this copyright-washing media. Of course, that'll result in another (worthless) opinion.

  • (Score: 1) by MonkeypoxBugChaser (17904) on Thursday December 26, @12:40AM (#1386452)

    I heard the real Alexander Hanff died in 2019. Who is this loser?

    • (Score: 2) by mcgrew (701) <publish@mcgrewbooks.com> on Thursday December 26, @01:48AM (#1386457)

      And was Paul Simon a US Senator or a pop singer?

      --
      It is a disgrace that the richest nation in the world has hunger and homelessness.
      • (Score: 1, Funny) by Anonymous Coward on Thursday December 26, @04:36AM (#1386465)

        And was Paul Simon a US Senator or a pop singer?

        It's a floor wax and a dessert topping!

  • (Score: 2) by mcgrew (701) <publish@mcgrewbooks.com> on Thursday December 26, @01:46AM (#1386456)

    Tonight's episode had an excellent piece on AI, with some of the people making AI chatbots saying much the same as I've said here [soylentnews.org], in different words. 12/25/24 if you have the app.

    --
    It is a disgrace that the richest nation in the world has hunger and homelessness.
  • (Score: 5, Insightful) by ElizabethGreene (6748) on Thursday December 26, @03:53PM (#1386485)

    I suspect the authors of the works stolen to train the LLMs would object here. This would permanently incorporate their work into the public domain with no compensation.

    OpenAI should pay for the works they stole to train the models. That means licensing every single book in the illegally acquired Books3 dataset.

    Figuring out who needs to get paid, if anyone, for scraped web content is harder. Reddit, for example, had its content displayed for public view, and I don't oppose an individual or an LLM learning from it. There's an argument to be made about that content, but it's much fuzzier than the 100% solid you-broke-the-law case of using illegally downloaded copyrighted published books in the training set.
