posted by requerdanos on Wednesday July 12 2023, @09:02AM
from the regurgitation dept.

https://arstechnica.com/information-technology/2023/07/book-authors-sue-openai-and-meta-over-text-used-to-train-ai/

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

[...] Authors claim that by utilizing "flagrantly illegal" data sets, OpenAI allegedly infringed copyrights of Silverman's book The Bedwetter, Golden's Ararat, and Kadrey's Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as "several" other titles from Golden and Kadrey.

[...] Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as "Meta plans to make the next version of LLaMA commercially available." In addition to other damages, the authors are asking for restitution of alleged profits lost.

"Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plain­tiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation," Saveri and Butterick wrote in their press release.


Original Submission

 
  • (Score: 4, Insightful) by canopic jug on Wednesday July 12 2023, @12:30PM (9 children)

    by canopic jug (3949) Subscriber Badge on Wednesday July 12 2023, @12:30PM (#1315685) Journal

    It goes far beyond just scraping. The LLMs make a local copy, strip attribution and copyright information, and then regurgitate the result as original work. It's plagiarism as a service, but set up so that one can blame "the algorithm" long enough to confuse technologically inept legislators, lawyers, jurors, and judges.
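
    To make that concrete: a typical scraping pipeline keeps only the body text and discards the surrounding metadata. Below is a minimal sketch in Python of such an extraction step (the HTML structure and function are hypothetical and purely illustrative, not any vendor's actual pipeline):

        # Hypothetical extraction step: only the body text survives into
        # the training corpus; bylines and copyright notices are dropped.
        from bs4 import BeautifulSoup

        def extract_training_text(html: str) -> str:
            soup = BeautifulSoup(html, "html.parser")
            # Remove navigation, byline, and footer/copyright blocks...
            for tag in soup.find_all(["nav", "header", "footer"]):
                tag.decompose()
            # ...and keep only the main text. Attribution never makes it
            # into the dataset, so the model cannot reproduce it later.
            body = soup.find("article") or soup.body
            return body.get_text(separator=" ", strip=True) if body else ""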

    --
    Money is not free speech. Elections should not be auctions.
  • (Score: 5, Interesting) by looorg on Wednesday July 12 2023, @02:03PM (2 children)

    by looorg (578) on Wednesday July 12 2023, @02:03PM (#1315692)

    That is in part what is so baffling about it all. Most, if not all, of the people involved in this LLM/AI whatever come from the academic world, where referencing is everything. Is anyone really working on this who doesn't have at least a Masters degree or a PhD? They might be used to trying to do things on the cheap and free, but you never skimp on referencing. After all, being caught out as a plagiarist is basically a career-ending offense in some fields.

    • (Score: 3, Touché) by Freeman on Wednesday July 12 2023, @04:14PM (1 child)

      by Freeman (732) on Wednesday July 12 2023, @04:14PM (#1315718) Journal

      Nothing to be baffled about: $.$ vision.

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 3, Insightful) by DeathMonkey on Wednesday July 12 2023, @07:16PM

        by DeathMonkey (1380) on Wednesday July 12 2023, @07:16PM (#1315746) Journal

        Also, the people building the engine and the people scraping the web into an engine are different people.

  • (Score: 0) by Anonymous Coward on Wednesday July 12 2023, @02:25PM

    by Anonymous Coward on Wednesday July 12 2023, @02:25PM (#1315694)

    Eh, just slap a "for personal use only: not for reselling or relicensing, or other professional use" at the bottom of the ChatGPT output page.

  • (Score: 4, Interesting) by ElizabethGreene on Wednesday July 12 2023, @03:30PM (3 children)

    by ElizabethGreene (6748) Subscriber Badge on Wednesday July 12 2023, @03:30PM (#1315706) Journal

    I don't think this is how it works.

    Could you suggest a prompt that will reproduce a substantial portion of Silverman's work in the answer, or point to where in the open LLaMA model her work is present without attribution or copyright information? The model is a series of weights in hidden layers on a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

    Following the idea of copying biology, an LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book. I don't consider reading a book to be infringement, IFF (if and only if) Meta, Google, or whoever trained the model had a legal copy of the source book. If I memorized large portions of the content word-for-word and then reproduced them, that would be infringement, but current LLMs struggle to do that.

    A valid counterpoint here would be that the LLM is creating a derivative work, but the Copyright Office says a machine can't create a work; only the human operating it can.
    Another valid counterpoint would be that the LLM *is* a derivative work, falling under the definition "such as a translation, [...], abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted." That's a reasonable assertion, and I think it's the one the court will have to decide.
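
    For what it's worth, "a series of weights" is literal: a trained model is just arrays of floating-point numbers. A toy sketch of a single hidden layer in NumPy (nowhere near real LLM scale, for illustration only):

        import numpy as np

        # A "model" is nothing but numeric weight matrices; there is no
        # database of source texts inside. A real LLM has billions of
        # these values, but it is structurally the same idea.
        rng = np.random.default_rng(0)
        W1 = rng.normal(size=(8, 16))  # input -> hidden weights
        W2 = rng.normal(size=(16, 4))  # hidden -> output weights

        def forward(x):
            hidden = np.maximum(x @ W1, 0.0)  # ReLU activation
            return hidden @ W2                # raw output scores

        print(forward(rng.normal(size=(1, 8))))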

    • (Score: 3, Interesting) by Anonymous Coward on Wednesday July 12 2023, @04:40PM

      by Anonymous Coward on Wednesday July 12 2023, @04:40PM (#1315721)

      [...] The model is a series of weights in hidden layers on a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

      Following the idea of copying biology, an LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book.

      I know that is how neural nets have been colloquially described since at least the 90s, but I don't think one can make a compelling case that that's how our brains work in general. These LLMs have only shown recent success because of the enormous number of connections and the unimaginable amount of data needed to train them, with perfect recall. Our brains seem to function much better using much less, so much so that some feel these massive weighted-"neuron" brute-force approaches are not the right path, and that if/when AGI is achieved, it will be using much more modest hardware and software.

    • (Score: 3, Interesting) by Mykl on Wednesday July 12 2023, @11:22PM

      by Mykl (1112) on Wednesday July 12 2023, @11:22PM (#1315797)

      IFF Meta, Google, or whoever trained the model had a legal copy of the source book

      I think this is the main thrust of the lawsuit: it's doubtful that the trainers can produce a receipt for Silverman's book, or for the million-plus other books that have been fed into the machine. If they could show the receipts, there wouldn't be a lawsuit.

      Assuming that the trainers have legal access to the source material, I would be fine with it spitting out small excerpts (fair use), summaries, or even getting the AI to write "in the style of".

    • (Score: 1) by shrewdsheep on Thursday July 13 2023, @12:53PM

      by shrewdsheep (5215) on Thursday July 13 2023, @12:53PM (#1315907)

      All the weights represent a transformation of the data. It is just a mapping. All references to biology are a mere vague analogy at this point (ReLU is not remotely what a neuron does). These models (transformations) have been shown to exhibit nearest-neighbor characteristics (i.e., outputting the nearest neighbor in the training data) in several cases, and we discussed the Google paper showing that some input images are almost completely stored in the network weights here (sorry, couldn't find the story). While I agree that literal reproduction is infrequent, the unpredictable nature of such behavior lays the burden of proof at the feet of the model, IMO.
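
      To illustrate what "nearest-neighbor characteristics" means: on some inputs the model's output is effectively the training example closest to the query, as if it were doing a similarity lookup. A rough sketch with random stand-in embeddings (purely illustrative, not how any particular model is implemented):

          import numpy as np

          # Given a query embedding, return the training item whose
          # embedding has the highest cosine similarity. A model that
          # behaves this way on some inputs is effectively regurgitating
          # its nearest training example rather than generalizing.
          rng = np.random.default_rng(1)
          train_vecs = rng.normal(size=(1000, 64))  # stand-in embeddings
          train_texts = [f"training item {i}" for i in range(1000)]

          def nearest_neighbor(query):
              sims = train_vecs @ query
              sims /= np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query)
              return train_texts[int(np.argmax(sims))]

          print(nearest_neighbor(rng.normal(size=64)))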

  • (Score: 2, Interesting) by Anonymous Coward on Thursday July 13 2023, @02:11AM

    by Anonymous Coward on Thursday July 13 2023, @02:11AM (#1315842)

    Yeah, I'd be more convinced that stuff like this should be legal if, for example, Microsoft didn't train Copilot on GitHub code but on Microsoft's internal source code, e.g. Windows, Office, etc.

    Then Microsoft would be the one taking the risk of others accessing the Windows source code without Microsoft being able to claim copyright infringement...

    In contrast, as it stands it's the GPL'd code whose copyright could be infringed.

    See also: https://www.theregister.com/2023/06/09/github_copilot_lawsuit/ [theregister.com]