
SoylentNews is people

posted by requerdanos on Wednesday July 12 2023, @09:02AM   Printer-friendly
from the regurgitation dept.

https://arstechnica.com/information-technology/2023/07/book-authors-sue-openai-and-meta-over-text-used-to-train-ai/

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

[...] Authors claim that by utilizing "flagrantly illegal" data sets, OpenAI allegedly infringed copyrights of Silverman's book The Bedwetter, Golden's Ararat, and Kadrey's Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as "several" other titles from Golden and Kadrey.

[...] Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as "Meta plans to make the next version of LLaMA commercially available." In addition to other damages, the authors are asking for restitution of alleged profits lost.

"Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation," Saveri and Butterick wrote in their press release.


Original Submission

 
This discussion was created by requerdanos (5997) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 4, Interesting) by ElizabethGreene on Wednesday July 12 2023, @03:30PM (3 children)

    by ElizabethGreene (6748) Subscriber Badge on Wednesday July 12 2023, @03:30PM (#1315706) Journal

    I don't think this is how it works.

    Could you suggest a prompt that will reproduce a substantial portion of Silverman's work in the answer, or point to where in the open LLaMA model her work is present without attribution or copyright information? The model is a series of weights in hidden layers of a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

    Following the idea of copying biology, an LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book. I don't consider reading a book to be infringement, IIF Meta, Google, or whoever trained the model had a legal copy of the source book. If I memorized large portions of the content word-for-word and then reproduced them, that would be infringement, but the current LLMs struggle to do that.

    A valid counterpoint here would be that the LLM is creating derivative work, but the copyright office says a machine can't create a work. Only the human operating it can.
    Another valid counterpoint would be that the LLM *is* a derivative work, falling under the definition "such as a translation, [...], abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted." That's a reasonable assertion, and I think it's the one the court will decide on.
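
    The "series of weights in hidden layers" point above can be sketched concretely. This is a hypothetical, toy-scale illustration (the layer sizes, vocabulary, and function names are invented for the example, not taken from LLaMA): the trained artifact is just arrays of floating-point parameters that map an input token to a probability distribution over next tokens; no training text is stored verbatim inside it.

    ```python
    # Hypothetical sketch: a model is just learned weight arrays, not stored text.
    import math
    import random

    random.seed(0)

    VOCAB = 8    # toy vocabulary size (assumption for the example)
    HIDDEN = 4   # toy hidden-layer width (assumption for the example)

    # "Learned" parameters; here random stand-ins. Nothing in these arrays
    # is a copy of any training document.
    W1 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
    W2 = [[random.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(HIDDEN)]

    def forward(token_id):
        """Map one input token to a probability distribution over next tokens."""
        # A one-hot input vector times W1 just selects row `token_id`; apply ReLU.
        hidden = [max(0.0, W1[token_id][j]) for j in range(HIDDEN)]
        # Hidden activations times W2 give one logit per vocabulary entry.
        logits = [sum(hidden[j] * W2[j][k] for j in range(HIDDEN))
                  for k in range(VOCAB)]
        # Softmax turns logits into probabilities that sum to 1.
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    probs = forward(3)
    print(len(probs), min(probs) >= 0.0)
    ```

    Whether such a parameter soup counts as a "copy" or a "derivative work" of the training books is exactly the legal question the comment raises.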

  • (Score: 3, Interesting) by Anonymous Coward on Wednesday July 12 2023, @04:40PM

    by Anonymous Coward on Wednesday July 12 2023, @04:40PM (#1315721)

    [...] The model is a series of weights in hidden layers on a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

    Following the idea of copying biology, a LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book.

    I know that is how neural nets have been colloquially described since at least the '90s, but I don't think one can make a compelling case that that's how our brains work in general. These LLMs have only shown recent success because of the enormous number of connections and the unimaginable amount of data needed to train them, with perfect recall. Our brains seem to function much better using much less; so much so that there are those who feel these massive weighted-"neuron" brute-force approaches are not the right path, and that if/when AGI is achieved, it will be using much more modest hardware and software.

  • (Score: 3, Interesting) by Mykl on Wednesday July 12 2023, @11:22PM

    by Mykl (1112) on Wednesday July 12 2023, @11:22PM (#1315797)

    IIF Meta, Google, or whoever trained the model had a legal copy of the source book

    I think this is the main thrust of the lawsuit: it's doubtful that the trainers can provide a receipt for Silverman's book, or for the other million-plus books that have been fed into the machine. If they could show the receipts, then there wouldn't be a lawsuit.

    Assuming that the trainers have legal access to the source material, I would be fine with it spitting out small excerpts (fair use), summaries, or even getting the AI to write "in the style of".

  • (Score: 1) by shrewdsheep on Thursday July 13 2023, @12:53PM

    by shrewdsheep (5215) on Thursday July 13 2023, @12:53PM (#1315907)

    All the weights represent a transformation of the data. It is just a mapping. All references to biology are a vague analogy at this point (ReLU is not remotely what a neuron does). These models (transformations) have been shown to exhibit nearest-neighbor characteristics (i.e., outputting the nearest neighbor in the training data) in several cases, and we discussed the Google paper showing that some input images are almost completely stored in the network weights here (sorry, couldn't find the story). While I agree that literal reproduction is infrequent, the unpredictable nature of such behavior lays the burden of proof at the feet of the model, IMO.
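
    The nearest-neighbor behavior described above can be sketched as follows. This is a deliberately crude, hypothetical illustration (the similarity measure and function names are invented for the example; it is not the method from the cited Google paper): a "model" that, for some prompts, effectively returns its closest training example verbatim, which is the regurgitation scenario the comment worries about.

    ```python
    # Hypothetical sketch: a mapping that regurgitates its nearest training example.
    training_data = [
        "the quick brown fox jumps over the lazy dog",
        "she sells sea shells by the sea shore",
        "a journey of a thousand miles begins with a single step",
    ]

    def similarity(a, b):
        """Crude Jaccard word-overlap similarity between two strings."""
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)

    def generate(prompt):
        """Return the training example most similar to the prompt."""
        return max(training_data, key=lambda ex: similarity(prompt, ex))

    # A prompt close enough to one training sentence reproduces it verbatim.
    print(generate("quick brown fox"))
    ```

    A real LLM's mapping is vastly more complex, of course, but the point stands: when the output for some input collapses to a near-copy of a training item, the distinction between "transformation" and "reproduction" becomes an empirical question about that input.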