
posted by requerdanos on Wednesday July 12 2023, @09:02AM   Printer-friendly
from the regurgitation dept.

https://arstechnica.com/information-technology/2023/07/book-authors-sue-openai-and-meta-over-text-used-to-train-ai/

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

[...] Authors claim that by utilizing "flagrantly illegal" data sets, OpenAI allegedly infringed copyrights of Silverman's book The Bedwetter, Golden's Ararat, and Kadrey's Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as "several" other titles from Golden and Kadrey.

[...] Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as "Meta plans to make the next version of LLaMA commercially available." In addition to other damages, the authors are asking for restitution of alleged profits lost.

"Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation," Saveri and Butterick wrote in their press release.


Original Submission

 
This discussion was created by requerdanos (5997) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Interesting) by sigterm on Wednesday July 12 2023, @10:01AM (16 children)

    by sigterm (849) on Wednesday July 12 2023, @10:01AM (#1315672)

    I have some doubts about it being illegal to scrape material that's been published on the open Internet.

    But I say OpenAI and Meta should just remove this from their data sets. Sure, I will no longer be able to ask LLaMA or ChatGPT to do stuff like "rewrite the Declaration of Independence in the style of a painfully unfunny comedian," but I can live with that.

  • (Score: 4, Insightful) by canopic jug on Wednesday July 12 2023, @12:30PM (9 children)

    by canopic jug (3949) Subscriber Badge on Wednesday July 12 2023, @12:30PM (#1315685) Journal

    It goes far beyond just scraping. The LLMs make a local copy, strip attribution and copyright information, and then regurgitate the result as original work. It's plagiarism as a service, but set up so that one can blame "the algorithm" long enough to confuse technologically inept legislators, lawyers, jurors, and judges.

    --
    Money is not free speech. Elections should not be auctions.
    • (Score: 5, Interesting) by looorg on Wednesday July 12 2023, @02:03PM (2 children)

      by looorg (578) on Wednesday July 12 2023, @02:03PM (#1315692)

      That is in part what is so baffling about it all. Most, if not all, of the people involved in these LLM/AI projects come from the academic world, where referencing is everything. Is anyone really working on this who doesn't have at least a Masters degree or a PhD? They might be used to trying to do things on the cheap, but you never skimp on referencing. After all, being caught out as a plagiarist is basically a career-ending offense in some fields.

      • (Score: 3, Touché) by Freeman on Wednesday July 12 2023, @04:14PM (1 child)

        by Freeman (732) on Wednesday July 12 2023, @04:14PM (#1315718) Journal

        Nothing to be baffled about: $.$ vision.

        --
        Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
        • (Score: 3, Insightful) by DeathMonkey on Wednesday July 12 2023, @07:16PM

          by DeathMonkey (1380) on Wednesday July 12 2023, @07:16PM (#1315746) Journal

          Also, the people building the engine and the people scraping the web into an engine are different people.

    • (Score: 0) by Anonymous Coward on Wednesday July 12 2023, @02:25PM

      by Anonymous Coward on Wednesday July 12 2023, @02:25PM (#1315694)

      Eh, just slap a "for personal use only: not for reselling or relicensing, or other professional use" at the bottom of the ChatGPT output page.

    • (Score: 4, Interesting) by ElizabethGreene on Wednesday July 12 2023, @03:30PM (3 children)

      by ElizabethGreene (6748) Subscriber Badge on Wednesday July 12 2023, @03:30PM (#1315706) Journal

      I don't think this is how it works.

      Could you suggest a prompt that will reproduce a substantial portion of Silverman's work in the answer, or, in the open LLaMA model, point to where their work is present without attribution or copyright information? The model is a series of weights in hidden layers of a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

      Following the idea of copying biology, an LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book. I don't consider reading a book to be infringement IIF Meta, Google, or whoever trained the model had a legal copy of the source book. If I memorized large portions of the content word-for-word and then reproduced them, that would be an infringement, but the current LLMs struggle to do that.

      A valid counterpoint here would be that the LLM is creating derivative works, but the copyright office says a machine can't create a work; only the human operating it can.
      Another valid counterpoint would be that the LLM *is* a derivative work, falling under the definition "such as a translation, [...], abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted." That's a reasonable assertion, and I think it is the one the court will decide.
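      The "series of weights in hidden layers" point above can be sketched concretely. This is a hypothetical toy network, nothing like any real LLM's architecture: the entire "model" is a few matrices of floats, and none of the training text is stored in it verbatim.

      ```python
      # Toy illustration: a "model" is nothing but layers of numeric weights.
      # Training text is never stored verbatim; it only shapes these numbers.
      # (Hypothetical miniature network; real LLMs hold billions of weights.)
      import math
      import random

      random.seed(0)

      def make_layer(n_in, n_out):
          """A layer is just a matrix of floats, the end product of training."""
          return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

      def forward(layers, x):
          """Run the input through each weight matrix with a nonlinearity."""
          for w in layers:
              x = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
          return x

      # Two layers of weights -- the entire "knowledge" of this toy model.
      model = [make_layer(4, 8), make_layer(8, 3)]
      output = forward(model, [0.1, 0.2, 0.3, 0.4])
      print(len(output))  # 3 activations out; no input text is stored anywhere
      ```

      Inspecting `model` turns up only floats, which is the commenter's point: the infringement question is about what those floats can be made to emit, not about finding the book inside them.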

      • (Score: 3, Interesting) by Anonymous Coward on Wednesday July 12 2023, @04:40PM

        by Anonymous Coward on Wednesday July 12 2023, @04:40PM (#1315721)

        [...] The model is a series of weights in hidden layers on a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

        Following the idea of copying biology, a LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book.

        I know that is how neural nets have been colloquially described since at least the 90s, but I don't think one can make a compelling case that that's how our brains work in general. These LLMs have only shown recent success because of the enormous number of connections and the unimaginable amount of data needed to train them, with perfect recall. Our brains seem to function much better using much less, so much so that there are those who feel these massive weighted-"neuron" brute-force approaches are not the right path, and that if/when AGI is achieved, it will be using much more modest hardware and software.

      • (Score: 3, Interesting) by Mykl on Wednesday July 12 2023, @11:22PM

        by Mykl (1112) on Wednesday July 12 2023, @11:22PM (#1315797)

        IIF Meta, Google, or whoever trained the model had a legal copy of the source book

        I think this is the main thrust of the lawsuit: it's doubtful that the trainers can produce a receipt for Silverman's book, or for the million-plus other books that have been fed into the machine. If they could show the receipts, there wouldn't be a lawsuit.

        Assuming that the trainers have legal access to the source material, I would be fine with it spitting out small excerpts (fair use), summaries, or even getting the AI to write "in the style of".

      • (Score: 1) by shrewdsheep on Thursday July 13 2023, @12:53PM

        by shrewdsheep (5215) on Thursday July 13 2023, @12:53PM (#1315907)

        All the weights represent a transformation of the data; it is just a mapping. All references to biology are but a vague analogy at this point (ReLU is not remotely what a neuron does). These models (transformations) have been shown to exhibit nearest-neighbor characteristics (i.e., outputting the nearest neighbor in the training data) in several cases, and we discussed the Google paper showing that some input images are almost completely stored in the network weights here (sorry, couldn't find the story). While I agree that literal reproduction is infrequent, the unpredictable nature of such behavior lays the burden of proof at the feet of the model, IMO.
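        The nearest-neighbor behavior described above can be caricatured with a toy sketch. This is deliberately not how a real LLM works; it is a hypothetical, maximally overfit "model" whose answer to any prompt is the closest item from its training data, i.e. near-verbatim reproduction:

        ```python
        # Toy caricature of nearest-neighbor recall: a fully memorizing "model"
        # answers every prompt with the closest training sample, reproducing it
        # verbatim. (Illustrative only; real LLMs are not literal nearest-neighbor
        # lookups, but have been observed to behave this way on some inputs.)

        def distance(a, b):
            """Crude dissimilarity: count positions where the strings differ."""
            n = max(len(a), len(b))
            return sum(1 for i in range(n) if a[i:i+1] != b[i:i+1])

        TRAINING_DATA = [
            "the quick brown fox jumps over the lazy dog",
            "call me ishmael",
            "it was a dark and stormy night",
        ]

        def overfit_model(prompt):
            """Return the training sample nearest the prompt -- pure memorization."""
            return min(TRAINING_DATA, key=lambda s: distance(prompt, s))

        print(overfit_model("call me ishmail"))  # prints "call me ishmael"
        ```

        The worry in the comment is exactly that a trained network sits somewhere on the spectrum between this lookup table and a genuine generalizer, and one cannot predict from the outside which behavior a given prompt will trigger.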

    • (Score: 2, Interesting) by Anonymous Coward on Thursday July 13 2023, @02:11AM

      by Anonymous Coward on Thursday July 13 2023, @02:11AM (#1315842)

      Yeah, I'd be more convinced stuff like this should be legal if, for example, Microsoft had trained Copilot not on GitHub code but on Microsoft's internal source code, e.g. Windows, Office, etc.

      Then Microsoft would be the one taking the risk of others accessing the Windows source code without Microsoft being able to claim copyright infringement...

      As it stands, the GPL'd code is the copyright that could be infringed.

      See also: https://www.theregister.com/2023/06/09/github_copilot_lawsuit/ [theregister.com]

  • (Score: 2) by Thexalon on Wednesday July 12 2023, @05:05PM (2 children)

    by Thexalon (636) on Wednesday July 12 2023, @05:05PM (#1315726)

    It's not a crime to scrape data. It is potentially a copyright violation, and thus a civil tort, depending in part on what you do with it afterward. Especially since a lot of websites have a copyright notice somewhere on the page, which almost certainly got ignored by the scraping bots.

    I'm no lawyer, but this sure seems like the kind of case that was guaranteed to happen eventually. And I could also imagine such a case being settled if the so-called-AI companies set up some sort of system for giving the creators of their source material a portion of whatever proceeds they get from what they create based on that source material (which could well be a "derivative work" under copyright law).

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 3, Informative) by DeathMonkey on Wednesday July 12 2023, @07:22PM

      by DeathMonkey (1380) on Wednesday July 12 2023, @07:22PM (#1315749) Journal

      In this case they're getting sued so it's civil already.

      However, criminal copyright statutes exist as well so civil lawsuits are definitely not the only remedy. (for better or worse...)

    • (Score: 0) by Anonymous Coward on Thursday July 13 2023, @03:11AM

      by Anonymous Coward on Thursday July 13 2023, @03:11AM (#1315856)

      It is potentially a copyright violation and thus a civil tort, depending in part on what you do with it afterwords.

      Yeah, Microsoft seems fine with scraping GPL code and using it for Copilot, but are they using the Windows, MS Office, etc. source code for Copilot?

  • (Score: 4, Interesting) by mcgrew on Thursday July 13 2023, @01:04AM (2 children)

    by mcgrew (701) <publish@mcgrewbooks.com> on Thursday July 13 2023, @01:04AM (#1315823) Homepage Journal

    I have some doubts

    Do you? Are you lost, little one? Google is your friend, evil as it is. Publishing on the open internet does NOT invalidate a copyright. Where did you come up with such a ridiculous idea?

    And FYI, the Declaration of Independence or the Constitution are NOT covered under copyright. Or recipes, dance, or clothing patterns. Educate yourself before you attempt to educate others.

    --
    mcgrewbooks.com mcgrew.info nooze.org
    • (Score: 2) by sigterm on Sunday July 16 2023, @06:49PM (1 child)

      by sigterm (849) on Sunday July 16 2023, @06:49PM (#1316364)

      Do you? Are you lost, little one? Google is your friend, evil as it is. Publishing on the open internet does NOT invalidate a copyright. Where did you come up with such a ridiculous idea?

      Where did you get the idea that I was arguing against copyright law? I'm not.

      Leaving aside the obvious copyright infringement that is wholesale reproduction of unaltered content, which is not being argued here: there is such a thing as "fair use." Unless ChatGPT is using the material in a non-novel way, and/or creating a (derivative) product that takes market share from the original content, the plaintiffs will have a hard time arguing that their copyright is being violated.

      Plaintiff: "Your Honor, the defendant read the content we distributed on the Internet, and is now creating derivative works that only vaguely resemble the original!" Defendant: "Yes, we are." Judge: "That's perfectly allowed. Next case!"

      • (Score: 2) by mcgrew on Saturday July 22 2023, @09:04PM

        by mcgrew (701) <publish@mcgrewbooks.com> on Saturday July 22 2023, @09:04PM (#1317280) Homepage Journal

        How much, and to what purpose? Even if it's a single sentence, if the author isn't credited, it's plagiarism. Fair use credits the original author. If he copies five paragraphs and prefixes them with "a passage from [name of work]:", or sets them off indented with a footnote after, that's fair use. A sentence without credit is plagiarism, period. If the computer credits all those it copies with what it has copied, it may just be kosher. But I wouldn't bet on it.

        --
        mcgrewbooks.com mcgrew.info nooze.org