
posted by requerdanos on Wednesday July 12 2023, @09:02AM   Printer-friendly
from the regurgitation dept.

https://arstechnica.com/information-technology/2023/07/book-authors-sue-openai-and-meta-over-text-used-to-train-ai/

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

[...] Authors claim that by utilizing "flagrantly illegal" data sets, OpenAI allegedly infringed copyrights of Silverman's book The Bedwetter, Golden's Ararat, and Kadrey's Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as "several" other titles from Golden and Kadrey.

[...] Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as "Meta plans to make the next version of LLaMA commercially available." In addition to other damages, the authors are asking for restitution of alleged profits lost.

"Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation," Saveri and Butterick wrote in their press release.


Original Submission

 
This discussion was created by requerdanos (5997) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 3, Interesting) by sigterm on Wednesday July 12 2023, @10:01AM (16 children)

    by sigterm (849) on Wednesday July 12 2023, @10:01AM (#1315672)

    I have some doubts about it being illegal to scrape material that's been published on the open Internet.

    But I say OpenAI and Meta should just remove this from their data sets. Sure, I will no longer be able to ask LLaMA or ChatGPT to do stuff like "rewrite the Declaration of Independence in the style of a painfully unfunny comedian," but I can live with that.

  • (Score: 4, Insightful) by canopic jug on Wednesday July 12 2023, @12:30PM (9 children)

    by canopic jug (3949) Subscriber Badge on Wednesday July 12 2023, @12:30PM (#1315685) Journal

    It goes far beyond just scraping. The LLMs make a local copy, strip attribution and copyright information, and then regurgitate the result as original work. It's plagiarism as a service, but set up so that one can blame "the algorithm" long enough to confuse technologically inept legislators, lawyers, jurors, and judges.

    --
    Money is not free speech. Elections should not be auctions.
    • (Score: 5, Interesting) by looorg on Wednesday July 12 2023, @02:03PM (2 children)

      by looorg (578) on Wednesday July 12 2023, @02:03PM (#1315692)

      That is in part what is so baffling about it all. Most, if not all, of the people involved in these LLM/AI projects come from the academic world, where referencing is everything. Is anyone really working on this who doesn't have at least a Masters degree or a PhD? They might be used to trying to do things on the cheap, but you never skimp on referencing. After all, being caught out as a plagiarist is basically a career-ending offense in some fields.

      • (Score: 3, Touché) by Freeman on Wednesday July 12 2023, @04:14PM (1 child)

        by Freeman (732) on Wednesday July 12 2023, @04:14PM (#1315718) Journal

        Nothing to be baffled about: $.$ vision.

        --
        Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
        • (Score: 3, Insightful) by DeathMonkey on Wednesday July 12 2023, @07:16PM

          by DeathMonkey (1380) on Wednesday July 12 2023, @07:16PM (#1315746) Journal

          Also, the people building the engine and the people scraping the web into an engine are different people.

    • (Score: 0) by Anonymous Coward on Wednesday July 12 2023, @02:25PM

      by Anonymous Coward on Wednesday July 12 2023, @02:25PM (#1315694)

      Eh, just slap a "for personal use only: not for reselling or relicensing, or other professional use" at the bottom of the ChatGPT output page.

    • (Score: 4, Interesting) by ElizabethGreene on Wednesday July 12 2023, @03:30PM (3 children)

      by ElizabethGreene (6748) Subscriber Badge on Wednesday July 12 2023, @03:30PM (#1315706) Journal

      I don't think this is how it works.

      Could you suggest a prompt that will reproduce a substantial portion of Silverman's work in the answer, or, in the open LLaMA model, point to where their work is present without attribution or copyright information? The model is a series of weights in hidden layers of a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

      Following the idea of copying biology, an LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book. I don't consider reading a book to be infringement IIF Meta, Google, or whoever trained the model had a legal copy of the source book. If I memorized large portions of the content word-for-word and then reproduced them, that would be an infringement, but the current LLMs struggle to do that.

      A valid counterpoint here would be that the LLM is creating derivative works, but the copyright office says a machine can't create a work; only the human operating it can.
      Another valid counterpoint would be that the LLM *is* a derivative work, falling under the definition "such as a translation, [...], abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted." That's a reasonable assertion, and I think it is the one the court will decide.
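      The "series of weights in hidden layers" point above can be sketched concretely. This is a hypothetical toy network, nothing like any real LLM's architecture: the entire "model" is a few matrices of floats, and none of the training text is stored in it verbatim.

      ```python
      # Toy illustration: a "model" is nothing but layers of numeric weights.
      # Training text is never stored verbatim; it only shapes these numbers.
      # (Hypothetical miniature network; real LLMs hold billions of weights.)
      import math
      import random

      random.seed(0)

      def make_layer(n_in, n_out):
          """A layer is just a matrix of floats, the end product of training."""
          return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

      def forward(layers, x):
          """Run the input through each weight matrix with a nonlinearity."""
          for w in layers:
              x = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
          return x

      # Two layers of weights -- the entire "knowledge" of this toy model.
      model = [make_layer(4, 8), make_layer(8, 3)]
      output = forward(model, [0.1, 0.2, 0.3, 0.4])
      print(len(output))  # 3 activations out; no input text is stored anywhere
      ```

      Inspecting `model` turns up only floats, which is the commenter's point: the infringement question is about what those floats can be made to emit, not about finding the book inside them.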

      • (Score: 3, Interesting) by Anonymous Coward on Wednesday July 12 2023, @04:40PM

        by Anonymous Coward on Wednesday July 12 2023, @04:40PM (#1315721)

        [...] The model is a series of weights in hidden layers on a neural net, not terribly dissimilar to the varied strength of connections between neurons in our own noggins.

        Following the idea of copying biology, a LLM learning a book by reading it, aka training on it, is very similar to when you or I read a book.

        I know that is how neural nets have been colloquially described since at least the 90s, but I don't think one can make a compelling case that that's how our brains work in general. These LLMs have only shown recent success because of the enormous number of connections and the unimaginable amount of data needed to train them, with perfect recall. Our brains seem to function much better using much less, so much so that there are those who feel these massive weighted-"neuron" brute-force approaches are not the right path, and that if/when AGI is achieved, it will be using much more modest hardware and software.

      • (Score: 3, Interesting) by Mykl on Wednesday July 12 2023, @11:22PM

        by Mykl (1112) on Wednesday July 12 2023, @11:22PM (#1315797)

        IIF Meta, Google, or whoever trained the model had a legal copy of the source book

        I think this is the main thrust of the lawsuit: it's doubtful that the trainers can produce a receipt for Silverman's book, or for the million-plus other books that have been fed into the machine. If they could show the receipts, there wouldn't be a lawsuit.

        Assuming that the trainers have legal access to the source material, I would be fine with it spitting out small excerpts (fair use), summaries, or even getting the AI to write "in the style of".

      • (Score: 1) by shrewdsheep on Thursday July 13 2023, @12:53PM

        by shrewdsheep (5215) on Thursday July 13 2023, @12:53PM (#1315907)

        All the weights represent a transformation of the data; it is just a mapping. All references to biology are but a vague analogy at this point (ReLU is not remotely what a neuron does). These models (transformations) have been shown to exhibit nearest-neighbor characteristics (i.e., outputting the nearest neighbor in the training data) in several cases, and we discussed the Google paper showing that some input images are almost completely stored in the network weights here (sorry, couldn't find the story). While I agree that literal reproduction is infrequent, the unpredictable nature of such behavior lays the burden of proof at the feet of the model, IMO.
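        The nearest-neighbor behavior described above can be caricatured with a toy sketch. This is deliberately not how a real LLM works; it is a hypothetical, maximally overfit "model" whose answer to any prompt is the closest item from its training data, i.e. near-verbatim reproduction:

        ```python
        # Toy caricature of nearest-neighbor recall: a fully memorizing "model"
        # answers every prompt with the closest training sample, reproducing it
        # verbatim. (Illustrative only; real LLMs are not literal nearest-neighbor
        # lookups, but have been observed to behave this way on some inputs.)

        def distance(a, b):
            """Crude dissimilarity: count positions where the strings differ."""
            n = max(len(a), len(b))
            return sum(1 for i in range(n) if a[i:i+1] != b[i:i+1])

        TRAINING_DATA = [
            "the quick brown fox jumps over the lazy dog",
            "call me ishmael",
            "it was a dark and stormy night",
        ]

        def overfit_model(prompt):
            """Return the training sample nearest the prompt -- pure memorization."""
            return min(TRAINING_DATA, key=lambda s: distance(prompt, s))

        print(overfit_model("call me ishmail"))  # prints "call me ishmael"
        ```

        The worry in the comment is exactly that a trained network sits somewhere on the spectrum between this lookup table and a genuine generalizer, and one cannot predict from the outside which behavior a given prompt will trigger.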

    • (Score: 2, Interesting) by Anonymous Coward on Thursday July 13 2023, @02:11AM

      by Anonymous Coward on Thursday July 13 2023, @02:11AM (#1315842)

      Yeah, I'd be more convinced stuff like this should be legal if, for example, Microsoft had trained Copilot not on GitHub code but on Microsoft's internal source code, e.g. Windows, Office, etc.

      Then Microsoft would be the one taking the risk of others accessing the Windows source code without Microsoft being able to claim copyright infringement...

      As it stands, the GPL'd code is the copyright that could be infringed.

      See also: https://www.theregister.com/2023/06/09/github_copilot_lawsuit/ [theregister.com]

  • (Score: 2) by Thexalon on Wednesday July 12 2023, @05:05PM (2 children)

    by Thexalon (636) on Wednesday July 12 2023, @05:05PM (#1315726)

    It's not a crime to scrape data. It is potentially a copyright violation, and thus a civil tort, depending in part on what you do with it afterward. Especially since a lot of websites have a copyright notice somewhere on the page, which almost certainly got ignored by the scraping bots.

    I'm no lawyer, but this sure seems like the kind of case that was guaranteed to happen eventually. And I could also imagine such a case being settled if the so-called-AI companies set up some sort of system for giving the creators of their source material a portion of whatever proceeds they get from what they create based on that source material (which could well be a "derivative work" under copyright law).

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 3, Informative) by DeathMonkey on Wednesday July 12 2023, @07:22PM

      by DeathMonkey (1380) on Wednesday July 12 2023, @07:22PM (#1315749) Journal

      In this case they're getting sued so it's civil already.

      However, criminal copyright statutes exist as well so civil lawsuits are definitely not the only remedy. (for better or worse...)

    • (Score: 0) by Anonymous Coward on Thursday July 13 2023, @03:11AM

      by Anonymous Coward on Thursday July 13 2023, @03:11AM (#1315856)

      It is potentially a copyright violation and thus a civil tort, depending in part on what you do with it afterwords.

      Yeah, Microsoft seems fine with scraping GPL code and using it for Copilot, but are they using the Windows, MS Office, etc. source code for Copilot?

  • (Score: 4, Interesting) by mcgrew on Thursday July 13 2023, @01:04AM (2 children)

    by mcgrew (701) <publish@mcgrewbooks.com> on Thursday July 13 2023, @01:04AM (#1315823) Homepage Journal

    I have some doubts

    Do you? Are you lost, little one? Google is your friend, evil as it is. Publishing on the open internet does NOT invalidate a copyright. Where did you come up with such a ridiculous idea?

    And FYI, the Declaration of Independence or the Constitution are NOT covered under copyright. Or recipes, dance, or clothing patterns. Educate yourself before you attempt to educate others.

    --
    mcgrewbooks.com mcgrew.info nooze.org
    • (Score: 2) by sigterm on Sunday July 16 2023, @06:49PM (1 child)

      by sigterm (849) on Sunday July 16 2023, @06:49PM (#1316364)

      Do you? Are you lost, little one? Google is your friend, evil as it is. Publishing on the open internet does NOT invalidate a copyright. Where did you come up with such a ridiculous idea?

      Where did you get the idea that I was arguing against copyright law? I'm not.

      Leaving aside the obvious copyright infringement that is wholesale reproduction of unaltered content, which is not being argued here: there is such a thing as "fair use." Unless ChatGPT is using the material in a non-novel way, and/or creating a (derivative) product that takes market share from the original content, the plaintiffs will have a hard time arguing that their copyright is being violated.

      Plaintiff: "Your Honor, the defendant read the content we distributed on the Internet, and is now creating derivative works that only vaguely resemble the original!" Defendant: "Yes, we are." Judge: "That's perfectly allowed. Next case!"

      • (Score: 2) by mcgrew on Saturday July 22 2023, @09:04PM

        by mcgrew (701) <publish@mcgrewbooks.com> on Saturday July 22 2023, @09:04PM (#1317280) Homepage Journal

        How much, and to what purpose? Even if it's a single sentence, if the author isn't credited, it's plagiarism. Fair use credits the original author. If he copies five paragraphs and prefixes them with "a passage from [name of work]:", or sets them off indented with a footnote after, that's fair use. A sentence without credit is plagiarism, period. If the computer credits all those it copies with what it has copied, it may just be kosher. But I wouldn't bet on it.

        --
        mcgrewbooks.com mcgrew.info nooze.org