
posted by martyb on Friday December 29 2023, @03:13PM

New York Times Sues Microsoft, ChatGPT Maker OpenAI Over Copyright Infringement

The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind popular AI chatbot ChatGPT, accusing the companies of creating a business model based on "mass copyright infringement," stating their AI systems "exploit and, in many cases, retain large portions of the copyrightable expression contained in those works:"

Microsoft both invests in and supplies OpenAI, providing it with access to the Redmond, Washington, giant's Azure cloud computing technology.

The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the "billions of dollars in statutory and actual damages" it believes it is owed for the "unlawful copying and use of The Times's uniquely valuable works."

[...] The Times said in an emailed statement that it "recognizes the power and potential of GenAI for the public and for journalism," but added that journalistic material should be used for commercial gain only with permission from the original source.

"These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise," the Times said.

"Settled copyright law protects our journalism and content. If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so."

[...] OpenAI has tried to allay news publishers' concerns. In December, the company announced a partnership with Axel Springer — the parent company of Business Insider, Politico, and European outlets Bild and Welt — which would license its content to OpenAI in return for a fee.

Also at CNBC and The Guardian.

Previously:

NY Times Sues OpenAI, Microsoft Over Copyright Infringement

NY Times sues OpenAI, Microsoft over copyright infringement:

In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, four months after the company was reportedly considering suing, the suit has now been filed.

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

Journalism is expensive

The suit notes that The Times maintains a large staff that allows it to dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.

All of that costs money, and The Times earns it by limiting access to its reporting through a robust paywall. In addition, each print edition carries a copyright notification, the Times' terms of service limit the copying and use of any published material, and the paper can be selective about how it licenses its stories. Beyond driving revenue, these restrictions also help it maintain its reputation as an authoritative voice by controlling how its works appear.

The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times's permission or authorization, Defendants' tools undermine and damage The Times's relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.

Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains information from 16 million unique records from sites published by The Times. That places The Times as the third most referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details of the data used for training of recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. [...] Expect access to training information to be a major issue during discovery if this case moves forward.

Not just training

A number of suits have been filed regarding the use of copyrighted material during the training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants' GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.


Original Submission #1 | Original Submission #2 | Original Submission #3

 
  • (Score: 2) by HiThere on Friday December 29 2023, @07:21PM (10 children)

    by HiThere (866) on Friday December 29 2023, @07:21PM (#1338272) Journal

    OK, but it still implies that ChatGPT memorized the articles in some sense. That's definitely making a copy, just like you do when you reread a poem several times, or a favorite author. (A friend knew someone who could essentially recite the Lord of the Rings. I can do pieces of it, largely poems.)

    So the problem is that if NYT wins the case, the next target may be remembering stuff. I don't really see a clear demarcation.

    --
    Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
  • (Score: 2) by mcgrew on Saturday December 30 2023, @12:27AM (5 children)

    by mcgrew (701) <publish@mcgrewbooks.com> on Saturday December 30 2023, @12:27AM (#1338298) Homepage Journal

    Computers don't "think" any more than a printed book thinks. Stored information is not thought, except that it's the thought of the original thinker who wrote it down or typed it into a computer.

    I should write an article about copyright, since so few seem to have any idea about it, thanks to the MAFIAA. Copyright is NOT about copying or storing data, it's about publishing it. And it's not automatic in the US thanks to (I think, information is hard for me to find) a lawsuit, as Bowker (the ISBN people) informed me.

    If I put my book on the internet, I have published it. Copyright gives me a "limited time*" monopoly on publication. It has to be registered and costs sixty bucks to register in the US. Recording that Metallica album and giving a copy to your friend is perfectly legal, no matter what that greedy asshole Lars Ulrich thinks.

    * The Bono Act gives me a "limited time" monopoly of ninety-five years longer than my life. I don't see how I'm going to be enticed to write any more books after I'm dead. SCOTUS ruled against common sense and logic, ruling that "limited" means whatever Congress says it means.

    --
    It is a disgrace that the richest nation in the world has hunger and homelessness.
    • (Score: 2) by HiThere on Saturday December 30 2023, @01:40AM (4 children)

      by HiThere (866) on Saturday December 30 2023, @01:40AM (#1338305) Journal

      That's an assertion I've heard before, but I've never seen any good proof of it.
      Actually, proof is slightly the wrong word. What's missing is a definition of "thought" that includes what people do, doesn't include what computers do, and doesn't depend on handling them as a special case. The first version of that assertion that I heard was that computers would never play good chess because they can't think. The assertion that they couldn't play good chess was already false at the time, though they weren't up to expert level.

      So give me your explicit definition and perhaps I'll accept that, by your usage, computers can't think. Otherwise I'll just remember the old saying in AI that "intelligence is whatever we haven't managed to do yet".

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 3, Informative) by mcgrew on Saturday December 30 2023, @02:27AM (3 children)

        by mcgrew (701) <publish@mcgrewbooks.com> on Saturday December 30 2023, @02:27AM (#1338310) Homepage Journal

        I put it succinctly in the story Sentience [mcgrewbooks.com]. It's written in the first person; the narrator is a sentient computer.

        My view that there will never be a Turing architecture sentient computer comes mostly from the fact that I've studied the schematics of computer components like the ALU (Arithmetic Logic Unit) and written a two-player battle tanks game in Z-80 machine code. A computer is no smarter than a printed book.

        Now, replicants, like those in RUR [mcgrewbooks.com], with history's first use of the word "robot", or Do Androids Dream of Electric Sheep? ("Blade Runner"), may and probably will be sentient.

        --
        It is a disgrace that the richest nation in the world has hunger and homelessness.
        • (Score: 2) by deimtee on Sunday December 31 2023, @02:48AM (2 children)

          by deimtee (3272) on Sunday December 31 2023, @02:48AM (#1338422) Journal

          there will never be a Turing architecture sentient computer

          You are showing an organic bias. There is nothing the cells in the brain do that can't be done on a computer. We just haven't written a program that complicated yet. (Well, publicly at least, I don't know what the TLA's have.)

          Reductio ad absurdum:
          We can write a program to simulate a neuron. We can write a program to simulate an axon. We can design a message passing algorithm that simulates the interconnections. We can design self-modifying programs that mimic the changes in neurons and axons as they are used.
          We can freeze a brain and examine it neuron by neuron and reproduce the neurons and interconnections in it in silicon and programming. It would take a massive effort and a huge amount of computer power but when you turned that program/machine on it would produce the same output as the brain that was scanned.
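          For what it's worth, here is a minimal sketch in Python of the kind of single-neuron simulation described above, using a leaky integrate-and-fire model. The function name and every constant are made up for illustration, not taken from any real brain-simulation project:

          def simulate_lif_neuron(input_current, dt=0.001, tau=0.02,
                                  v_rest=-0.065, v_thresh=-0.050, v_reset=-0.065):
              """Leaky integrate-and-fire: integrate a membrane potential over time,
              record a spike (the 'message' sent down the axon) whenever it crosses
              threshold, then reset."""
              v = v_rest
              spike_times = []
              for step, i_in in enumerate(input_current):
                  # Leak back toward the resting potential, pushed by the input current.
                  v += dt * ((v_rest - v) / tau + i_in)
                  if v >= v_thresh:
                      spike_times.append(step * dt)
                      v = v_reset
              return spike_times

          # A constant drive above threshold produces a regular spike train.
          print(simulate_lif_neuron([1.0] * 1000))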

          A computer is no smarter than a printed book.

          The main difference is that a book is static. A computer can execute code and change the stored information. The glider gun in the Life program demonstrates that even a very simple system can have unlimited growth (see the sketch below). It's not intelligent, but neither is a bacterium. You have to build up to intelligence. As far as I know, the simulationists have got as far as a small worm with a few neurons. I think there is a group working on simulating a fly's brain.
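          As a concrete illustration, a minimal Python sketch of the Life update rule that the glider gun runs on; the five-cell pattern below is the standard glider, which shifts itself one cell diagonally every four generations:

          from collections import Counter

          def life_step(live):
              """One Game of Life generation; 'live' is a set of (x, y) cells."""
              # For every cell adjacent to a live cell, count its live neighbours.
              counts = Counter((x + dx, y + dy)
                               for (x, y) in live
                               for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                               if (dx, dy) != (0, 0))
              # Birth on exactly three neighbours; survival on two or three.
              return {cell for cell, n in counts.items()
                      if n == 3 or (n == 2 and cell in live)}

          glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
          state = glider
          for _ in range(4):
              state = life_step(state)
          # Same five-cell shape, shifted by (1, 1): very simple rules, open-ended behaviour.
          print(sorted(state))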

          --
          One job constant is that good employers have low turnover, so opportunities to join good employers are relatively rare.
          • (Score: 1, Insightful) by Anonymous Coward on Monday January 01 2024, @02:09AM

            by Anonymous Coward on Monday January 01 2024, @02:09AM (#1338536)

            late to the party, but in case anyone is still reading..

            > but when you turned that program/machine on it would produce the same output as the brain that was scanned.

            Um, yes. But don't forget GIGO. The inputs to the human brain are also somehow encoded, and not much of this is understood yet either. How many processing layers are in the eye before any signals are sent down the optic nerve? The same applies to I/O with all the other organs, both internal and near the skin. Without all the I/O, a synthetic brain isn't going to be useful.

            Back to the drawing board.

          • (Score: 2) by mcgrew on Sunday January 07 2024, @06:51PM

            by mcgrew (701) <publish@mcgrewbooks.com> on Sunday January 07 2024, @06:51PM (#1339489) Homepage Journal

            There is nothing the cells in the brain do that can't be done on a computer.

            Fractions. Divide one by three on a computer. Making anything actually original.
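            For anyone who wants to see what that looks like in practice, a small Python illustration of the fractions point (offered purely as an illustration, not as a claim either way about thinking): binary floating point can only approximate one third, while an exact rational type keeps the fraction as a fraction.

            from fractions import Fraction

            print(1 / 3)               # 0.3333333333333333, a rounded binary approximation
            print(Fraction(1, 3))      # 1/3, stored exactly as a pair of integers
            print(Fraction(1, 3) * 3)  # 1, exactly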

            --
            It is a disgrace that the richest nation in the world has hunger and homelessness.
  • (Score: 0) by Anonymous Coward on Saturday December 30 2023, @04:13PM

    by Anonymous Coward on Saturday December 30 2023, @04:13PM (#1338353)
    There's no problem with humans remembering stuff verbatim. It's when they produce copies of that stuff that they may infringe on copyright (depending on the law).

    And humans are responsible for any copyright infringement they make. So some people won't be producing infringing copies of that stuff even if they have good enough memory to do so.

    If Microsoft provided proof that they trained their AI on their own source code (Windows, Microsoft Office, etc.) AND then publicly guaranteed that the output of their AI can be used without any copyright issues, especially guaranteeing that any output won't be infringing on Microsoft's copyright, then sure, I might start having a bit more confidence that they're not infringing. And if it happens to output useful Win32 stuff that WINE and ReactOS can now use legally, well, too bad for Microsoft.

    But instead they train their AI on OTHER people's copyrighted stuff and say that they are not infringing "because AI". To me that's laundering copyright infringement (e.g. GPLed stuff): https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data

    As the "poem" exploit confirms, these types of AIs have/produce infringing copies of stuff.

    Some idiots argue it's not infringement because the actual stored data doesn't look like the copyrighted stuff and is a lot smaller. If that's a good enough excuse, then if I convert a copyrighted Blu-ray to HEVC, I won't be infringing, since the data stored and distributed is now very different and a lot smaller. And yes, it's provably lossy too - in many cases the output is not 100% the same. But nope, it's still considered infringement.
  • (Score: 2) by maxwell demon on Sunday December 31 2023, @11:47AM (2 children)

    by maxwell demon (1608) on Sunday December 31 2023, @11:47AM (#1338460) Journal

    If I reproduce large chunks of an article from memory and give them to whoever wants them, I'm already violating copyright. It doesn't matter that I first memorized the text and then wrote it down on request instead of writing it down as I read it.

    --
    The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 2) by HiThere on Sunday December 31 2023, @05:31PM (1 child)

      by HiThere (866) on Sunday December 31 2023, @05:31PM (#1338496) Journal

      So singing a song is a violation of copyright. Somehow I didn't think copyright law was quite that stupid.

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 0) by Anonymous Coward on Monday January 01 2024, @02:11AM

        by Anonymous Coward on Monday January 01 2024, @02:11AM (#1338539)

        old news, see https://support.easysong.com/hc/en-us/articles/360047682433-What-is-a-Public-Performance-License- [easysong.com]

        A public performance license is an agreement between a music user and the owner of a copyrighted composition (song), that grants permission to play the song in public, online, or on radio. This permission is also called public performance rights, performance rights, and performing rights.

        How Do I Get a Public Performance License?

        In most cases, public performance rights should be handled by the institutions, businesses, venues, and radio stations that present the music. Small indie artists, educators, and DJs often don't need to secure public performance rights for private events and most rights organizations do not license to individuals. Also, most web and terrestrial radio stations handle their own public performance licensing, so playing live public radio at your venue is usually fine. If you are unsure about your specific scenario, you should ask the venue or contact a performing rights organization such as ASCAP, BMI, or SESAC for details.