SoylentNews is people

posted by martyb on Friday December 29 2023, @03:13PM

New York Times Sues Microsoft, ChatGPT Maker OpenAI Over Copyright Infringement

The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind popular AI chatbot ChatGPT, accusing the companies of creating a business model based on "mass copyright infringement," stating their AI systems "exploit and, in many cases, retain large portions of the copyrightable expression contained in those works":

Microsoft both invests in and supplies OpenAI, providing it with access to the Redmond, Washington, giant's Azure cloud computing technology.

The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the "billions of dollars in statutory and actual damages" it believes it is owed for the "unlawful copying and use of The Times's uniquely valuable works."

[...] The Times said in an emailed statement that it "recognizes the power and potential of GenAI for the public and for journalism," but added that journalistic material should be used for commercial gain with permission from the original source.

"These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise," the Times said.

"Settled copyright law protects our journalism and content. If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so."

[...] OpenAI has tried to allay news publishers' concerns. In December, the company announced a partnership with Axel Springer (the parent company of Business Insider, Politico, and European outlets Bild and Welt), which would license its content to OpenAI in return for a fee.

Also at CNBC and The Guardian.

Previously:

NY Times Sues Open AI, Microsoft Over Copyright Infringement

NY Times sues Open AI, Microsoft over copyright infringement:

In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, about four months after the company was reportedly considering suing, the suit has now been filed.

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

Journalism is expensive

The suit notes that The Times maintains a large staff that allows it to do things like dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.

All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall. In addition, each print edition has a copyright notification, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories. In addition to driving revenue, these restrictions also help it to maintain its reputation as an authoritative voice by controlling how its works appear.

The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times's permission or authorization, Defendants' tools undermine and damage The Times's relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.

Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains information from 16 million unique records from sites published by The Times. That places the Times as the third most referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details of the data used for training of recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. [...] Expect access to training information to be a major issue during discovery if this case moves forward.

Not just training

A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants' GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.


Original Submission #1, Original Submission #2, Original Submission #3

 
This discussion was created by martyb (76) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 5, Interesting) by Rich on Friday December 29 2023, @04:45PM (3 children)

    by Rich (945) on Friday December 29 2023, @04:45PM (#1338250) Journal

    Search engines copy for profit (e.g. Google Cache), too, and they copy even more verbatim than any AI that merely absorbs context of what it sees. To be able to update a particular web page, the old text must be purged from the word index and the new text must be added. This makes it unavoidable to keep a copy of the old text. The "robots.txt" "gentlemen's agreement" is a de facto legal practice that makes it possible to rip anything that's not explicitly excluded. The "payment" in exchange for your data was that (in the olden days) you could be found, or (today) receive monetizable clicks. With AI models trained on your data, you get nothing.

    With AI, the "amount of copying" is in theory less than what any search engine does for profit. Although, in practice, the AI operators certainly keep their "corpus" to train on stored as well.

    With a technology so important, letting a few random (or carefully picked?) judges create case law that entrenches what the players with the biggest legal budget want leaves out the people, who, according to the narrative, can impose their will through laws. The implications of the technology should be discussed in the parliaments, and if laws need to be introduced to guide it, the lawmakers should provide clear written law on behalf of the people they represent. New, properly written laws seem to be enacted only when those players feel that what they can buy in courtrooms is not enough.

    Note that I didn't write what the outcome of this legal process should be; that is up for discussion and should be decided by majority vote. I'm sticking to the narrative here and leaving out the issue of backroom lobbying and transparency, or lawmakers sitting idle to let case law be established by their puppeteers. But as is, the whole process is a failure of democracy. The ruleset to be established will be fallout from multinational plutocracy.

  • (Score: 2) by mcgrew on Saturday December 30 2023, @12:31AM (2 children)

    by mcgrew (701) <publish@mcgrewbooks.com> on Saturday December 30 2023, @12:31AM (#1338299) Homepage Journal

    True, but all you have to do to keep Google or any other search engine from copying any of your page is to include a robots.txt file. But if Google can't find it, why publish it on the internet to begin with?

    --
    It is a disgrace that the richest nation in the world has hunger and homelessness.
    • (Score: 2) by Rich on Saturday December 30 2023, @12:15PM (1 child)

      by Rich (945) on Saturday December 30 2023, @12:15PM (#1338339) Journal

      Entirely legit question. Technically, when the search engines started in the late 90s, they committed the biggest copyright infringement ever, after there was a legal shift from "copyright must be registered" to "everything is copyrighted". But, no plaintiff, no judge. The whole "abandonware" thing is similar. The stuff is absolutely necessary for conserving digital history. Technically, it's completely illegal, but as of now, it seems to be tolerated.

      Basic copyright law and the Berne Convention predate any automated information processing, and there wasn't any major public consideration of how to deal with the information age in law. As technology progresses (photocopier, magnetic tape, computers, internet, machine learning), the public gets a few backroom deals shoved down their throats that fortify corporate power (TRIPS, DMCA, CTEA, and their international equivalents). But at no time was anything even discussed at a lawmaking level that the public would consider sensible. Like "when a vendor drops supply or support of something, it's fair game", which might look like a mandatory principle for sustainable development.

      What I say is that the lawmakers should deal with such things and codify them on behalf of those they represent, rather than living in a case law world where a single judge gets to decide (another example is the Oracle vs. Google API case, btw).

      • (Score: 2) by mcgrew on Saturday December 30 2023, @06:50PM

        by mcgrew (701) <publish@mcgrewbooks.com> on Saturday December 30 2023, @06:50PM (#1338373) Homepage Journal

        The only problem is that in the US, democracy is dead. The 1% with the highest incomes basically write the laws for your "representatives". It's a racket. "Nice campaign ya got there, Senator, shame if I was to give your opponent's campaign five hundred million and you five million instead of each of you getting two hundred fifty million."

        --
        It is a disgrace that the richest nation in the world has hunger and homelessness.