
posted by martyb on Friday December 29 2023, @03:13PM

New York Times Sues Microsoft, ChatGPT Maker OpenAI Over Copyright Infringement

The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind popular AI chatbot ChatGPT, accusing the companies of creating a business model based on "mass copyright infringement," stating their AI systems "exploit and, in many cases, retain large portions of the copyrightable expression contained in those works:"

Microsoft both invests in and supplies OpenAI, providing it with access to the Redmond, Washington, giant's Azure cloud computing technology.

The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the "billions of dollars in statutory and actual damages" it believes it is owed for the "unlawful copying and use of The Times's uniquely valuable works."

[...] The Times said in an emailed statement that it "recognizes the power and potential of GenAI for the public and for journalism," but added that using journalistic material for commercial gain requires permission from the original source.

"These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise," the Times said.

"Settled copyright law protects our journalism and content. If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so."

[...] OpenAI has tried to allay news publishers' concerns. In December, the company announced a partnership with Axel Springer — the parent company of Business Insider, Politico, and European outlets Bild and Welt — which would license its content to OpenAI in return for a fee.

Also at CNBC and The Guardian.

Previously:

NY Times Sues OpenAI, Microsoft Over Copyright Infringement

NY Times sues OpenAI, Microsoft over copyright infringement:

In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, four months after the company was reportedly considering suing, the suit has now been filed.

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

Journalism is expensive

The suit notes that The Times maintains a large staff that allows it to dedicate reporters to a huge range of beats and to engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.

All of that costs money, and The Times earns it by limiting access to its reporting through a robust paywall. In addition, each print edition carries a copyright notification, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories. Beyond driving revenue, these restrictions also help it maintain its reputation as an authoritative voice by controlling how its works appear.

The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times's permission or authorization, Defendants' tools undermine and damage The Times's relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.

Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains 16 million unique records from sites published by The Times. That places The Times as the third most referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details of the data used to train recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. [...] Expect access to training information to be a major issue during discovery if this case moves forward.

Not just training

A number of suits have been filed regarding the use of copyrighted material during the training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants' GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.


Original Submission #1 · Original Submission #2 · Original Submission #3

 
This discussion was created by martyb (76) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1) by MonkeypoxBugChaser on Friday December 29 2023, @07:02PM (6 children)

    by MonkeypoxBugChaser (17904) on Friday December 29 2023, @07:02PM (#1338270) Homepage Journal

    On the one hand I dislike OpenAI: its alignment, bias, and confident lying. On the other hand I hate the NYT: its spreading of propaganda and propping up of dictators.

    Think I have to go with Altman on this one, though. I'd rather there be large language models, especially open source ones. Those can't be trained so easily at home, and a ruling like this would have a chilling effect.

    Even when all the NYT data is purged (the model will be better), everyone else will ask for the same, and the model will be worse...

  • (Score: 3, Insightful) by HiThere on Friday December 29 2023, @07:23PM (4 children)

    by HiThere (866) on Friday December 29 2023, @07:23PM (#1338273) Journal

    Perhaps it would be better if the LLMs were only trained on stuff that was out of copyright, or dedicated to the public domain. If Harry Potter is a good choice, why not Tom Swift or the Oz series?

    --
    Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
    • (Score: 2) by takyon on Friday December 29 2023, @07:41PM (3 children)

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday December 29 2023, @07:41PM (#1338275) Journal

      *IF* full articles and books are popping out of highly compressed LLMs, then there was probably a lot of duplication of the text. Same with copyrighted photos and other unique images popping out of Stable Diffusion. Manage the data better and there's no problem with using Harry Potter as one of 100 billion things in the training.

      Alternatively, the chatbots are being allowed to find paywalled and copyrighted content where it resides on the live Web (for example, archive sites for NYT articles) and they are reproducing that. Lawsuits against Google News are similar.

      I think we're just going to have to wait for the Supreme Court to pick a winner. These companies might be making public domain models in parallel to prepare for a doomsday ruling.

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 1, Funny) by Anonymous Coward on Friday December 29 2023, @10:28PM

        by Anonymous Coward on Friday December 29 2023, @10:28PM (#1338279)

        OR
        Times reporters are using ChatGPT to write their articles.

      • (Score: 0) by Anonymous Coward on Sunday December 31 2023, @02:58AM (1 child)

        by Anonymous Coward on Sunday December 31 2023, @02:58AM (#1338426)

        Actually, if you read the pages Elizabeth cited above, the articles are very similar, but with occasional word-choice differences and missing or added adjectives.

        If you handed one in at school you would absolutely get busted for plagiarism, but if you told 200 English majors "write a 200-word essay on the use of potatoes in the Panhandle during the Depression, written in the style of John Steinbeck," you'd probably get some very similar results too.

  • (Score: 3, Insightful) by Anonymous Coward on Saturday December 30 2023, @12:25AM

    by Anonymous Coward on Saturday December 30 2023, @12:25AM (#1338296)

    > On the other hand I hate the NYT ...

    They do have their good points, for example this recent exposé of very lax workplace auditing. If true, the auditing has routinely been missing serious child labor abuses in the USA:
        https://www.nytimes.com/2023/12/28/us/migrant-child-labor-audits.html [nytimes.com]

    It's behind a paywall, but given the lax behavior of many archive sites, you shouldn't have too much trouble finding a copy (grin). I read it on paper, syndicated in my local newspaper, but was able to grab a bit with a quick ctrl-a, ctrl-c copy before the sign-up page covered it.

    (a photo caption) Miguel Sanchez, 17, came alone to the United States and has been working at an industrial dairy for about two years. Credit: Ruth Fremson/The New York Times

    They’re Paid Billions to Root Out Child Labor in the U.S. Why Do They Fail?

    Private auditors have failed to detect migrant children working for U.S. suppliers of Oreos, Gerber baby snacks, McDonald’s milk and many other products.
    By Hannah Dreier Dec. 28, 2023

    One morning in 2019, an auditor arrived at a meatpacking plant in rural Minnesota. He was there on behalf of the national drugstore chain Walgreens to ensure that the factory, which made the company’s house brand of beef jerky, was safe and free of labor abuses.

    He ran through a checklist of hundreds of possible problems, like locked emergency exits, sexual harassment and child labor. By the afternoon, he had concluded that the factory had no major violations. It could keep making jerky, and Walgreens customers could shop with a clear conscience.

    When night fell, another 150 workers showed up at the plant. Among them were migrant children who had come to the United States by themselves looking for work. Children as young as 15 were operating heavy machinery capable of amputating fingers and crushing bones.

    Migrant children would work at the Monogram Meat Snacks plant in Chandler, Minn., for almost four more years, until the Department of Labor visited this spring and found such severe child labor violations that it temporarily banned the shipment of any more jerky.

    There aren't many news organizations left that are willing to go out and research things like this. You won't find Google News or ChatGPT doing it, that's for sure. You may have heard the press called the "fourth estate"? https://en.wikipedia.org/wiki/Fourth_Estate [wikipedia.org]