The New York Times on Wednesday filed a lawsuit against Microsoft and OpenAI, the company behind the popular AI chatbot ChatGPT, accusing the companies of creating a business model based on "mass copyright infringement." The complaint states that their AI systems "exploit and, in many cases, retain large portions of the copyrightable expression contained in those works."
Microsoft both invests in and supplies OpenAI, providing it with access to the Redmond, Washington, giant's Azure cloud computing technology.
The publisher said in a filing in the U.S. District Court for the Southern District of New York that it seeks to hold Microsoft and OpenAI to account for the "billions of dollars in statutory and actual damages" it believes it is owed for the "unlawful copying and use of The Times's uniquely valuable works."
[...] The Times said in an emailed statement that it "recognizes the power and potential of GenAI for the public and for journalism," but added that journalistic material should only be used for commercial gain with permission from the original source.
"These tools were built with and continue to use independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise," the Times said.
"Settled copyright law protects our journalism and content. If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so."
[...] OpenAI has tried to allay news publishers' concerns. In December, the company announced a partnership with Axel Springer — the parent company of Business Insider, Politico, and European outlets Bild and Welt — which would license its content to OpenAI in return for a fee.
Also at CNBC and The Guardian.
Previously:
NY Times sues OpenAI, Microsoft over copyright infringement:
In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, four months after the company was reportedly considering suing, the suit has now been filed.
The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses its technology to power the Copilot service and helped provide the infrastructure for training the GPT large language models. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.
Journalism is expensive
The suit notes that The Times maintains a large staff that allows it to dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.
All of that costs money, and The Times earns it by limiting access to its reporting through a robust paywall. In addition, each print edition carries a copyright notice, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories. Beyond driving revenue, these restrictions help it maintain its reputation as an authoritative voice by controlling how its works appear.
The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times's permission or authorization, Defendants' tools undermine and damage The Times's relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.
Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains 16 million unique records of content from sites published by The Times. That places The Times as the third most-referenced source, behind Wikipedia and a database of US patents.
OpenAI no longer discloses as many details of the data used to train recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. [...] Expect access to training information to be a major issue during discovery if this case moves forward.
Not just training
A number of suits have been filed regarding the use of copyrighted material during training of AI systems. But the Times' suit goes well beyond that to show how the material ingested during training can come back out during use. "Defendants' GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
Original Submission #1 Original Submission #2 Original Submission #3
(Score: 5, Interesting) by Rosco P. Coltrane on Friday December 29 2023, @03:24PM (1 child)
There. FTFY.
The cloud isn't about providing online services: it's about collecting as much private data as possible and monetizing it. It's always been about that.
The New York Times is absolutely right: it is a business model based on massive copyright infringement. But it's wrong on two things: it's not new, and it's not just Microsoft and OpenAI. It's been going on for decades, and all Big Data players essentially owe their very existence to the business of exploiting data they have no right to exploit.
The difference between then and now is that the data they had no right to exploit wasn't exploited directly: it was digested and used indirectly for the purpose of advertisement. For example, when Google's surveillance collective has your medical file because your healthcare provider put your medical data in their cloud, and it knows you have some disease, and you keep getting advertisement more or less closely related to that disease, you have a hunch that Google is using data it shouldn't be using but you can't prove it.
With AI, you can prove the infringement clear as day: chat with the stupid bot long enough and it will regurgitate your own data back to you verbatim. That's the difference.
(Score: 2) by The Vocal Minority on Sunday December 31 2023, @05:58AM
If you have any actual proof this is happening then please provide it. The contractual arrangements around the use of Azure, and other similar cloud platforms, provide guarantees of privacy for customer data; otherwise customers wouldn't use them. I also believe the data is actually encrypted at rest, with the private key in the customer's possession. This is not Gmail, where Google explicitly tells you they are going to look through your e-mails.
Yes, there are no guarantees, and trust is required that the cloud infrastructure is actually doing what you are told it is doing. But that is no different from running closed-source software in general and/or using a third-party data center.
Personally, I am very suspicious of these cloud platforms and I think they give big American tech companies way too much power, but if I am to convince people not to use them, then I need proof that these privacy abuses are happening. Otherwise it is all just a bunch of paranoid ranting.
(Score: 4, Interesting) by Runaway1956 on Friday December 29 2023, @03:37PM (10 children)
MS is pretty damned big. AI is bigger than Microsoft - almost everyone is investing in it. NYT has a lot of legal firepower to bring to bear. Time to sit back, and watch the fireworks. On the one hand, we want to see current copyright law seriously crippled. On the other hand, we'd like to see AI wither on the vine and die. Maybe we'll get lucky, and the entire publishing world joins with NYT against all of AI, and they mutually extinguish each other.
In the aftermath, just maybe some reasonable legislation regarding copyright as well as AI are passed? Not much chance, but maybe. I can dream, can't I?
How do we get the advertising industry involved with all of this? They need to take some serious damage from it, somehow.
“I have become friends with many school shooters” - Tampon Tim Walz
(Score: 2) by looorg on Friday December 29 2023, @03:59PM (2 children)
> How do we get the advertising industry involved with all of this? They need to take some serious damage from it, somehow.
What the heck is "the copyrightable expression"? Is that slogans or what? I would gather the same thing tho: IF this works for the NYT, then everyone in print, audio, and visual media will go after OpenAI and the others that have harvested data for their AI, looking for a payday. This will include advertisers, certainly so if these "copyrightable expressions" are a thing; that sounds like something from advertising. That said, that will just be another payday for them and not actually a loss.
(Score: 2) by mcgrew on Friday December 29 2023, @10:52PM (1 child)
What the heck are "the copyrightable expression"?
Simple, it's a bullshit phrase and means nothing. For anything to be covered by copyright in the US you have to register it with the Library of Congress. Not everything can be copyrighted; two examples are food recipes and clothing patterns.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 1, Informative) by Anonymous Coward on Saturday December 30 2023, @12:00AM
> For anything to be covered by copyright in the US you have to register it with the Library of Congress.
Not true, please read,
https://www.copyright.gov/help/faq/faq-general.html [copyright.gov]
To clarify/correct your statement, here are two of the FAQs:
(Score: 0) by Anonymous Coward on Friday December 29 2023, @03:59PM
There is zero chance with today's political climate that anything good will come out of
the latest tech money grab.
(Score: 4, Insightful) by bzipitidoo on Friday December 29 2023, @05:53PM (3 children)
I, too, wish to see big changes in copyright and related law. But the law and lawmakers are hidebound. They're going to continue having these stupid fights over the ownership and control of immaterial things that shouldn't be controlled at all. That won't change until we the people make them change.
An interesting aspect is that this is an issue over which the media cannot maintain a detached stance and perform unbiased reporting. They believe copyright is their bread and butter, and they slant their reporting accordingly, while doing all they can to appear properly neutral, balanced and fair. So deeply ingrained is the thinking that they don't see this about themselves, not on this matter.
(Score: 5, Interesting) by mcgrew on Friday December 29 2023, @10:56PM (2 children)
My take on copyright is it took almost a whole year to write that damned book. I wrote it to be read, not to make money on. But if you profit from it, I should get a very huge chunk of the profit.
THAT is the purpose of copyright, but since 1900 it has been terribly perverted.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 2) by bzipitidoo on Saturday December 30 2023, @03:38AM (1 child)
With this, I agree. Creators should be compensated for their work. The problem is that copyright isn't a good means to that end. It still works, somewhat, but not well. I argue that it never did work well, and it causes lots of other problems. It has warped our society. (I've written an essay about how copyright has warped us, which I suppose I should post to my blog.) But in the past, alternatives were lacking. Basically, the only viable alternative was patronage. Live performance was also done, but that was severely limited by the lack of such things as microphones and amplifiers. I have read that about the largest audience that could be accommodated by an amphitheater with good acoustics was 3000, and the lack of sanitation, transportation, and communication made it both harder and riskier to assemble such a crowd.
Now, however, changes in technology have given us more options even as they have made copyright untenable. In past centuries, patronage was accessible only to the wealthy, both individuals and groups. For instance, pretty much every large city supports an orchestra. But now, we can crowdfund. I have bought Humble Bundles for just one item. I'd say I've played maybe 5% of the games I've bought through Humble Bundle, and that's okay; the bundles were so low cost that I don't mind. Happy to help crowdfund.
By the way, I keep meaning to give your fiction a read. Haven't gotten around to it.
(Score: 2) by mcgrew on Saturday December 30 2023, @06:45PM
Copyright worked well until the mouse got its claws on it. IIRC, in 1900 the copyright term was twenty years, plenty long enough to make any cash. After twenty years a work went into the public domain, and anyone with a printing press could publish it for free.
It was never about copying. Copyright was always about publishing. As to music, sheet music could be copyrighted but not songs, as sheet music was the only way of recording music before the twentieth century.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 2) by takyon on Friday December 29 2023, @06:48PM
Corporate AI may die, open source AI will live on.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 4, Informative) by mcgrew on Friday December 29 2023, @10:49PM
On the other hand, we'd like to see AI wither on the vine and die.
As opposed to true creativity. I think this snippet from the new book is germane here:
Copyright shouldn't go away, but it needs to go back to what it was in 1880 (with the exception of the Home Recording Act of 1978). AI has its uses, but those who use it must be kept on a tight leash. I'm thinking of something I saw on Fakebook: "rather than bringing us Star Trek, the space billionaires seem to want Dune."
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 4, Troll) by DannyB on Friday December 29 2023, @04:24PM (4 children)
If you don't want people or machines to read your content, or view your images . . .
DON'T PUT THEM ONLINE !!!
The Centauri traded Earth jump gate technology in exchange for our superior hair mousse formulas.
(Score: 1, Funny) by Anonymous Coward on Friday December 29 2023, @05:35PM
How's that cave working for you? Drafty at all?
I get what you're saying and all, but that ship has sailed. That other boat waiting for you is the B Ark.
(Score: 4, Insightful) by stormreaver on Friday December 29 2023, @07:34PM
They want people (not so much machines) to read their content, but they want to be paid for it (creating it is costly). That is totally reasonable. That OpenAI was caught red-handed holding the smoking gun (despite persistently lying about possessing the capability to hold one) is about as strong as a case can get.
That said, the entire court experience is a case study in entropy at work, so I will not make a prediction. However, I hope OpenAI and Microsoft get taken to the cleaners.
(Score: 4, Interesting) by mcgrew on Saturday December 30 2023, @12:05AM (1 child)
Copyright isn't about keeping you from reading the content, despite what the Music And Film Association of America (MAFIAA) would have you believe. It protects the author from the publisher, not from the reader. In the case of music, congress specifically legalized home recording in the US in the 1978 Home Recording Act. This despite the fact that copyright was always about publishing, not copying.
Do you want to read my book? You might not have a chance if anybody can make money publishing it except me. It might not have even been written.
That said, I give my stuff away. I'm trying to sell copies of the new one to join the SFWA, it will be free in September or sooner and I abhor Digital Restrictions Management and don't use it. But there aren't many like me, Harry Potter might have never existed had it not been for England's welfare laws.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 0) by Anonymous Coward on Saturday December 30 2023, @01:44PM
> Copyright isn't about keeping you from reading the content, despite what the Music And Film Association of America (MAFIAA) would have you believe. It protects the author from the publisher, not from the reader.
Wouldn't it be closer to correct to say that:
a. (C) protects from unscrupulous publishers who would publish a work without entering into a contract with the creator.
b. The barriers of entry to becoming a publisher have changed greatly with improved technology. With consumer audio/video tape, photocopy machines and now digital copies, just about anyone can be a publisher. It's now so easy that most people who copy works may not even recognize that they are publishing.
(Score: 5, Informative) by ElizabethGreene on Friday December 29 2023, @04:45PM (21 children)
There are some interesting things in the complaint here. The NYT claims to have convinced ChatGPT to return substantial portions of the original full text of multiple articles.
Complaint PDF [nytimes.com] Pages 30-40.
(Score: 2) by Ox0000 on Friday December 29 2023, @05:30PM (2 children)
+1 Informative.
That's uncanny...
(Score: 0) by Anonymous Coward on Friday December 29 2023, @06:07PM
To the GP, thanks for digging into the text.
> That's uncanny...
That's not very surprising...others have reported similar in recent weeks.
ftfy
(Score: 2) by mcgrew on Saturday December 30 2023, @12:11AM
That's uncanny...
I see you're not a database guy, it's not surprising at all.
Here's [soylentnews.org] how AI works. A journal from March explaining the magic.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 4, Interesting) by krishnoid on Friday December 29 2023, @06:06PM (13 children)
There are also some interesting things in the complaint here. The NYT claims to have convinced ChatGPT to return substantial portions of the original full text of multiple comments.
(Score: 1) by MonkeypoxBugChaser on Friday December 29 2023, @07:06PM (11 children)
Yea, likely through an exploit. That whole repeat "poem" forever type thing.
(Score: 2) by HiThere on Friday December 29 2023, @07:21PM (10 children)
OK, but it still implies that ChatGPT memorized the articles in some sense. That's definitely making a copy, just like you do when you reread a poem several times, or a favorite author. (A friend knew someone who could essentially recite the Lord of the Rings. I can do pieces of it, largely poems.)
So the problem is that if NYT wins the case, the next target may be remembering stuff. I don't really see a clear demarcation.
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 2) by mcgrew on Saturday December 30 2023, @12:27AM (5 children)
Computers don't "think" any more than a printed book thinks. Stored information is not thought, except that it's the thought of the original thinker who wrote it down or typed it into a computer.
I should write an article about copyright, since so few seem to have any idea about it, thanks to the MAFIAA. Copyright is NOT about copying or storing data, it's about publishing it. And it's not automatic in the US thanks to (I think, information is hard for me to find) a lawsuit, as Bowker (the ISBN people) informed me.
If I put my book on the internet, I have published it. Copyright gives me a "limited time*" monopoly on publication. It has to be registered and costs sixty bucks to register in the US. Recording that Metallica album and giving a copy to your friend is perfectly legal, no matter what that greedy asshole Lars Ulrich thinks.
* The Bono Act gives me a "limited time" monopoly of ninety-five years longer than my life. I don't see how I'm going to be enticed to write any more books after I'm dead. SCOTUS ruled against common sense and logic, ruling that "limited" means whatever Congress says it means.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 2) by HiThere on Saturday December 30 2023, @01:40AM (4 children)
That's an assertion I've heard before, but I've never seen any good proof of it.
Actually, proof is slightly the wrong word. What's missing is a definition of "thought" that includes what people do, doesn't include what computers do, and doesn't depend on handling them as a special case. The first version of that assertion that I heard was that computers would never play good chess because they can't think. The assertion that they couldn't play good chess was already false at the time, though they weren't up to expert level.
So give me your explicit definition and perhaps I'll accept that, by your usage, computers can't think. Otherwise I'll just remember the old saying in AI that "intelligence is whatever we haven't managed to do yet".
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 3, Informative) by mcgrew on Saturday December 30 2023, @02:27AM (3 children)
I put it succinctly in the story Sentience [mcgrewbooks.com]. It's written in the first person, the narrator is a sentient computer.
My view that there will never be a Turing-architecture sentient computer comes mostly from the fact that I've studied the schematics of computer components like the ALU (Arithmetic Logic Unit) and written a two-player battle tanks game in Z-80 machine code. A computer is no smarter than a printed book.
Now, replicants, like in RUR [mcgrewbooks.com], with history's first use of the word "robot", or Do Androids Dream of Electric Sheep? ("Blade Runner") may and probably will be sentient.
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 2) by deimtee on Sunday December 31 2023, @02:48AM (2 children)
You are showing an organic bias. There is nothing the cells in the brain do that can't be done on a computer. We just haven't written a program that complicated yet. (Well, publicly at least; I don't know what the TLAs have.)
Reductio ad absurdum:
We can write a program to simulate a neuron. We can write a program to simulate an axon. We can design a message passing algorithm that simulates the interconnections. We can design self-modifying programs that mimic the changes in neurons and axons as they are used.
We can freeze a brain and examine it neuron by neuron and reproduce the neurons and interconnections in it in silicon and programming. It would take a massive effort and a huge amount of computer power but when you turned that program/machine on it would produce the same output as the brain that was scanned.
The main difference is that a book is static. A computer can execute code and change the stored information. The glider gun in the Life program demonstrates that even a very simple system can have unlimited growth. It's not intelligent, but neither is a bacterium. You have to build up to intelligence. As far as I know, the simulationists have got as far as a small worm with a few neurons. I think there is a group working on simulating a fly's brain.
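The glider-gun point is easy to demonstrate concretely. Here is a minimal sketch of Conway's Life in Python, using a plain glider rather than a full gun to keep it short; the rule set is the standard one (a cell survives with 2-3 neighbors, is born with exactly 3):

```python
from collections import Counter

def step(live):
    """One generation of Conway's Life over a set of live (x, y) cells."""
    # Count how many live neighbors every candidate cell has.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A glider: after 4 generations the same shape reappears,
# shifted one cell down and one cell right.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
g = glider
for _ in range(4):
    g = step(g)
```

The self-propagating behavior falls out of the three-line rule, which is the point being made: very simple systems can produce open-ended activity.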
One job constant is that good employers have low turnover, so opportunities to join good employers are relatively rare.
(Score: 1, Insightful) by Anonymous Coward on Monday January 01 2024, @02:09AM
late to the party, but in case anyone is still reading..
> but when you turned that program/machine on it would produce the same output as the brain that was scanned.
Um, yes. But don't forget GIGO. The inputs to the human brain are also somehow encoded, and not much of this is understood yet either. How many processing layers are in the eye before any signals are sent down the optic nerve? The same applies to I/O with all the other organs, both internal and near the skin. Without all that I/O, a synthetic brain isn't going to be useful.
Back to the drawing board.
(Score: 2) by mcgrew on Sunday January 07 2024, @06:51PM
There is nothing the cells in the brain do that can't be done on a computer.
Fractions. Divide one by three on a computer. Making anything actually original.
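For what it's worth, the divide-one-by-three example cuts both ways: binary floating point can only approximate 1/3, but exact rational arithmetic is routine in software. A quick Python illustration:

```python
from fractions import Fraction

# Binary floating point cannot represent 1/3 (or 0.1) exactly,
# so familiar identities can fail:
print(0.1 + 0.2 == 0.3)   # False: both sides are binary approximations
print(1 / 3)              # 0.3333333333333333 (an approximation)

# Exact rational arithmetic, on the other hand, has no rounding error:
third = Fraction(1, 3)
print(third + third + third == 1)   # True
```

So the limitation is in one particular number representation, not in computers as such.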
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 0) by Anonymous Coward on Saturday December 30 2023, @04:13PM
And humans are responsible for any copyright infringement they make. So some people won't be producing infringing copies of that stuff even if they have good enough memory to do so.
If Microsoft provided proof that they trained their AI on their own source code (Windows, Microsoft Office, etc.) AND then publicly guaranteed that the output of their AI can be used without any copyright issues, especially guaranteeing that any output won't be infringing on Microsoft's copyright, then sure, I might start having a bit more confidence that they're not infringing. And if it happens to output useful Win32 stuff that WINE and ReactOS can now use legally, well, too bad for Microsoft.
But instead they train their AI on OTHER people's copyrighted stuff and say that they are not infringing "because AI". To me that's laundering copyright infringement (e.g. GPLed stuff): https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data
As the "poem" exploit confirms, these types of AIs hold, and can produce, infringing copies of stuff.
Some idiots argue it's not infringement because the actual stored data doesn't look like the copyrighted stuff and is a lot smaller. If that's a good enough excuse then if I convert a copyrighted Blu-ray to HEVC, I won't be infringing since the data stored and distributed is now very different and a lot smaller. And yes it's provably lossy too - in many cases the output is not 100% the same. But nope, it's still considered infringement.
(Score: 2) by maxwell demon on Sunday December 31 2023, @11:47AM (2 children)
If I reproduce large chunks of an article from memory and give them to whoever wants them, I'm already violating copyright. It doesn't matter that I first memorized the text and then wrote it down on request instead of writing it down as I read it.
The Tao of math: The numbers you can count are not the real numbers.
(Score: 2) by HiThere on Sunday December 31 2023, @05:31PM (1 child)
So singing a song is violation of copyright. Somehow I didn't think copyright law was quite that stupid.
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 0) by Anonymous Coward on Monday January 01 2024, @02:11AM
old news, see https://support.easysong.com/hc/en-us/articles/360047682433-What-is-a-Public-Performance-License- [easysong.com]
(Score: 2, Funny) by cereal_burpist on Saturday December 30 2023, @05:28AM
Whoever purchases Hyperloop's deserted test track in the state of Nevada will have an exceptionally large children's water toy, if they so desire.
(Score: 5, Funny) by stormreaver on Friday December 29 2023, @07:57PM
The Times must be lying, as those good, honest, God-fearing people at Microsoft and OpenAI would never mislead and abuse the public.
(Score: 3, Funny) by deimtee on Friday December 29 2023, @09:26PM
Wow. It's like the New York Times got ChatGPT to write their articles.
One job constant is that good employers have low turnover, so opportunities to join good employers are relatively rare.
(Score: 2) by hendrikboom on Saturday December 30 2023, @12:44AM (1 child)
Returning those documents? Does it mean anything, with the AIs already trained?
(Score: 2) by cereal_burpist on Saturday December 30 2023, @04:35AM
(Score: 5, Interesting) by Rich on Friday December 29 2023, @04:45PM (3 children)
Search engines copy for profit (e.g. Google Cache), too, and they copy even more verbatim than any AI that merely absorbs context of what it sees. To be able to update a particular web page, the old text must be purged from the word index and the new text must be added. This makes it unavoidable to keep a copy of the old text. The "robots.txt" "gentlemen's agreement" is a de facto legal practice that makes it possible to rip anything that's not explicitly excluded. The "payment" in exchange for your data was that (in the olden days) you could be found, or (today) receive monetizable clicks. With AI models trained on your data, you get nothing.
With AI, the "amount of copying" is in theory less than what any search engine does for profit. Although, in practice, the AI operators certainly keep their "corpus" to train on stored as well.
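The robots.txt opt-out mentioned above is easy to experiment with; Python's standard library ships a parser for the format. A minimal sketch (the crawler names and paths here are just examples, not anyone's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one AI crawler banned outright,
# everyone else kept out of /articles/.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /articles/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/index.html"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/x"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # True
```

Of course, this only works if the crawler bothers to check; as the comment says, it's a gentlemen's agreement, not an enforcement mechanism.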
With a technology so important, letting a few random (or carefully picked?) judges create case law that entrenches what the players with the biggest legal budgets want leaves out the people, who according to the narrative can impose their will through laws. The implications of the technology should be discussed in the parliaments, and if laws need to be introduced to guide it, the lawmakers should provide clear written law on behalf of the people they represent. New proper written laws only seem to be enacted when those players feel that what they can buy in courtrooms is not enough.
Note that I didn't write what the outcome of this legal process should be; that is up for discussion and should be decided by majority vote. I'm sticking to the narrative here and leaving out the issue of backroom lobbying and transparency, or lawmakers sitting idle to let case law be established by their puppeteers - but as is, the whole process is a failure of democracy. The ruleset to be established will be fallout from multinational plutocracy.
(Score: 2) by mcgrew on Saturday December 30 2023, @12:31AM (2 children)
True, but all you have to do to keep Google or any other search engine from copying your pages is to serve a robots.txt file that disallows them. But if Google can't find it, why publish it on the internet to begin with?
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 2) by Rich on Saturday December 30 2023, @12:15PM (1 child)
Entirely legit question. Technically, when the search engines started in the late 90s, they committed the biggest copyright infringement ever, after there was a legal shift from "copyright must be registered" to "everything is copyrighted". But, no plaintiff, no judge. The whole "abandonware" thing is similar. The stuff is absolutely necessary for conserving digital history. Technically, it's completely illegal, but as of now, it seems to be tolerated.
Basic copyright law and the Berne Convention predate any automated information processing, and there was never any major public consideration of how to deal with the information age in law. As technology progresses (photocopier, magnetic tape, computers, internet, machine learning), the public gets a few backroom deals shoved down their throats that fortify corporate power (TRIPS, DMCA, CTEA, and their international equivalents). But at no time was anything the public would consider sensible even discussed at a lawmaking level. Like "when a vendor drops supply or support of something, it's free game", which might look like a mandatory principle for sustainable development.
What I say is that the lawmakers should deal with such things and codify them on behalf of who they represent, rather than living in a case law world where a single judge gets to decide (another example is the Oracle vs. Google API case, btw).
(Score: 2) by mcgrew on Saturday December 30 2023, @06:50PM
The only problem is that in the US, democracy is dead. The 1% with the highest incomes basically write the laws for your "representatives". It's a racket. "Nice campaign ya got there, Senator, shame if I was to give your opponent's campaign five hundred million and you five million, instead of each of you getting two hundred fifty million."
Impeach Donald Saruman and his sidekick Elon Sauron
(Score: 1) by MonkeypoxBugChaser on Friday December 29 2023, @07:02PM (6 children)
On the one hand I dislike OpenAI, its alignment, bias, and confident lying. On the other hand I hate the NYT, its spreading of propaganda and propping up of dictators.
Think I have to go with Altman on this one though. I'd rather there be large language models, especially open source ones. Those can't be trained so easily at home, and this ruling would have a whole chilling effect.
Even when all NYT data is purged (the model will be better) everyone else will ask the same; the model will be worse....
(Score: 3, Insightful) by HiThere on Friday December 29 2023, @07:23PM (4 children)
Perhaps it would be better if the LLMs were only trained on stuff that was out of copyright, or dedicated to the public domain. If Harry Potter is a good choice, why not Tom Swift or the Oz series?
Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
(Score: 2) by takyon on Friday December 29 2023, @07:41PM (3 children)
*IF* full articles and books are popping out of highly compressed LLMs, then there was probably a lot of duplication of the text. Same with copyrighted photos and other unique images popping out of Stable Diffusion. Manage the data better and there's no problem with using Harry Potter as one of 100 billion things in the training.
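"Manage the data better" usually starts with deduplication of the training corpus. Real pipelines use fuzzier near-duplicate detection (MinHash and the like), but an exact-match first pass can be sketched in a few lines (the function and corpus here are my own illustration):

```python
import hashlib

def dedup(docs):
    """Keep the first copy of each document, comparing documents after
    light normalization (lowercasing, collapsed whitespace)."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Hogwarts is a school.", "hogwarts  is a school.", "Something else."]
deduped = dedup(corpus)   # drops the near-identical second copy
```

The point above is exactly this: a text that appears once among billions of documents is far less likely to be memorized verbatim than one that appears thousands of times.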
Alternatively, the chatbots are being allowed to find paywalled and copyrighted content where it resides on the live Web (for example, archive sites for NYT articles) and they are reproducing that. Lawsuits against Google News are similar.
I think we're just going to have to wait for the Supreme Court to pick a winner. These companies might be making public domain models in parallel to prepare for a doomsday ruling.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 1, Funny) by Anonymous Coward on Friday December 29 2023, @10:28PM
OR
Times reporters are using ChatGPT to write their articles.
(Score: 0) by Anonymous Coward on Sunday December 31 2023, @02:58AM (1 child)
Actually, if you read the pages Elizabeth cited above, the articles are very similar but with occasional word choice differences and missing/added adjectives.
If you handed one in at school you absolutely would get busted for plagiarism, but if you told 200 English majors "write a 200-word essay on the use of potatoes in the panhandle, in the Depression, written in the style of John Steinbeck" you'd probably get some very similar results too.
(Score: 2) by takyon on Sunday December 31 2023, @05:35AM
That's the way it was with duplicated data in Stable Diffusion training sets:
https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/ [arstechnica.com]
https://cdn.arstechnica.net/wp-content/uploads/2023/02/image_extraction_hero_1.jpg [arstechnica.net]
When the same image makes its way into the training set a bunch of times, you can get output that is obviously recognizable as a copy but not pixel perfect.
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 3, Insightful) by Anonymous Coward on Saturday December 30 2023, @12:25AM
> On the other hand I hate the NYT ...
They do have their good points, for example this recent exposure of very lax workplace auditing. If true, the auditing has routinely been missing serious child labor abuses in the USA:
https://www.nytimes.com/2023/12/28/us/migrant-child-labor-audits.html [nytimes.com]
It's behind a paywall, but given the lax behavior of many archive sites, you shouldn't have too much trouble finding a copy(grin). I read it on paper, syndicated in my local newspaper, but was able to grab a bit by a quick ctrl-a, ctrl-c copy before the sign-up page covered it.
There aren't many news organizations left that are willing to go out and research things like this. You won't find Google News or ChatGPT doing it, that's for sure. You may have heard the press called the "fourth estate"? https://en.wikipedia.org/wiki/Fourth_Estate [wikipedia.org]