Arthur T Knackerbracket has processed the following story:
The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
The generative AI boom is built on scale. The more training data, the more powerful the model.
But there’s a problem. AI companies have pillaged the internet for training data, and many website and data-set owners have started restricting scraping. We’ve also seen a backlash against the AI sector’s practice of indiscriminately scraping online data, in the form of users opting out of making their data available for training, and lawsuits from artists, writers, and the New York Times claiming that AI companies have taken their intellectual property without consent or compensation.
Last week three major record labels—Sony Music, Warner Music Group, and Universal Music Group—announced they were suing the AI music companies Suno and Udio over alleged copyright infringement. The music labels claim the companies made use of copyrighted music in their training data “at an almost unimaginable scale,” allowing the AI models to generate songs that “imitate the qualities of genuine human sound recordings.”
But this moment also sets an interesting precedent for all of generative AI development. Thanks to the scarcity of high-quality data and the immense pressure and demand to build even bigger and better models, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
It will likely take a few years at least before we have legal clarity around copyright law, fair use, and AI training data. But the cases are already ushering in changes. OpenAI has been striking deals with news publishers such as Politico, the Atlantic, Time, and the Financial Times, exchanging money and citations for access to the publishers’ news archives. And YouTube announced in late June that it will offer licensing deals to top record labels in exchange for music for training.
These changes are a mixed bag. On one hand, I’m concerned that news publishers are making a Faustian bargain with AI. For example, most of the media houses that have made deals with OpenAI say the deals stipulate that OpenAI cite its sources. But language models have no reliable grasp of facts and are prone to making things up. Reports have shown that ChatGPT and the AI-powered search engine Perplexity frequently hallucinate citations, which makes it hard for OpenAI to honor its promises.
It’s tricky for AI companies too. This shift could lead them to build smaller, more efficient models, which are far less polluting. Or they may fork out a fortune to access data at the scale they need to build the next big one. Only the companies most flush with cash, and/or with large existing data sets of their own (such as Meta, with its two decades of social media data), can afford to do that. So the latest developments risk concentrating power even further into the hands of the biggest players.
On the other hand, the idea of introducing consent into this process is a good one—not just for rights holders, who can benefit from the AI boom, but for all of us. We should all have the agency to decide how our data is used, and a fairer data economy would mean we could all benefit.
(Score: 3, Insightful) by JoeMerchant on Saturday July 06, @11:50AM (3 children)
...until AI showed the promise of making money, virtually nobody bothered to complain. Now that it has developed to a point that people are willing to pay for it, everybody wants a piece of the action.
It is the same basic principle as business taxes, income taxes, property taxes, protection rackets, royalties and every other tax since the first pimp muscled in on income earned from the oldest profession.
How do we discourage this "taxing" behavior so ingrained in modern society? The usual approach is to put a tax on it, but that's rather meta and seems unlikely to improve the situation.
(Score: 0) by Anonymous Coward on Saturday July 06, @06:19PM (1 child)
People have been paying for it for a while.
(Score: 2) by JoeMerchant on Sunday July 07, @12:27AM
The hype says it's a hockey stick curve and we are near the inflection point. Banks and postal services were paying for OCR automation 40-50 years ago, but the overall market was tiny compared to what is apparently coming.
(Score: 2) by corey on Saturday July 06, @11:28PM
I don’t know if I’d call it taxing, it’s just capitalism doing its thing, fueled by typical ingrained human selfishness.
Agree though, now all the megacorps are clamouring to invest in LLMs so they can monetise them, then sit back, pay off the investments, followed by profit. It’s all hard to watch. I like it when technologies are used to help humanity, but we seem to live in a time when new technologies get sucked up by megacorps as soon as they become viable, and then they use them to abuse people or ruin humanity slightly.
(Score: 5, Insightful) by Ox0000 on Saturday July 06, @11:53AM (4 children)
I see... when it was your or my content that was being scraped, everything was hunky-dory, but now that 'real money^W^Wunrealized profit' is on the line, now things are serious. Ripping off the rubes is fine, "needed for progress" even, but trying to steal things from robber barons, well that's serious and must be stopped!
The current set of hammers that are being called AI is unsustainable, unfit for purpose, and undesirable. Get rid of the lot!
"Generative AI" delenda est!
(Score: 5, Insightful) by JoeMerchant on Saturday July 06, @01:07PM (3 children)
It's our court system. The Robber Barons can fight back in the courts with a wave of their hand to their retained lawyers. Average folks need to sell their house just to start to mount a serious legal challenge.
(Score: 5, Informative) by Thexalon on Saturday July 06, @07:20PM (2 children)
And I'll just add that one of the more consequential SCOTUS decisions you've probably never heard of, AT&T Mobility v. Concepcion, empowers the megacorps to put clauses into their EULAs, employment agreements, and other contracts you pretty much have to agree to in order to function, making it impossible to sue them for any reason. Instead, you are forced into binding arbitration, where the megacorp gets to pick the arbitrator and sends a company-paid lawyer, while you are either representing yourself or paying more for legal representation than the dispute is probably worth. And don't think of joining a class action because you and 30,000 other people were affected by the same misconduct; you signed that away too, so each of you has to bring your case individually. Oh, and they can do this even if your state passed a law saying that they can't.
And it doesn't matter whether you agreed to any of that when you initially signed up, because another clause in these kinds of contracts lets the megacorp change the contract however they like by sending you a single notice somewhere, and your only recourse is to stop doing business with them right then and there.
This is why when people suggest that civil litigation makes the FTC and similar regulatory bodies unnecessary, I can't help but think they're either hopelessly naive or willfully lying.
The only thing that stops a bad guy with a compiler is a good guy with a compiler.
(Score: 2) by JoeMerchant on Saturday July 06, @08:05PM
>paying more for legal representation than the dispute is probably worth.
And an average individual is faced with this once or twice in a lifetime, but a corp like AT&T can repeat the scam millions of times...
AT&T and I got crossed up over a $35 bill that I disagreed with. They thought that was worth "going to the mattresses" for: collection agency calling me at work, black mark (unpaid bill, amount unspecified) on my credit reports... Every time it came up I got to laugh at them with whoever was supposed to think poorly of me for being a "deadbeat". Thing was: in the 30 years since then I never had another bad mark on my credit report, just them. Who do you think is the bad actor in this case?
(Score: 3, Insightful) by Ox0000 on Saturday July 06, @09:38PM
But... but... but... contracts are something that you enter entirely of your own volition. If you don't like the terms, you should not enter the contract, and people who do are dumb. And if you do and get screwed, that's totally your own fault!
</sarcasm> for those frequenters of this site who are libertar^W still on the path of finding out how the world really works.
(Score: 2) by looorg on Saturday July 06, @01:57PM
Didn't Microsoft just days ago claim stuff on the interwebz was free? That said, perhaps they are running out of quality data to train their AI with. They have run out of free cruddy data. The Shit In, Shit Out phase is over and it's not producing desired results as far as their precious AI is concerned? Or naturally all the content producers, aka the free training material producers, are getting wise and all now want a piece of the pie. You apparently get what you pay for.
(Score: 1, Interesting) by Anonymous Coward on Saturday July 06, @08:07PM (1 child)
Since this has started, I've noticed huge spikes in transfer bandwidth over and over again, to places I've never heard of before, transferring almost 2 TB of data per day to the same places over and over again.
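If anyone wants to check where their own data is going, a minimal sketch of one way to do it: sniff outbound packets and tally bytes per remote host. This assumes Python 3 with scapy installed and root privileges; the local address prefix and the 60-second capture window are placeholders to adjust, not anything measured here.

    from collections import Counter
    from scapy.all import IP, sniff

    LOCAL_PREFIX = "192.168."   # assumption: set to your own LAN prefix
    bytes_by_dest = Counter()

    def tally(pkt):
        # Count bytes of packets leaving the local network, keyed by destination.
        if IP in pkt and pkt[IP].src.startswith(LOCAL_PREFIX):
            bytes_by_dest[pkt[IP].dst] += len(pkt)

    # Capture for 60 seconds, then print the ten biggest destinations.
    sniff(prn=tally, store=False, timeout=60)
    for dest, nbytes in bytes_by_dest.most_common(10):
        print(f"{dest:15s} {nbytes / 1e6:10.2f} MB")

Run it during one of those spikes and the "same places over and over" should show up at the top of the list.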
(Score: 2) by acid andy on Saturday July 06, @10:18PM
Data transferring from where? Browser Javascript on websites you visit? Or the OS?
Consumerism is poison.
(Score: 2) by oumuamua on Sunday July 07, @04:48PM (1 child)
Everyone thinks their data is so valuable when in fact even someone like the NYTimes is 0.0000001% of the training set. Everyone is grabbing for a piece of the pie. Let's hope capitalism and this scarcity mindset do not get baked into AGI when it emerges: https://www.genolve.com/design/socialmedia/memes/writers-artists-want-compensation-for-training-the-LLM [genolve.com]
(Score: 2) by Ox0000 on Monday July 08, @08:23PM
Your argument seems to come down to: "It's fine for corporations to steal .0001 cents from every individual because it's only a small amount of money that each person loses; who cares if their cumulative haul is in the hundreds of billions of ill-gotten gains".
That doesn't seem right to me as it excuses theft. Theft is still theft. And that's what these LLM companies are doing: thievery.
(Score: 3, Interesting) by Dale on Monday July 08, @01:22PM
I think the solution to this overall issue is already out there. We have copyright for a reason. It is there so that after a limited period of time things enter the public domain for the use and benefit of all. If AI training isn't the poster child for what the public domain is supposed to be for, then I don't know what is. Let that be the line on what can and cannot be used. It would also have the side benefit of us getting to watch the tech industry go to war with the copyright industry, and maybe we get copyrights back down to something reasonable. In all the stories on this, I have been shocked that I haven't seen this very recommendation or point discussed.