Arthur T Knackerbracket has processed the following story:
The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
The generative AI boom is built on scale. The more training data, the more powerful the model.
But there’s a problem. AI companies have pillaged the internet for training data, and many website and data-set owners have started restricting scraping. We’ve also seen a backlash against the AI sector’s practice of indiscriminately scraping online data, in the form of users opting out of making their data available for training, and lawsuits from artists, writers, and the New York Times claiming that AI companies have taken their intellectual property without consent or compensation.
Last week three major record labels—Sony Music, Warner Music Group, and Universal Music Group—announced they were suing the AI music companies Suno and Udio over alleged copyright infringement. The music labels claim the companies made use of copyrighted music in their training data “at an almost unimaginable scale,” allowing the AI models to generate songs that “imitate the qualities of genuine human sound recordings.”
But this moment also sets an interesting precedent for all of generative AI development. Thanks to the scarcity of high-quality data and the immense pressure and demand to build even bigger and better models, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
It will likely take a few years at least before we have legal clarity around copyright law, fair use, and AI training data. But the cases are already ushering in changes. OpenAI has been striking deals with news publishers such as Politico, the Atlantic, Time, and the Financial Times, exchanging money and citations for access to the publishers’ news archives. And YouTube announced in late June that it will offer licensing deals to top record labels in exchange for music for training.
These changes are a mixed bag. On one hand, I’m concerned that news publishers are making a Faustian bargain with AI. For example, most of the media houses that have made deals with OpenAI say the deals stipulate that OpenAI cite its sources. But language models have no reliable grasp of facts and are prone to making things up. Reports have shown that ChatGPT and the AI-powered search engine Perplexity frequently hallucinate citations, which makes it hard for OpenAI to honor its promises.
It’s tricky for AI companies too. This shift could lead them to build smaller, more efficient models, which are far less polluting. Or they may fork out a fortune to access data at the scale they need to build the next big one. Only the companies most flush with cash, and/or with large existing data sets of their own (such as Meta, with its two decades of social media data), can afford to do that. So the latest developments risk concentrating power even further into the hands of the biggest players.
On the other hand, the idea of introducing consent into this process is a good one—not just for rights holders, who can benefit from the AI boom, but for all of us. We should all have the agency to decide how our data is used, and a fairer data economy would mean we could all benefit.
(Score: 3, Insightful) by JoeMerchant on Saturday July 06, @11:50AM (3 children)
...until AI showed the promise of making money, virtually nobody bothered to complain. Now that it has developed to a point that people are willing to pay for it, everybody wants a piece of the action.
It is the same basic principle as business taxes, income taxes, property taxes, protection rackets, royalties and every other tax since the first pimp muscled in on income earned from the oldest profession.
How do we discourage this "taxing" behavior so ingrained in modern society? The usual approach is to put a tax on it, but that's rather meta and seems unlikely to improve the situation.
(Score: 0) by Anonymous Coward on Saturday July 06, @06:19PM (1 child)
People have been paying for it for a while.
(Score: 2) by JoeMerchant on Sunday July 07, @12:27AM
The hype says it's a hockey stick curve and we are near the inflection point. Banks and postal services were paying for OCR automation 40-50 years ago, but the overall market was tiny compared to what is apparently coming.
(Score: 2) by corey on Saturday July 06, @11:28PM
I don’t know if I’d call it taxing, it’s just capitalism doing its thing, fueled by typical ingrained human selfishness.
Agree though, now all the megacorps are clamouring to invest in LLMs so they can monetise them, then sit back, pay off the investments, followed by profit. It’s all hard to watch. I like it when technologies are used to help humanity, but we seem to live in a time when new technologies get sucked up by megacorps as soon as they become viable, and then they use them to abuse people or ruin humanity slightly.
(Score: 5, Insightful) by Ox0000 on Saturday July 06, @11:53AM (4 children)
I see... when it was your or my content that was being scraped, everything was hunky-dory, but now that 'real money^W^Wunrealized profit' is on the line, now things are serious. Ripping off the rubes is fine, "needed for progress" even, but trying to steal things from robber barons, well that's serious and must be stopped!
The current set of hammers that are being called AI is unsustainable, unfit for purpose, and undesirable. Get rid of the lot!
"Generative AI" delenda est!
(Score: 5, Insightful) by JoeMerchant on Saturday July 06, @01:07PM (3 children)
It's our court system. The Robber Barons can fight back in the courts with a wave of their hand to their retained lawyers. Average folks need to sell their house just to start to mount a serious legal challenge.
(Score: 5, Informative) by Thexalon on Saturday July 06, @07:20PM (2 children)
And I'll just add that one of the more consequential SCOTUS decisions you've probably never heard of, AT&T Mobility v. Concepcion, empowers the megacorps to put clauses into their EULAs, employment agreements, and other contracts you pretty much have to agree to in order to function, making it impossible to sue them for any reason. Instead, you are forced into binding arbitration, where the megacorp gets to pick the arbitrator and sends a company-paid lawyer, while you are either representing yourself or paying more for legal representation than the dispute is probably worth. And don't think of joining a class action because you and 30,000 other people were affected by the same misconduct; you signed that away too, so each of you has to bring your case individually. Oh, and they can do this even if your state passed a law saying that they can't.
And it doesn't matter whether you agreed to any of that when you initially signed up, because another clause in these kinds of contracts lets the megacorp change the contract however they like by sending you a single notice somewhere, and your only recourse is to stop doing business with them right then and there.
This is why when people suggest that civil litigation makes the FTC and similar regulatory bodies unnecessary, I can't help but think they're either hopelessly naive or willfully lying.
The only thing that stops a bad guy with a compiler is a good guy with a compiler.
(Score: 2) by JoeMerchant on Saturday July 06, @08:05PM
>paying more for legal representation than the dispute is probably worth.
And an average individual is faced with this once or twice in a lifetime, but a corp like AT&T can repeat the scam millions of times...
AT&T and I got crossed up over a $35 bill that I disagreed with. They thought that was worth "going to the mattresses" for: collection agency calling me at work, black mark (unpaid bill, amount unspecified) on my credit reports... Every time it came up I got to laugh at them with whoever was supposed to think poorly of me for being a "deadbeat". Thing was: in the 30 years since then I never had another bad mark on my credit report, just them. Who do you think is the bad actor in this case?
(Score: 3, Insightful) by Ox0000 on Saturday July 06, @09:38PM
But... but... but... contracts are something that you enter entirely of your own volition. If you don't like the terms, you should not enter the contract, and people who do are dumb. And if you do and get screwed, that's totally your own fault!
</sarcasm> for those frequenters of this site who are libertar^W still on the path of finding out how the world really works.
(Score: 2) by looorg on Saturday July 06, @01:57PM
Didn't Microsoft just days ago claim stuff on the interwebz was free? That said, perhaps they are running out of quality data to train their AI with. They have run out of free cruddy data. The Shit In, Shit Out phase is over and it's not producing desired results as far as their precious AI is concerned? Or naturally all the content producers, aka the free training material producers, are getting wise and all now want a piece of the pie. You apparently get what you pay for.
(Score: 1, Interesting) by Anonymous Coward on Saturday July 06, @08:07PM (1 child)
Since this has started, I've noticed huge spikes in transfer bandwidth over and over again, to places I've never heard of before, transferring almost 2 TB of data per day to the same places over and over again.
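If anyone wants to check where their own data is going, a minimal sketch of one way to do it: sniff outbound packets and tally bytes per remote host. This assumes Python 3 with scapy installed and root privileges; the local address prefix and the 60-second capture window are placeholders to adjust, not anything measured here.

    from collections import Counter
    from scapy.all import IP, sniff

    LOCAL_PREFIX = "192.168."   # assumption: set to your own LAN prefix
    bytes_by_dest = Counter()

    def tally(pkt):
        # Count bytes of packets leaving the local network, keyed by destination.
        if IP in pkt and pkt[IP].src.startswith(LOCAL_PREFIX):
            bytes_by_dest[pkt[IP].dst] += len(pkt)

    # Capture for 60 seconds, then print the ten biggest destinations.
    sniff(prn=tally, store=False, timeout=60)
    for dest, nbytes in bytes_by_dest.most_common(10):
        print(f"{dest:15s} {nbytes / 1e6:10.2f} MB")

Run it during one of those spikes and the "same places over and over" should show up at the top of the list.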
(Score: 2) by acid andy on Saturday July 06, @10:18PM
Data transferring from where? Browser Javascript on websites you visit? Or the OS?
Consumerism is poison.
(Score: 2) by oumuamua on Sunday July 07, @04:48PM (1 child)
Everyone thinks their data is so valuable when in fact even someone like the NYTimes is 0.0000001% of the training set. Everyone is grabbing for a piece of the pie. Let's hope capitalism and this scarcity mindset do not get baked into AGI when it emerges: https://www.genolve.com/design/socialmedia/memes/writers-artists-want-compensation-for-training-the-LLM [genolve.com]
(Score: 2) by Ox0000 on Monday July 08, @08:23PM
Your argument seems to come down to: "It's fine for corporations to steal .0001 cents from every individual because it's only a small amount of money that each person loses; who cares if their cumulative haul is in the hundreds of billions of ill-gotten gains".
That doesn't seem right to me as it excuses theft. Theft is still theft. And that's what these LLM companies are doing: thievery.
(Score: 3, Interesting) by Dale on Monday July 08, @01:22PM
I think the solution to this overall issue is already out there. We have copyright for a reason. It is there so that after a limited period of time things enter the public domain for the use and benefit of all. If AI training isn't the poster child for what the public domain is supposed to be for, then I don't know what is. Let that be the line on what can and cannot be used. It would also have the side benefit of us getting to watch the tech industry go to war with the copyright industry, and maybe we get copyrights back down to something reasonable. In all the stories on this, I have been shocked that I haven't seen this very recommendation or point discussed.