A proposed set of rules by the European Union would, among other things, require makers of generative AI tools such as ChatGPT to disclose any copyrighted material used by the technology platforms to create content of any kind.
A new draft of the European Parliament's legislation, a copy of which was obtained by The Wall Street Journal, would allow the original creators of content used by generative AI applications to share in any profits that result.
The European Union's "Artificial Intelligence Act" (AI Act) is the first of its kind from a Western group of nations. The proposed legislation relies heavily on existing rules, such as the General Data Protection Regulation (GDPR), the Digital Services Act, and the Digital Markets Act. The AI Act was originally proposed by the European Commission in April 2021.
The bill's provisions also require that the large language models (LLMs) behind generative AI tech, such as GPT-4, be designed with adequate safeguards against generating content that violates EU laws; that could include child pornography or, in some EU countries, denial of the Holocaust, according to The Washington Post.
[...] But the solution to keeping AI honest isn't easy, according to Avivah Litan, a vice president and distinguished analyst at Gartner Research. It's likely that LLM creators, such as San Francisco-based OpenAI and others, will need to develop powerful LLMs to check that the models trained initially contain no copyrighted materials. Rules-based systems to filter out copyrighted materials are likely to be ineffective, Litan said.
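To see why, here is a minimal sketch of what a rules-based filter might look like; the n-gram length and match threshold are illustrative assumptions, and the fact that light paraphrasing defeats exact n-gram matching is one reason such systems are expected to fall short:

```python
# Illustrative sketch only: a naive rules-based filter that flags text
# sharing long word n-grams with a known copyrighted corpus. The n-gram
# length and threshold are arbitrary assumptions for the example.

def ngrams(text: str, n: int = 8):
    """Yield word-level n-grams from a text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i : i + n])

def build_index(copyrighted_docs: list[str], n: int = 8) -> set[str]:
    """Index every n-gram appearing in the protected corpus."""
    index: set[str] = set()
    for doc in copyrighted_docs:
        index.update(ngrams(doc, n))
    return index

def looks_copyrighted(candidate: str, index: set[str], n: int = 8,
                      threshold: float = 0.1) -> bool:
    """Flag a candidate if too many of its n-grams match the index."""
    grams = list(ngrams(candidate, n))
    hits = sum(1 for g in grams if g in index)
    return bool(grams) and hits / len(grams) >= threshold
```

Trivially rewording a protected passage changes almost every 8-gram, so the filter misses it, while common stock phrases can trigger false positives.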
[...] Regulators should consider that LLMs are effectively operating as a black box, she said, and it's unlikely that the algorithms will provide organizations with the needed transparency to conduct the requisite privacy impact assessment. "This must be addressed," Litan said.
"It's interesting to note that at one point the AI Act was going to exclude oversight of Generative AI models, but they were included later," Litan said "Regulators generally want to move carefully and methodically so that they don't stifle innovation and so that they create long-lasting rules that help achieve the goals of protecting societies without being overly prescriptive in the means."
[...] "The US and the EU are aligned in concepts when it comes to wanting to achieve trustworthy, transparent, and fair AI, but their approaches have been very different," Litan said.
So far, the US has taken what Litan called a "very distributed approach to AI risk management," and it has yet to create new regulations or regulatory infrastructure. The US has focused instead on guidelines and an AI Risk Management Framework.
[...] Key to the EU's AI Act is a classification system that determines the level of risk an AI technology could pose to the health and safety or fundamental rights of a person. The framework includes four risk tiers: unacceptable, high, limited, and minimal, according to the World Economic Forum.
[...] While AI has been around for decades, it has "reached new capacities fueled by computing power," Thierry Breton, the EU's Commissioner for Internal Market, said in a statement in 2021. The Artificial Intelligence Act, he said, was created to ensure that "AI in Europe respects our values and rules, and harness the potential of AI for industrial use."
Related:
Yet Again, the Copyright Industry Demands to be Shielded From Technological Progress
Inside the Secret List of Websites That Make AI Like ChatGPT Sound Smart
Bad News: Copyright Industry Attacks on the Internet's Plumbing are Increasing – and Succeeding
Stable Diffusion Copyright Lawsuits Could be a Legal Earthquake for AI
Paper: Stable Diffusion "Memorizes" Some Images, Sparking Privacy Concerns
Related Stories
Paper: Stable Diffusion "Memorizes" Some Images, Sparking Privacy Concerns:
On Monday, a group of AI researchers from Google, DeepMind, UC Berkeley, Princeton, and ETH Zurich released a paper outlining an adversarial attack that can extract a small percentage of training images from latent diffusion AI image synthesis models like Stable Diffusion. It challenges views that image synthesis models do not memorize their training data and that training data might remain private if not disclosed. But out of 300,000 high-probability images tested, researchers found a memorization rate of only 0.03%.
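A simplified sketch of such an extraction test (not the paper's exact method) would repeatedly sample generations for a caption known to be duplicated in the training set and flag outputs that are perceptual near-duplicates of the training image. The model ID, local file name, and hash-distance threshold below are all illustrative assumptions:

```python
import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers
from PIL import Image
import imagehash  # pip install imagehash

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical local copy of a training image whose caption is known
# to be heavily duplicated in the training set.
target = imagehash.phash(Image.open("ann_graham_lotz.jpg"))

prompt = "Ann Graham Lotz"  # the example widely cited in coverage of the paper
memorized = []
for seed in range(100):  # the paper sampled far more generations
    gen = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=gen).images[0]
    # A small perceptual-hash (Hamming) distance suggests the model
    # reproduced its training image rather than synthesizing a new one.
    if imagehash.phash(image) - target <= 4:
        memorized.append(seed)

print(f"{len(memorized)} of 100 generations look memorized")
```

The rarer a caption is in the training set, the less likely this test is to find anything, which is consistent with the 0.03% rate above.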
Recently, AI image synthesis models have been the subject of intense ethical debate and even legal action. Proponents and opponents of generative AI tools regularly argue over the privacy and copyright implications of these new technologies. Adding fuel to either side of the argument could dramatically affect potential legal regulation of the technology, and as a result, this latest paper, authored by Nicholas Carlini et al., has perked up ears in AI circles.
Related:
Getty Images Targets AI Firm For 'Copying' Photos
The AI software Stable Diffusion has a remarkable ability to turn text into images. When I asked the software to draw "Mickey Mouse in front of a McDonald's sign," for example, it generated the picture you see above.
Stable Diffusion can do this because it was trained on hundreds of millions of example images harvested from across the web. Some of these images were in the public domain or had been published under permissive licenses such as Creative Commons. Many others were not—and the world's artists and photographers aren't happy about it.
In January, three visual artists filed a class-action copyright lawsuit against Stability AI, the startup that created Stable Diffusion. In February, the image-licensing giant Getty filed a lawsuit of its own.
[...]
The plaintiffs in the class-action lawsuit describe Stable Diffusion as a "complex collage tool" that contains "compressed copies" of its training images. If this were true, the case would be a slam dunk for the plaintiffs.

But experts say it's not true. Eric Wallace, a computer scientist at the University of California, Berkeley, told me in a phone interview that the lawsuit had "technical inaccuracies" and was "stretching the truth a lot." Wallace pointed out that Stable Diffusion is only a few gigabytes in size—far too small to contain compressed copies of all or even very many of its training images.
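The size argument is back-of-the-envelope arithmetic. Using approximate public figures (roughly 2.3 billion LAION training images and about 4 GB of model weights, both assumptions here):

```python
# Approximate figures: ~4 GB of weights, ~2.3 billion training images.
model_bytes = 4 * 1024**3
training_images = 2.3e9
print(f"{model_bytes / training_images:.1f} bytes per training image")
# => ~1.9 bytes per image; even an aggressively compressed thumbnail
# needs thousands of bytes, so the weights cannot hold "compressed
# copies" of any meaningful fraction of the training set.
```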
Related:
Ethical AI art generation? Adobe Firefly may be the answer. (20230324)
Paper: Stable Diffusion "Memorizes" Some Images, Sparking Privacy Concerns (20230206)
Getty Images Targets AI Firm For 'Copying' Photos (20230117)
Pixel Art Comes to Life: Fan Upgrades Classic MS-DOS Games With AI (20220904)
A Startup Wants to Democratize the Tech Behind DALL-E 2, Consequences be Damned (20220817)
Bad news: copyright industry attacks on the Internet's plumbing are increasing – and succeeding:
Back in October 2021, Walled Culture wrote about a ruling from a US judge. It concerned an attempt to make the content delivery network (CDN) Cloudflare, which is simply part of the Internet's plumbing, responsible for what flows through its connections. The judge rightly decided: "a reasonable jury could not – at least on this record – conclude that Cloudflare materially contributes to the underlying copyright infringement".
A similar case in Germany was brought by Sony Music against the free, recursive, anycast DNS platform Quad9. Like CDNs, DNS platforms are crucial services that ensure that the Internet can function smoothly; they are not involved with any of the sites that may be accessed as a result of their services. In particular, they have no knowledge of whether copyright material on those sites is authorised or not. Unfortunately, two regional courts in Germany don't seem to understand that point, and have issued judgments against Quad9. Its FAQ on one of the cases explains why this is a dreadful result for the entire Internet:
The court argues with the German law principle of "interferer liability", the so-called "Stoererhaftung", which allows holding uninvolved third parties liable for an infringement if they have in some way adequately and causally contributed to the infringement of a protected legal interest. If DNS resolvers can be held liable as interferers, this would set a dangerous precedent for all services used in retrieving web pages. Providers of browsers, operating systems or antivirus software could be held liable as interferers on the same grounds if they do not prevent the accessibility of copyright-infringing websites.
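The technical point is easy to demonstrate: a recursive resolver such as Quad9 does nothing but translate a hostname into an IP address. A minimal illustration using the dnspython library (the domain is a placeholder):

```python
import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["9.9.9.9"]  # Quad9's public resolver address

# The resolver's entire job: map a name to an address. It never sees
# the pages served from that address, so it cannot know whether any
# content there infringes copyright.
answer = resolver.resolve("example.com", "A")  # placeholder domain
for record in answer:
    print(record.address)
```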
Now an Italian court has confirmed a previous ruling that Cloudflare must block certain online sites accused of making available unauthorised copies of material. That's unfortunate, since taken with the German court rulings it is likely to encourage the copyright industry to widen its attack on the Internet's plumbing, regardless of the wider harm this is likely to cause.
Inside the secret list of websites that make AI like ChatGPT sound smart:
AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.
Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI's main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it's probably because its training data included thousands of LSAT practice sites.
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.
To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many "tokens" appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
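As a rough sketch of that ranking step (not The Post's actual pipeline), one could stream the public C4 snapshot, attribute each document's token count to its source domain, and sort. The whitespace split below stands in for a real tokenizer:

```python
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset  # pip install datasets

# C4 records carry "text", "url", and "timestamp" fields; streaming
# avoids downloading the full multi-terabyte snapshot.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

token_counts: Counter = Counter()
for i, doc in enumerate(c4):
    domain = urlparse(doc["url"]).netloc
    token_counts[domain] += len(doc["text"].split())  # crude tokenizer
    if i >= 100_000:  # small sample; the full set has ~365M documents
        break

for domain, tokens in token_counts.most_common(20):
    print(f"{tokens:>12,}  {domain}")
```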
Back in October last year, Walled Culture was one of the first blogs to point out the huge impact that generative AI would have not only on copyright but also on creativity itself. Since then, the world seems to have split into two camps: one believes that generative AI will revolutionise everything and create some kind of golden age; the other thinks the whole thing is a complete sham and/or will destroy civilisation.
The new AI systems certainly have massive problems, not least in the sphere of privacy, as I have written about elsewhere. But the response by the copyright world to generative AI is increasingly extreme, rather as a Walled Culture post back in February warned it might be. The latest manifestation of that tendency is a "Call for Safeguards Around Generative AI in the European AI Act" from "over 40 associations and trade unions that joined the Authors' Rights Initiative". It is a typical anti-technology, anti-progress set of demands from the copyright industry. Its signatories "demand" regulation of generative AI, and they demand it "NOW" (sic).
The document throws in just about every recent criticism of generative AI, some of them undoubtedly quite justified. But those criticisms are largely beside the point, because the letter is really about one thing: copyright, and shielding it from the latest technological advances. [...]
[...] the new document has an entire section devoted to what it calls "The EU's misguided text-and-data mining exemption". Part of it tries to address the argument (made by this blog too) that "use of copyright protected material to train generative AI should be permissible because such training would be equivalent to the (lawful) use of works to get 'inspired'":
(Score: 4, Interesting) by Barenflimski on Wednesday May 03, @01:42PM (1 child)
The first regulation is with regard to copyrights? Google has been using their algorithms to scrape the web for two decades. Now these bots can't do the same?
I think it's pretty clear. These folks care little about the world burning as long as they all make a crap ton of money. The only reason one would do this is to lock in the big players and slow down competition.
I have zero problems with these AI bots so far. All they do is regurgitate what they've been trained on. If one doesn't place these things on a pedestal, treating them like all-knowing gods, I think we're all fine.
What worries me is this instant push by the talkers about how these things are sentient, smart, and 'like humans but without the flaws.' It seems to me that all these folks would rather trust a bot than their fellow human. It's like they've drunk the same kool-aid they've been spewing about how terrible everything and everyone else is. While the news is bad, the people I meet on a daily basis are kind, witty, fun, positive, and don't short circuit when having a beer.
If these lawmakers gave half a shit, they'd create regulations around pairing these things with robots that actually DO something. Maybe they could even rein in the people that continually gaslight the world?
(Score: 3, Insightful) by DeathMonkey on Wednesday May 03, @06:55PM
Google already does publish the list of material used because they link you to the site it's on. And I would generally consider it fair use because it's a snippet used in furtherance of describing the content at the link.
As copyright law stands now these chat bots should probably be fully disallowed from using any copyrighted materials on the internet to train their model because that is then creating a derivative work.
So this sounds like a compulsory licensing scheme, like ASCAP's, to allow some public usage of the data in exchange for a share of any proceeds.
Sounds like a pretty good idea to me, actually.
(Score: 4, Insightful) by hendrikboom on Wednesday May 03, @02:11PM (4 children)
Set up filters on their output? Nonsense. All they need to do is just not train them on copyrighted content -- unless they have a license to use it and respect the license's terms.
There is a lot of copyright-free material available. Just download Project Gutenberg, for example.
Some newspapers would likely be happy to accept appropriate payments for the use of their archives.
(Score: 2) by RamiK on Wednesday May 03, @02:55PM (3 children)
You can't stop people from downloading SoftVC VITS [github.com], training it on music recordings on their own, and releasing this [youtube.com] or that [youtube.com].
(Score: 2) by ilsa on Wednesday May 03, @03:19PM
No, you can't. However, those people are typically doing it for personal use, so the scope of impact is minuscule compared to for-profit corporations making money from other people's stuff without attribution or compensation. So much so that it falls under completely different laws (as I understand it).
If however they are trying to monetize what they produce... I don't know enough about the grey areas of copyright and contract law to say where the lines need to be drawn.
(Score: 2) by hendrikboom on Wednesday May 03, @04:51PM (1 child)
That's a matter of enforcement, not legality.
It's probably easier to enforce on large projects than on home recording.
(Score: 2) by RamiK on Wednesday May 03, @05:34PM
The very point of the proposed legislation is to require software vendors to enforce copyrights instead of going after the actual violators (the people who make use of the AI to generate content and the platforms that distribute said content).
How are you going to prevent an AI project of any size from stripping metadata? This is all gatekeeping to raise the costs of LLMs, with lawyers writing billable-hours-friendly rules.
(Score: 5, Interesting) by DannyB on Wednesday May 03, @02:53PM
Copyright, especially practically unlimited length copyright, seems fundamentally incompatible with open and free access to information, archival, compilations of historical material and recordings, access to research, and many other things I could go on about.
Copyright is used to take down videos simply because one participant makes sure to play a bit of music, so that a recording of their abhorrent actions can be taken down using DMCA mechanisms.
The DMCA was a travesty when it was first proposed. Now we are just accustomed to it. DMCA is frequently used to suppress speech. People routinely and FALSELY swear under penalty of perjury that they are the copyright agent and represent the copyright owner and that this DMCA take down is over copyright infringement.
I'm not against someone being able to profit from the investment in creating a work covered by copyright. But things have gotten way WAY out of hand.
The fact that copy protection exists all the way along the digital chain to your HDMI connector on your TV set, and then into the TV itself, should be eye opening.
Maybe we should not be trying to limit crawling the web, but instead trying to rein in the ever-expanding reach of copyright.
The Internet Archive may have to shut down. Digital works that you own do not seem to enjoy the first-sale exhaustion that physical products do. Apparently I can't sell you my mp3 collection, even if I destroy all my copies.
Let us not rein in AI. Let us rein in Copyright. The only thing we need to rein in about AI is its misuse and the possible dangers it can create. But not the fact that it may have been taught on materials covered by copyright. You and I were educated and have filled our brains with vast amounts of materials covered by copyright.
(Score: 2) by legont on Thursday May 04, @05:13AM
Places that don't respect copyright - or Chinese firewall for that matter - will end up with orders of magnitude smarter AIs and will win everything from education to economy to military.
So, you either let freedom be or die. I hope I'll still have time to enjoy the show.
"Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.