The European Union is writing legislation that would hold accountable companies that create generative AI platforms:
A proposed set of rules by the European Union would, among other things, require makers of generative AI tools such as ChatGPT to publicize any copyrighted material used by the technology platforms to create content of any kind.
A new draft of the European Parliament's legislation, a copy of which was obtained by The Wall Street Journal, would allow the original creators of content used by generative AI applications to share in any profits that result.
The European Union's "Artificial Intelligence Act" (AI Act) is the first of its kind by a western set of nations. The proposed legislation relies heavily on existing rules, such as the General Data Protection Regulation (GDPR), the Digital Services Act, and the Digital Markets Act. The AI Act was originally proposed by the European Commission in April 2021.
The bill's provisions also require that the large language models (LLMs) behind generative AI tech, such as GPT-4, be designed with adequate safeguards against generating content that violates EU laws; that could include child pornography or, in some EU countries, denial of the Holocaust, according to The Washington Post.
[...] But the solution to keeping AI honest isn't easy, according to Avivah Litan, a vice president and distinguished analyst at Gartner Research. It's likely that LLM creators, such as San Francisco-based OpenAI and others, will need to develop powerful LLMs to check that the ones trained initially have no copyrighted materials. Rules-based systems to filter out copyrighted materials are likely to be ineffective, Litan said.
[...] Regulators should consider that LLMs are effectively operating as a black box, she said, and it's unlikely that the algorithms will provide organizations with the needed transparency to conduct the requisite privacy impact assessment. "This must be addressed," Litan said.
"It's interesting to note that at one point the AI Act was going to exclude oversight of Generative AI models, but they were included later," Litan said. "Regulators generally want to move carefully and methodically so that they don't stifle innovation and so that they create long-lasting rules that help achieve the goals of protecting societies without being overly prescriptive in the means."
[...] "The US and the EU are aligned in concepts when it comes to wanting to achieve trustworthy, transparent, and fair AI, but their approaches have been very different," Litan said.
So far, the US has taken what Litan called a "very distributed approach to AI risk management," and it has yet to create new regulations or regulatory infrastructure. The US has focused on guidelines and an AI Risk Management framework.
[...] Key to the EU's AI Act is a classification system that determines the level of risk an AI technology could pose to the health and safety or fundamental rights of a person. The framework includes four risk tiers: unacceptable, high, limited, and minimal, according to the World Economic Forum.
[...] While AI has been around for decades, it has "reached new capacities fueled by computing power," Thierry Breton, the EU's Commissioner for Internal Market, said in a statement in 2021. The Artificial Intelligence Act, he said, was created to ensure that "AI in Europe respects our values and rules, and harness the potential of AI for industrial use."
But out of 300,000 high-probability images tested, researchers found a 0.03% memorization rate:
On Monday, a group of AI researchers from Google, DeepMind, UC Berkeley, Princeton, and ETH Zurich released a paper outlining an adversarial attack that can extract a small percentage of training images from latent diffusion AI image synthesis models like Stable Diffusion. It challenges views that image synthesis models do not memorize their training data and that training data might remain private if not disclosed.
Recently, AI image synthesis models have been the subject of intense ethical debate and even legal action. Proponents and opponents of generative AI tools regularly argue over the privacy and copyright implications of these new technologies. Adding fuel to either side of the argument could dramatically affect potential legal regulation of the technology, and as a result, this latest paper, authored by Nicholas Carlini et al., has perked up ears in AI circles.
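To put the headline's figures in perspective, the short sketch below works out how many images a 0.03% memorization rate implies for a sample of 300,000. Both numbers are the rounded values quoted in the headline above, so the result is only an approximation; the function name `implied_count` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction


def implied_count(sample_size: int, rate_percent: Fraction) -> Fraction:
    """Return how many items a percentage rate implies for a given sample size."""
    return sample_size * rate_percent / 100


# Rounded figures taken from the headline; exact counts in the paper may differ.
images_tested = 300_000
memorization_rate_percent = Fraction(3, 100)  # 0.03%

memorized = implied_count(images_tested, memorization_rate_percent)
print(f"about {int(memorized)} of {images_tested:,} images")
```

Exact fractions are used instead of floats so that the percentage arithmetic does not pick up rounding noise. On these rounded inputs, the rate corresponds to roughly 90 extracted images.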
The AI software Stable Diffusion has a remarkable ability to turn text into images. When I asked the software to draw "Mickey Mouse in front of a McDonald's sign," for example, it generated the picture you see above.
Stable Diffusion can do this because it was trained on hundreds of millions of example images harvested from across the web. Some of these images were in the public domain or had been published under permissive licenses such as Creative Commons. Many others were not—and the world's artists and photographers aren't happy about it.
In January, three visual artists filed a class-action copyright lawsuit against Stability AI, the startup that created Stable Diffusion. In February, the image-licensing giant Getty filed a lawsuit of its own.
The plaintiffs in the class-action lawsuit describe Stable Diffusion as a "complex collage tool" that contains "compressed copies" of its training images. If this were true, the case would be a slam dunk for the plaintiffs.
But experts say it's not true. Eric Wallace, a computer scientist at the University of California, Berkeley, told me in a phone interview that the lawsuit had "technical inaccuracies" and was "stretching the truth a lot." Wallace pointed out that Stable Diffusion is only a few gigabytes in size, far too small to contain compressed copies of all or even very many of its training images.
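Wallace's size argument can be checked with a back-of-the-envelope calculation. The sketch below divides a model size by a training-set size to get the average storage budget per image. The specific values (4 GB and 500 million images) are illustrative assumptions chosen to match "a few gigabytes" and "hundreds of millions", not figures from the article, and `bytes_per_image` is a hypothetical helper.

```python
def bytes_per_image(model_size_bytes: int, num_training_images: int) -> float:
    """Average number of model bytes available per training image."""
    if num_training_images <= 0:
        raise ValueError("num_training_images must be positive")
    return model_size_bytes / num_training_images


# Assumed, illustrative values; the article only says "a few gigabytes"
# and "hundreds of millions" of images.
ASSUMED_MODEL_SIZE_BYTES = 4 * 10**9
ASSUMED_TRAINING_IMAGES = 500 * 10**6

budget = bytes_per_image(ASSUMED_MODEL_SIZE_BYTES, ASSUMED_TRAINING_IMAGES)
print(f"{budget:.1f} bytes per training image")
```

Under these assumptions the budget is on the order of a few bytes per image, far less than even a heavily compressed thumbnail would need, which is the point Wallace makes.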
Bad news: copyright industry attacks on the Internet's plumbing are increasing – and succeeding:
Back in October 2021, Walled Culture wrote about a ruling from a US judge. It concerned an attempt to make the content delivery network (CDN) Cloudflare, which is simply part of the Internet's plumbing, responsible for what flows through its connections. The judge rightly decided: "a reasonable jury could not – at least on this record – conclude that Cloudflare materially contributes to the underlying copyright infringement".
A similar case in Germany was brought by Sony Music against the free, recursive, anycast DNS platform Quad9. Like CDNs, DNS platforms are crucial services that ensure that the Internet can function smoothly; they are not involved with any of the sites that may be accessed as a result of their services. In particular, they have no knowledge of whether copyright material on those sites is authorised or not. Unfortunately, two regional courts in Germany don't seem to understand that point, and have issued judgments against Quad9. Its FAQ on one of the cases explains why this is a dreadful result for the entire Internet:
The court argues with the German law principle of "interferer liability", the so-called "Stoererhaftung", which allows holding uninvolved third parties liable for an infringement if they have in some way adequately and causally contributed to the infringement of a protected legal interest. If DNS resolvers can be held liable as interferers, this would set a dangerous precedent for all services used in retrieving web pages. Providers of browsers, operating systems or antivirus software could be held liable as interferers on the same grounds if they do not prevent the accessibility of copyright-infringing websites.
Now an Italian court has confirmed a previous ruling that Cloudflare must block certain online sites accused of making available unauthorised copies of material. That's unfortunate, since taken with the German court rulings it is likely to encourage the copyright industry to widen its attack on the Internet's plumbing, regardless of the wider harm this is likely to cause.
Inside the secret list of websites that make AI like ChatGPT sound smart:
AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.
Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI's main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it's probably because its training data included thousands of LSAT practice sites.
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.
To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many "tokens" appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
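The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the Post's actual pipeline: it assumes the data arrives as (domain, text) pairs and uses whitespace splitting as a stand-in for a real tokenizer. The function name `rank_sites_by_tokens` and the sample domains are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_sites_by_tokens(documents: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Sum token counts per domain and return domains sorted from most to fewest tokens."""
    totals: Counter = Counter()
    for domain, text in documents:
        # Assumption: whitespace-separated words approximate tokens.
        totals[domain] += len(text.split())
    return totals.most_common()


# Hypothetical sample data, for illustration only.
sample = [
    ("example.org", "a short page about open data"),
    ("example.com", "one two three"),
    ("example.org", "another page"),
]

ranking = rank_sites_by_tokens(sample)
for domain, token_count in ranking:
    print(domain, token_count)
```

A production tokenizer would split text into subword units rather than whole words, so absolute counts would differ, but the aggregate-then-sort structure would stay the same.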
Yet again, the copyright industry demands to be shielded from technological progress – and the future:
Back in October last year, Walled Culture was one of the first blogs to point out the huge impact that generative AI would have not only on copyright but also on creativity itself. Since then, the world seems to have split into two camps. One believes that generative AI will revolutionise everything and create some kind of golden age; the other thinks the whole thing is a complete sham and/or will destroy civilisation.
The new AI systems certainly have massive problems, not least in the sphere of privacy, as I have written about elsewhere. But the response by the copyright world to generative AI is increasingly extreme, rather as a Walled Culture post back in February warned it might be. The latest manifestation of that tendency is a "Call for Safeguards Around Generative AI in the European AI Act" from "over 40 associations and trade unions that joined the Authors' Rights Initiative". It is a typical anti-technology, anti-progress set of demands from the copyright industry. Its signatories "demand" regulation of generative AI, and they demand it "NOW" (sic).
The document throws in just about every recent criticism of generative AI, some of them undoubtedly quite justified. But those criticisms are largely beside the point, because the letter is really about one thing: copyright, and shielding it from the latest technological advances. [...]
[...] the new document has an entire section devoted to what it calls "The EU's misguided text-and-data mining exemption". Part of it tries to address the argument (made by this blog too) that "use of copyright protected material to train generative AI should be permissible because such training would be equivalent to the (lawful) use of works to get 'inspired'":