Bruce Schneier, the cryptographer and privacy specialist, and J B Branch, an accountability advocate at Public Citizen, have written a post about AI and the Corporate Capture of Knowledge. They raise hard questions about what happened to Aaron Swartz in the context of what is going on with artificial intelligence, copyright, and ultimately the control of knowledge:
As AI becomes a larger part of America's economy, one can see the writing on the wall. Judges will twist themselves into knots to justify an innovative technology premised on literally stealing the works of artists, poets, musicians, all of academia and the internet, and vast expanses of literature. But if Swartz's actions were criminal, it is worth asking: What standard are we now applying to AI companies?
The question is not simply whether copyright law applies to AI. It is why the law appears to operate so differently depending on who is doing the extracting and for what purpose.
The stakes extend beyond copyright law or past injustices. They concern who controls the infrastructure of knowledge going forward and what that control means for democratic participation, accountability and public trust.
The questions they raise matter because the foundation of democracy is the ability to make informed decisions through participation in civilized, well-rounded discussions. The prerequisite for that is knowledge.
(2025) OpenAI Desperate to Avoid Explaining Why It Deleted Pirated Book Datasets
(2025) Meta Pirated and Seeded Porn for Years to Train AI, Lawsuit Says
(2025) Creating AI Based Entirely on Ethically-Sourced Data
(2025) Copyright Office Thinks AI Companies Sometimes Stole Content
(2024) OpenAI Whistleblower Found Dead in San Francisco Apartment
(2024) OpenAI Blamed NYT for Tech Problem Erasing Evidence of Copyright Abuse
(2024) AI Companies Are Finally Being Forced To Cough Up For Training Data
The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
The generative AI boom is built on scale. The more training data, the more powerful the model.
But there’s a problem. AI companies have pillaged the internet for training data, and many websites and data set owners have started restricting the ability to scrape their websites. We’ve also seen a backlash against the AI sector’s practice of indiscriminately scraping online data, in the form of users opting out of making their data available for training and lawsuits from artists, writers, and the New York Times, claiming that AI companies have taken their intellectual property without consent or compensation.
Last week three major record labels—Sony Music, Warner Music Group, and Universal Music Group—announced they were suing the AI music companies Suno and Udio over alleged copyright infringement. The music labels claim the companies made use of copyrighted music in their training data “at an almost unimaginable scale,” allowing the AI models to generate songs that “imitate the qualities of genuine human sound recordings.”
But this moment also sets an interesting precedent for all of generative AI development. Thanks to the scarcity of high-quality data and the immense pressure and demand to build even bigger and better models, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free.
It will likely take a few years at least before we have legal clarity around copyright law, fair use, and AI training data. But the cases are already ushering in changes. OpenAI has been striking deals with news publishers such as Politico, the Atlantic, Time, the Financial Times, and others, and exchanging publishers’ news archives for money and citations. And YouTube announced in late June that it will offer licensing deals to top record labels in exchange for music for training.
OpenAI keeps deleting data that could allegedly prove the AI company violated copyright laws by training ChatGPT on authors' works. Apparently largely unintentional, the sloppy practice is seemingly dragging out early court battles that could determine whether AI training is fair use.
Most recently, The New York Times accused OpenAI of unintentionally erasing programs and search results that the newspaper believed could be used as evidence of copyright abuse.
The NYT apparently spent more than 150 hours extracting training data, while following a model inspection protocol that OpenAI set up precisely to avoid conducting potentially damning searches of its own database. This process began in October, but by mid-November, the NYT discovered that some of the data gathered had been erased due to what OpenAI called a "glitch."
Looking to update the court about potential delays in discovery, the NYT asked OpenAI to collaborate on a joint filing admitting the deletion occurred. But OpenAI declined, instead filing a separate response calling the newspaper's accusation that evidence was deleted "exaggerated" and blaming the NYT for the technical problem that triggered the data deleting.
OpenAI denied deleting "any evidence," instead admitting only that file-system information was "inadvertently removed" after the NYT requested a change that resulted in "self-inflicted wounds." According to OpenAI, the tech problem emerged because NYT was hoping to speed up its searches and requested a change to the model inspection set-up that OpenAI warned "would yield no speed improvements and might even hinder performance."
The AI company accused the NYT of negligence during discovery, "repeatedly running flawed code" while conducting searches of URLs and phrases from various newspaper articles and failing to back up their data. Allegedly the change that NYT requested "resulted in removing the folder structure and some file names on one hard drive," which "was supposed to be used as a temporary cache for storing OpenAI data, but evidently was also used by Plaintiffs to save some of their search results (apparently without any backups)."
Once OpenAI figured out what happened, data was restored, OpenAI said. But the NYT alleged that the only data that OpenAI could recover did "not include the original folder structure and original file names" and therefore "is unreliable and cannot be used to determine where the News Plaintiffs' copied articles were used to build Defendants' models."
In response, OpenAI suggested that the NYT could simply take a few days and re-run the searches, insisting, "contrary to Plaintiffs' insinuations, there is no reason to think that the contents of any files were lost." But the NYT does not seem happy about having to retread any part of model inspection, continually frustrated by OpenAI's expectation that plaintiffs must come up with search terms when OpenAI understands its models best.
OpenAI claimed that it has consulted on search terms and been "forced to pour enormous resources" into supporting the NYT's model inspection efforts while continuing to avoid saying how much it's costing. Previously, the NYT accused OpenAI of seeking to profit off these searches, attempting to charge retail prices instead of being transparent about actual costs.
Now, OpenAI appears to be more willing to conduct searches on behalf of NYT that it previously sought to avoid. In its filing, OpenAI asked the court to order news plaintiffs to "collaborate with OpenAI to develop a plan for reasonable, targeted searches to be executed either by Plaintiffs or OpenAI."
How that might proceed will be discussed at a hearing on December 3. OpenAI said it was committed to preventing future technical issues and was "committed to resolving these issues efficiently and equitably."
A former OpenAI researcher known for whistleblowing the blockbuster artificial intelligence company facing a swell of lawsuits over its business model has died, authorities confirmed this week.
Suchir Balaji, 26, was found dead inside his Buchanan Street apartment on Nov. 26, San Francisco police and the Office of the Chief Medical Examiner said. Police had been called to the Lower Haight residence at about 1 p.m. that day, after receiving a call asking officers to check on his well-being, a police spokesperson said.
The medical examiner's office determined the manner of death to be suicide and police officials this week said there is "currently, no evidence of foul play."
[...] In a Nov. 18 letter filed in federal court, attorneys for The New York Times named Balaji as someone who had "unique and relevant documents" that would support their case against OpenAI. He was among at least 12 people — many of them past or present OpenAI employees — the newspaper had named in court filings as having material helpful to their case, ahead of depositions.
The head of the US Copyright Office has reportedly been fired, the day after the agency concluded that AI model builders' use of copyrighted material went beyond existing doctrines of fair use.
The office’s opinion on fair use came in a draft of the third part of its report on copyright and artificial intelligence. The first part considered digital replicas and the second tackled whether it is possible to copyright the output of generative AI.
The office published the draft [PDF] of Part 3, which addresses the use of copyrighted works in the development of generative AI systems, on May 9th.
The draft notes that generative AI systems “draw on massive troves of data, including copyrighted works” and asks: “Do any of the acts involved require the copyright owners’ consent or compensation?”
That question is the subject of several lawsuits, because developers of AI models have admitted to training their products on content scraped from the internet and other sources without compensating content creators or copyright owners. AI companies have argued fair use provisions of copyright law mean they did no wrong.
As the report notes, one test courts use to determine fair use considers “the effect of the use upon the potential market for or value of the copyrighted work”. If a judge finds an AI company’s use of copyrighted material doesn’t impact a market or value, fair use will apply.
The report finds AI companies can’t sustain a fair use defense in the following circumstances:
When a model is deployed for purposes such as analysis or research… the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.
The office will soon publish a final version of Part 3 that it expects will emerge “without any substantive changes expected in the analysis or conclusions.”
The Tech Industry Said It Was "Impossible" to Create AI Based Entirely on Ethically-Sourced Data, So These Scientists Proved Them Wrong in Spectacular Fashion:
A team of more than two dozen AI researchers from MIT, Cornell University, the University of Toronto, and other institutions have trained a large language model only using data that was openly licensed or in the public domain, the Washington Post reports, providing a blueprint for ethically developing the technology.
But, as the creators readily admit, it was far from easy.
As they describe in a yet-to-be-peer-reviewed paper published this week, it quickly became apparent that it wouldn't be computing power holding them back, but personpower.
That's because the text in the over eight terabyte dataset they put together, which they're calling the Common Pile v0.1, had to be manually cleaned up and reformatted to make it suitable for AI training, WaPo explains. Then there was the considerable extra legwork of double-checking the copyright status of all the data, since many online works are improperly licensed.
"This isn't a thing where you can just scale up the resources that you have available," like access to more computer chips and a fancy web scraper, study coauthor Stella Biderman, a computer scientist and executive director of the nonprofit Eleuther AI, told WaPo. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."
Still, Biderman and her colleagues did get the job done.
Once the painstaking odyssey of creating the Common Pile was over, they used their guilt-free dataset to train a seven billion-parameter LLM. The result? An AI that admirably stacks up against industry models like Meta's Llama 1 and Llama 2 7B — which is impressive, but those were versions released over two years ago. That's practically a lifetime in the AI race.
[...] This latest work is a rebuff to that Silicon Valley line, though it doesn't obviate all ethical concerns. This is still a large language model, a technology fundamentally intended to destroy jobs, and perhaps not everyone whose work has ended up in the public domain would be happy with it being regurgitated by AI — if they aren't dead artists whose copyright has elapsed, of course.
[...] Biderman herself doesn't have any illusions that the likes of OpenAI will suddenly turn over a new leaf and start being paragons of ethical data sourcing. But she hopes her work will at least get them to stop hiding what they're using to train their AI models.
"Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she told WaPo.
Lawsuit: Meta may have seeded porn to minors while hiding piracy for AI training:
Porn sites may have blown up Meta's key defense in a copyright fight with book authors who earlier this year said that Meta torrented "at least 81.7 terabytes of data across multiple shadow libraries" to train its AI models.
Meta has defeated most of the authors' claims and claimed there is no proof that Meta ever uploaded pirated data through seeding or leeching on the BitTorrent network used to download training data. But authors still have a chance to prove that Meta may have profited off its massive piracy, and a new lawsuit filed by adult sites last week appears to contain evidence that could help authors win their fight, TorrentFreak reported.
The new lawsuit was filed last Friday in a US district court in California by Strike 3 Holdings—which says it attracts "over 25 million monthly visitors" to sites that serve as "ethical sources" for adult videos that "are famous for redefining adult content with Hollywood style and quality."
After authors revealed Meta's torrenting, Strike 3 Holdings checked its proprietary BitTorrent-tracking tools designed to detect infringement of its videos and alleged that the company found evidence that Meta has been torrenting and seeding its copyrighted content for years—since at least 2018. Some of the IP addresses were clearly registered to Meta, while others appeared to be "hidden," and at least one was linked to a Meta employee, the filing said.
According to Strike 3 Holdings, Meta "willfully and intentionally" infringed "at least 2,396 movies" as part of a strategy to download terabytes of data as fast as possible by seeding popular high-quality porn. Supposedly, Meta continued seeding the content "sometimes for days, weeks, or even months" after downloading them, and these movies may also have been secretly used to train Meta's AI models, Strike 3 Holdings alleged.
The porn site operator explained to the court that BitTorrent's protocol establishes a "tit-for-tat" mechanism that "rewards users who distribute the most desired content." It alleged that Meta took advantage of this system by "often" pirating adult videos that are "often within the most infringed files on BitTorrent websites" on "the very same day the motion pictures are released."
These tactics allegedly gave Meta several advantages, making it harder for Strike 3 Holdings' sites to compete, including potentially distributing the videos to minors for free without age checks in states that now require them.
OpenAI desperate to avoid explaining why it deleted pirated book datasets:
OpenAI may soon be forced to explain why it deleted a pair of controversial datasets composed of pirated books, and the stakes could not be higher.
At the heart of a class-action lawsuit from authors alleging that ChatGPT was illegally trained on their works, OpenAI's decision to delete the datasets could end up being a deciding factor that gives the authors the win.
It's undisputed that OpenAI deleted the datasets, known as "Books 1" and "Books 2," prior to ChatGPT's release in 2022. Created by former OpenAI employees in 2021, the datasets were built by scraping the open web and seizing the bulk of its data from a shadow library called Library Genesis (LibGen).
As OpenAI tells it, the datasets fell out of use within that same year, prompting an internal decision to delete them.
But the authors suspect there's more to the story than that. They noted that OpenAI appeared to flip-flop by retracting its claim that the datasets' "non-use" was a reason for deletion, then later claiming that all reasons for deletion, including "non-use," should be shielded under attorney-client privilege.
To the authors, it seemed like OpenAI was quickly backtracking after the court granted the authors' discovery requests to review OpenAI's internal messages on the firm's "non-use."
In fact, OpenAI's reversal only made authors more eager to see how OpenAI discussed "non-use," and now they may get to find out all the reasons why OpenAI deleted the datasets.
Last week, US magistrate judge Ona Wang ordered OpenAI to share all communications with in-house lawyers about deleting the datasets, as well as "all internal references to LibGen that OpenAI has redacted or withheld on the basis of attorney-client privilege."
According to Wang, OpenAI slipped up by arguing that "non-use" was not a "reason" for deleting the datasets, while simultaneously claiming that it should also be deemed a "reason" considered privileged.
Either way, the judge ruled that OpenAI couldn't block discovery on "non-use" just by deleting a few words from prior filings that had been on the docket for more than a year.
"OpenAI has gone back-and-forth on whether 'non-use' as a 'reason' for the deletion of Books1 and Books2 is privileged at all," Wang wrote. "OpenAI cannot state a 'reason' (which implies it is not privileged) and then later assert that the 'reason' is privileged to avoid discovery."
Additionally, OpenAI's claim that all reasons for deleting the datasets are privileged "strains credulity," she concluded, ordering OpenAI to produce a wide range of potentially revealing internal messages by December 8. OpenAI must also make its in-house lawyers available for deposition by December 19.
OpenAI has argued that it never flip-flopped or retracted anything. It simply used vague phrasing that led to confusion over whether any of the reasons for deleting the datasets were considered non-privileged. But Wang didn't buy into that, concluding that "even if a 'reason' like 'non-use' could be privileged, OpenAI has waived privilege by making a moving target of its privilege assertions."
Asked for comment, OpenAI told Ars that "we disagree with the ruling and intend to appeal."
I thought we were past this. These people are not deprived of the use of their creations. They have not been stolen from. Copyright infringement is not theft.