from the we-violate-all-open-source-licenses-equally dept.
The Copilot tool has been trained on mountains of publicly available code
[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.
But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.
Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.
[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. The Second Circuit upheld Google’s fair use claim (a ruling the Supreme Court declined to review), on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors.
“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”
One of the main criticisms of Copilot is that, as a paid service, it goes against the ethos of open source. Microsoft would arguably justify this by pointing to the costly resources needed to train the AI. Still, the training is problematic for some people, who argue that Copilot trains on snippets of open source code and then charges users for the result.
Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does using Copilot potentially bring your own code under that license?
One glorious day, code will write itself without developers.
Copilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training Copilot regardless of license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias
The Free Software Foundation (FSF) has published five of the white papers it funded regarding questions about Microsoft's GitHub Copilot. After Microsoft acquired GitHub, it built Copilot, a machine learning system trained on GitHub's archive of software. Both the approach chosen and even the basic activity itself raise many questions, starting with those of licensing.
Microsoft GitHub's announcement of an AI-driven Service as a Software Substitute (SaaSS) program called Copilot -- which uses machine learning to autocomplete code for developers as they write software -- immediately raised serious questions for the free software movement and our ability to safeguard user and developer freedom. We felt these questions needed to be addressed, as a variety of serious implications were foreseen for the free software community and developers who use GitHub. These inquiries -- and others possibly yet to be discovered -- needed to be reviewed in depth.
In our call for papers, we set forth several areas of interest. Most of these areas centered around copyright law, questions of ownership for AI-generated code, and legal impacts for GitHub authors who use a GNU or other copyleft license(s) for their works. We are pleased to announce the community-provided research into these areas, and much more.
First, we want to thank everyone who participated by sending in their papers. We received a healthy response of twenty-two papers from members of the community. The papers weighed in on the multiple areas of interest we had indicated in our announcement. Using an anonymous review process, we concluded there were five papers that would be best suited to inform the community and foster critical conversations to help guide our actions in the search for solutions.
As projected here back in October, there is now a class action lawsuit, albeit in its earliest stages, against Microsoft over its blatant license violation through its use of the M$ GitHub Copilot tool. The software project, Copilot, strips copyright licensing and attribution from existing copyrighted code on an unprecedented scale. The class action lawsuit insists that machine learning algorithms, often marketed as "Artificial Intelligence", are not exempt from copyright law nor are the wielders of such tools.
The $9 billion in damages is arrived at through scale. When M$ Copilot rips code without attribution and strips the copyright license from it, it violates the DMCA three times. So if only 1% of its 1.2M users receive such output, the licenses were breached 12k times, which translates to 36k DMCA violations, at a very low-ball estimate.
"If each user receives just one Output that violates Section 1202 throughout their time using Copilot (up to fifteen months for the earliest adopters), then GitHub and OpenAI have violated the DMCA 3,600,000 times. At minimum statutory damages of $2500 per violation, that translates to $9,000,000,000," the litigants stated.
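The complaint's arithmetic can be checked with a quick back-of-the-envelope calculation. The figures below (1.2M users, three DMCA violations per unattributed output, $2,500 minimum statutory damages) are taken from the quotes above; the variable names are my own:

```python
# Back-of-the-envelope check of the damages figures quoted in the complaint.
USERS = 1_200_000              # Copilot users cited in the suit
VIOLATIONS_PER_OUTPUT = 3      # each license-stripped output allegedly violates the DMCA three times
MIN_STATUTORY_DAMAGES = 2_500  # minimum per-violation statutory damages quoted in the complaint

# Low-ball scenario from the article: only 1% of users ever receive such output.
lowball_breaches = int(USERS * 0.01)                           # 12,000 breached licenses
lowball_violations = lowball_breaches * VIOLATIONS_PER_OUTPUT  # 36,000 DMCA violations

# Complaint's scenario: every user receives at least one violating output.
total_violations = USERS * VIOLATIONS_PER_OUTPUT          # 3,600,000 violations
total_damages = total_violations * MIN_STATUTORY_DAMAGES  # $9,000,000,000

print(lowball_violations)  # 36000
print(total_damages)       # 9000000000
```

Both scenarios reproduce the article's numbers exactly, so the $9 billion headline figure is simply the minimum statutory award applied once per user, times three.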
Besides open-source licenses and DMCA (§ 1202, which forbids the removal of copyright-management information), the lawsuit alleges violation of GitHub's terms of service and privacy policies, the California Consumer Privacy Act (CCPA), and other laws.
The suit is on twelve (12) counts, including:
– Violation of the DMCA.
– Breach of contract (two counts).
– Tortious interference.
– False designation of origin.
– Unjust enrichment.
– Unfair competition.
– Violation of the California Consumer Privacy Act.
– Civil conspiracy.
– Declaratory relief.
Furthermore, these actions are contrary to what GitHub stood for prior to its sale to M$ and mark yet another step in M$'s ongoing attempts to undermine and sabotage Free and Open Source Software and the communities that support it.