GitHub’s automatic coding tool rests on untested legal ground:
The Copilot tool has been trained on mountains of publicly available code
[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.
But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms function only because of the massive amounts of data they analyze, and much of that data comes from the open internet. An easy example is ImageNet, perhaps the most influential AI training dataset, which is made up entirely of publicly available images that ImageNet’s creators do not own. If a court were to rule that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.
Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.
[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. The Second Circuit upheld Google’s fair use claim (and the Supreme Court declined to review the decision), on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors.
Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:
The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.
“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”
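To make the quoted claim concrete, here is a toy illustration (hypothetical, not Copilot’s actual behavior or API): the developer supplies only a function name and a docstring as “context,” and a completion model is expected to synthesize a matching body. The body shown is the kind of output such a tool might produce.

```python
# Context the developer writes: a signature and a docstring.
def is_palindrome(text: str) -> bool:
    """Return True if `text` reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    # --- below is a plausible model-synthesized completion ---
    cleaned = [c.lower() for c in text if c.isalnum()]
    return cleaned == cleaned[::-1]
```

The point of the example is that the comment and identifier alone carry enough intent for a completion model to work from — which is also why the provenance of the code such suggestions are learned from matters.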
One of the main criticisms of Copilot is that it goes against the ethos of open source because it is a paid service. Microsoft would arguably justify this by pointing to the costly resources needed to train the AI. Still, the training itself is what troubles some people: they argue that Copilot is trained on snippets of open source code and then charges users for the result.
Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does that potentially bring your code under that license by using Copilot?
One glorious day code will write itself without developers.
See Also:
Copilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias
(Score: 2) by c0lo on Friday July 09 2021, @02:05AM (9 children)
Doesn't mean a thing. If any provision of a contract is illegal, it doesn't matter whether you agreed to it or not; it is still illegal.
Yes, they can show your code to others if it's not hosted in a private repository. But others who see your code cannot use it outside the license under which your code is released.
https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
(Score: 2) by darkfeline on Friday July 09 2021, @02:13AM (8 children)
> If any provisions of a contract is illegal, it doesn't matter if you agreed with them or not, they are still illegal.
Those provisions are not only not illegal, but standard for most online services that host user content.
> Others that see your code cannot use your code outside the license under your code is released.
And that is relevant how? Github can use your code because you granted Github a license to do so. Whether non-Github entities can use your code is irrelevant to whether Github can use your code for their Copilot product.
Join the SDF Public Access UNIX System today!
(Score: 2) by c0lo on Friday July 09 2021, @02:32AM (2 children)
Until you trip over a corner case, and a lawsuit carves an exception and creates a precedent. You can't implicitly assume ToSes are legal in all cases for all time.
Not if someone sues and wins on the grounds that, in those particular circumstances, GitHub doing so facilitated copyright infringement. Or outright committed infringement by creating a derivative work that substantially uses yours (beyond what fair use provisions allow).
The probability of this happening? Low indeed. But not impossible, especially if your work falls within a narrow special area where not much other code exists to train that AI.
(Score: 2) by darkfeline on Friday July 09 2021, @09:52AM (1 child)
https://docs.github.com/en/github/site-policy/github-terms-of-service#o-limitation-of-liability [github.com]
By using Github, you indemnified Github for any liability that may arise from "facilitated copyright infringement" through an AI black box. You would have to prove intent or gross negligence as recognized by others in the nascent ML field. Sure, there's a minuscule chance that a court may find Github liable in the future, but now we are extremely far out from the "Untested Legal Ground" claim (disregarding the nitpick that any situation could be considered "Untested Legal Ground" due to the unique configuration of matter in the universe in that moment).
(Score: 2) by c0lo on Saturday July 10 2021, @04:56AM
I don't see where I'm giving up my copyright, especially if Github were instrumental in infringing it, no matter how they did it: by human operator or by running an algorithm. It is their AI that creates a derivative work from a copyrighted one; unless they receive an explicit license from the author to do so, there's no indemnification for them.
Mind you, it's not only the GPLed software that they potentially infringe. MIT license says "you can do whatever you want as long as you reproduce this very license in your code" - if they strip the license in the process of AI-fycation (creating a derivative work), they are in trouble straight away.
(Score: 4, Informative) by http on Friday July 09 2021, @03:31AM (4 children)
Try reading what you posted a second time, paying careful attention:
Using hosted code to train an AI does not count towards providing the service of making a repository publicly available.
I browse at -1 when I have mod points. It's unsettling.
(Score: 2) by bradley13 on Friday July 09 2021, @05:34AM
Sure it does. If they claim co-pilot is part of their service, then it's covered by the ToS.
Everyone is somebody else's weirdo.
(Score: 2) by darkfeline on Friday July 09 2021, @06:00AM (2 children)
https://docs.github.com/en/github/site-policy/github-terms-of-service#a-definitions [github.com]
It's ironic how accurate the subject of this thread is.
(Score: 2) by PiMuNu on Friday July 09 2021, @09:31AM
> It's ironic how accurate the subject of this thread is.
I think rather it means you have to be *ultra* careful when reading ToS in order to understand what it really means. Which is exactly why everyone just clicks "Accept".
(Score: 1, Insightful) by Anonymous Coward on Sunday July 11 2021, @09:11AM
Say that at some point in time you read the ToS, review the services offered at that time, and agree to the ToS on that basis. Have you then given GitHub permission to use your content only for those services, or for any services they may think of later on? I don't think the latter is legal everywhere in the world. Perhaps it is in the US, but as far as I'm aware, EU law is based on notions of what is reasonable that don't include things like this: you're supposed to be able to foresee what you agree to, and an "anything we can think of in the future" clause, explicit or implicit, conflicts with that.