GitHub’s automatic coding tool rests on untested legal ground:
The Copilot tool has been trained on mountains of publicly available code
[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.
But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.
Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.
[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. A federal appeals court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors, and the Supreme Court declined to hear an appeal.
Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:
The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.
“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”
One of the main criticisms of Copilot is that it goes against the ethos of open source because it is a paid service. Microsoft would arguably justify this by saying the resources needed to train the AI are costly. Still, the training is problematic for some people, who argue that Copilot trains on snippets of other people's code and then charges users for the result.
Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does that potentially bring your code under that license by using Copilot?
One glorious day code will write itself without developers developers.
See Also:
CoPilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias
(Score: 2) by DannyB on Friday July 09 2021, @08:53PM (11 children)
In an IDE like the one I use, that analysis and those warnings happen keystroke by keystroke.
If I change something that breaks a function in another file, then on the exact keystroke which does this, that file name turns red in the tree structure of files in the project.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Saturday July 10 2021, @01:29AM (10 children)
We had a vendor's software bug turn out to be use of an uninitialized variable. My comment on the matter was that my editor flags those in realtime, not to mention the compiler warnings.
Ignoring warnings like that is how we end up with procedures demanding zero compiler warnings at max settings.
🌻🌻 [google.com]
(Score: 2) by DannyB on Monday July 12 2021, @01:39PM (9 children)
It should not be a warning. It should be a fatal error that prevents a successful compile.
If the language definition is going to allow uninitialized variables, then it should define what they actually get initialized to, and that should be something sane that follows the principle of least astonishment.
But I would strongly prefer uninitialized variables be a fatal error. If the programmer cannot be bothered to specify the initial value, leading to a nondeterministic result, then maybe they are not that good of a programmer. If the language cannot prevent this from occurring, then maybe it's not that good of a language.
I remember forty years ago, when I was using Pascal, there were debates about how that language forced you to write code that was safe. The "bondage and discipline" language users pointed out the sad loss of an interplanetary mission (sorry, no longer remember which one) due to a type mismatch error in FORTRAN.
I simply cannot understand the mindset of people who think compilers, by default, should allow you to write unsafe code. Now I don't have a problem with having some kind of declaration or annotation around a block of lines, or module to say "I know what I'm doing, leave me alone and compile this". But that should be the exception, not the rule.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Monday July 12 2021, @02:45PM (3 children)
I agree, in principle. In practice: programmers are human, as are code reviewers, testers, especially managers, and the rest of us. It happens, which is why we now have a procedure to document checking for it. People are still human. Legend has it that there was a documented procedure for the Space Shuttle that required no fewer than 50 people to sign off that a support beam was removed from the cargo bay before the shuttle was rotated to vertical position. Nonetheless, after 50 people had signed off that the beam was removed, it wasn't. The shuttle was rotated vertical, the beam fell, and it did millions in damage and cost weeks of schedule slip.
When our procedure fails to catch the next one, we will up the game to require all compilers to be set with warnings as errors, but that's still no guarantee...
🌻🌻 [google.com]
(Score: 2) by DannyB on Monday July 12 2021, @03:16PM (2 children)
If the compiler checks for it, and it is a fatal error, then problem solved! Us poor fallible humans will get a message we cannot ignore when our program does not compile. This compile error will not make it to the review or testing stage.
The compiler is your first line of defense! Actually it is the language that is the first line of defense. The language should simply make it impossible to do things that have no possible meaning. All variables must be initialized.
About unit testing: the compiler is also your first line of unit testing. If it won't compile, it fails the first line of tests. No need to write all sorts of silly unit tests to check things the compiler should have checked. I always laugh at that for some languages where people write unit tests for things the compiler should have checked.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Monday July 12 2021, @04:03PM (1 child)
Nothing is idiot proof. Never underestimate the ability of idiots to circumvent safety mechanisms.
🌻🌻 [google.com]
(Score: 3, Interesting) by DannyB on Monday July 12 2021, @04:31PM
You can write bad code in any language. However it doesn't hurt for a language to have safety so that fallible humans don't make silly mistakes. Uninitialized variables are an excellent example of something that doesn't make sense. The compiler should be able to prove that you are accessing a variable prior to assigning it a value.
I'm not arguing that the compiler should try to deeply analyze the thought process of your code, how it works, and then be a critic. Just don't allow common mistakes, especially when they don't have any sensible meaning.
We could all program in assembly. Or in C. I strongly suspect there is an economic reason why we don't all program in C or assembly. And I further suspect that economic reasoning has to do with both productivity and safety. And safety is also a form of productivity and testing.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Monday July 12 2021, @02:49PM (4 children)
Oh, I know the type. The ones that hand optimize the assembly because what they are doing is so simple and compact that they think optimizing the last 0.00001% of performance into their code is worth the risk; they're smart guys, they know what they're doing. I've met them in implantable medical devices (and watched them repeatedly backpedal on obvious over-optimizations only after they were thrown in their face as severe real world problems), I've met them in high speed motor controls, and I'm sure they're out there in a lot of industries.
🌻🌻 [google.com]
(Score: 2) by DannyB on Monday July 12 2021, @03:30PM (3 children)
If it is mission critical to hand optimize something in assembly to the last possible clock cycle, then do that. Whatever it costs.
In reality there are few, if any, cases where that is actually mission critical.
In the 21st century hardware is way, way cheaper than developer time. Yes, I know this wasn't always true. Once computers were expensive and developers were cheap. Thus it made huge economic sense to have developers optimize as much as possible to get the best economic use of the hardware. Now hardware is dirt cheap. Developers are very expensive. For most things in real life, it is cheaper to just buy better hardware instead of doing that optimization.
Also compiler optimization has come a long, long way since the 1970s.
In those edge cases, if they actually even exist, where it is mission critical to optimize the last possible cpu cycle, then do so. Because cost comes secondary to accomplishing the mission.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Monday July 12 2021, @04:11PM (2 children)
In the implantable devices, the perennial excuse was extension of battery life. Approximate real world battery life was maybe 3.5 years, advertised battery life under the clearly unrealistic specified conditions was 7 years. They would do boneheaded things like have an 8 bit checksum on a communication which was estimated to add 2 weeks to that 7 year figure as compared to a 16 bit checksum. Then the 8 bit checksum would allow painful (and unapproved) levels of stimulation to be programmed in error, with dozens of reports from the field, and they implemented a programmer side patch that ate 6 weeks off that 7 year figure. They failed to thermally compensate the battery voltage readings because the extra computation would shave 3 weeks off the 7 year figure, justification being: it's implanted, temperature is stable around body temperature. Yeah, well, geniuses, before it gets implanted it does a battery check on itself and reports itself dead if it has been stored below 50F, which happens - a lot - in the real world. At least that one got caught in validation testing.
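For anyone wondering how much an 8 bit checksum gives up compared to a 16 bit one, here is a small sketch. The actual checksum algorithm in that device is not stated above, so a plain additive checksum is assumed purely for illustration, and the message bytes are made up.

```python
def checksum8(data):
    """Additive checksum truncated to 8 bits (assumed scheme, for illustration)."""
    return sum(data) & 0xFF


def checksum16(data):
    """Same additive checksum, but keeping 16 bits."""
    return sum(data) & 0xFFFF


# Hypothetical 3-byte message, and a corrupted copy where two bytes each
# picked up a flipped high bit (+0x80 each, so the byte sum shifts by 0x100).
original = bytes([0x10, 0x20, 0x05])
corrupted = bytes([0x90, 0xA0, 0x05])

print(checksum8(original) == checksum8(corrupted))    # 8-bit sums collide
print(checksum16(original) == checksum16(corrupted))  # 16-bit sums differ
```

With only 256 possible values, the 8 bit version lets roughly 1 in 256 random corruptions through, versus roughly 1 in 65,536 for 16 bits. That is the margin that was traded for two weeks of nominal battery life.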
The motors - they were contractors, I can only imagine what internal process led them to save those two bytes of code required to initialize the variable.
🌻🌻 [google.com]
(Score: 2) by DannyB on Monday July 12 2021, @04:36PM (1 child)
You mention a very specific use case here.
In a medical device, I strongly suspect that safety is one of the highest priorities.
Hopefully the device has an opportunity to report itself dead prior to reaching the operating table.
The thing to remember about the saying "you are what you are" is, that saying: is what it is.
(Score: 2) by JoeMerchant on Monday July 12 2021, @06:34PM
That was the problem, the device was reporting itself dead because it was (willfully) ignorant of the thermal effect on its battery voltage - willfully ignorant in the name of saving a few nanojoules of energy.
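A rough sketch of what the skipped compensation could look like, assuming a simple linear temperature model. Every constant and function name here is hypothetical and picked only to show the shape of the problem. None of it comes from the actual device, and real battery chemistry is not this tidy.

```python
# All values below are hypothetical, for illustration only.
REFERENCE_TEMP_C = 37.0      # body temperature, where the threshold is specified
TEMP_COEFF_V_PER_C = 0.002   # assumed voltage sag per degree C below reference
DEAD_THRESHOLD_V = 2.50      # assumed end-of-life voltage at reference temperature


def compensated_voltage(measured_v, temp_c):
    """Estimate what the reading would be at the reference temperature."""
    return measured_v + TEMP_COEFF_V_PER_C * (REFERENCE_TEMP_C - temp_c)


def battery_is_dead(measured_v, temp_c=None):
    """Compare against the threshold, compensating only if a temperature is given."""
    if temp_c is not None:
        measured_v = compensated_voltage(measured_v, temp_c)
    return measured_v < DEAD_THRESHOLD_V


# A healthy battery that was stored cold (10 C is 50 F) reads low on the shelf.
cold_reading_v = 2.46
print(battery_is_dead(cold_reading_v))                # no compensation: reported dead
print(battery_is_dead(cold_reading_v, temp_c=10.0))   # compensated: still alive
```

One multiply and one add per battery check is the computation that was deemed too expensive.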
🌻🌻 [google.com]