SoylentNews is people

posted by martyb on Friday July 09 2021, @12:52AM
from the we-violate-all-open-source-licenses-equally dept.

GitHub’s automatic coding tool rests on untested legal ground:

The Copilot tool has been trained on mountains of publicly available code

[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.

But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.

Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.

[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. The Supreme Court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors.

Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:

The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.

“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”

One of the main criticisms regarding Copilot is that it goes against the ethos of open source because it is a paid service. However, Microsoft would arguably justify this by saying the resources needed to train the AI are costly. Still, the training is problematic for some people because they argue Copilot is using snippets of their code to train and then charging users.

Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does that potentially bring your code under that license by using Copilot?

One glorious day code will write itself without developers developers.

See Also:
CoPilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias


This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Insightful) by DannyB on Friday July 09 2021, @03:01PM (6 children)

    by DannyB (5839) on Friday July 09 2021, @03:01PM (#1154304)

    John writes a GPL program.

    Later, Jane, without John's knowledge, puts this program's source code on GitHub.

    John did not authorize Microsoft / GitHub to publish this code under any terms other than the GPL. This includes when snippets of this GPL code are inserted into proprietary closed source projects because the developer is using CoPilot to auto-suggest snippets.

    I think John has cause to sue someone. Jane, GitHub, Microsoft, a developer using CoPilot who now has GPL code locked up in their proprietary code.

    This also brings up another important point. If you work on a proprietary product (as I do) then you better be sure you have a proper license for every last bit of code in your project that you didn't write yourself. If you include a commercial library, have a proper commercial license and be in compliance with it. If you have open source in your project, make sure you are in full compliance with the license (e.g., Apache 2, BSD, MIT, etc.).

    In the Java world there is an embarrassing wealth of high-quality third-party libraries. The licenses on these are very commercial-friendly. But ALWAYS review the license. Get management approval.

    The commercial-friendly licensing is because the users of these are large commercial interests writing large commercial closed source Java programs. And many of the same corporations that consume this open source also sponsor various open source Java projects. Why does Red Hat spend significant resources developing a state-of-the-art garbage collector (Shenandoah) for Java? Because their biggest customers are on Java. Why do IBM and Microsoft invest in Java? Same reason.

    CoPilot brings a whole new vector where unlicensed code can make its way into your source code base. This is similar to, but worse than, copy/pasting some code you googled from, say, Stack Overflow. If you don't have a license for it, then don't copy/paste it in. Understand it, and then do what it does in your own code.

    --
    People today are educated enough to repeat what they are taught but not to question what they are taught.
  • (Score: 0) by Anonymous Coward on Friday July 09 2021, @04:31PM

    by Anonymous Coward on Friday July 09 2021, @04:31PM (#1154360)

    But it's AI, as in "Intelligent": it's not copy-pasting code - it "learned" the code and is using its "knowledge" to synthesize new code on the spot!

  • (Score: 2) by darkfeline on Friday July 09 2021, @07:39PM (3 children)

    by darkfeline (1030) on Friday July 09 2021, @07:39PM (#1154421)

    John does not have cause to sue for monetary damages as likely he cannot demonstrate any monetary damages. He can send Jane a cease and desist demanding that she stop the unauthorized re-licensing of his code, if Jane is still doing so. He can demand GitHub remove the code, as the copyright holder did not agree to GitHub's ToS. Microsoft, which owns GitHub as a subsidiary, is not involved at all. I highly doubt John would be able to prove a developer somewhere got the exact same code that he wrote from CoPilot, and enough of it to constitute copyright infringement. If he could, then he could also demand that said developer cease and desist.

    For some reason, people seem to be assuming CoPilot is straight up copying sections of code fed into it. That's not how AI trained on broad datasets works, unless the code is generic enough that copyright would no longer be applicable in the first place (e.g. a function that adds two arguments together).

    --
    Join the SDF Public Access UNIX System today!
    • (Score: 0) by Anonymous Coward on Saturday July 10 2021, @04:35AM

      by Anonymous Coward on Saturday July 10 2021, @04:35AM (#1154541)

      John does not have cause to sue for monetary damages as likely he cannot demonstrate any monetary damages.

      In the United States, John may choose to seek statutory damages, which (if he succeeds) entitle him to monetary relief of no less than $750 per work infringed and do not require him to demonstrate any actual damages whatsoever. John must have registered his copyright prior to the alleged infringement in order to be eligible for statutory relief.

      Other jurisdictions may have similar mechanisms.

    • (Score: 0) by Anonymous Coward on Saturday July 10 2021, @04:55AM

      by Anonymous Coward on Saturday July 10 2021, @04:55AM (#1154543)

      For some reason, people seem to be assuming CoPilot is straight up copying sections of code fed into it. That's not how AI trained on broad datasets works, unless the code is generic enough that copyright would no longer be applicable in the first place (e.g. a function that adds two arguments together).

      GitHub's own research [github.com] says "once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License."

      So yes, it does sound like this tool is absolutely capable of straight up regurgitating, verbatim, significant amounts of nontrivial and copyrightable text that was part of its training set.

    • (Score: 0) by Anonymous Coward on Sunday July 11 2021, @09:29AM

      by Anonymous Coward on Sunday July 11 2021, @09:29AM (#1154802)

      If GitHub (Microsoft) argues there is no problem with copyright there is a simple solution to make everybody happy: train the AI on the huge proprietary code base Microsoft owns.

      It's supposed to be code developed by the very best developers in the world, according to what Microsofties told me during a conversion project to their platform I was part of once, so it must be excellent, and without any copyright problems there is no reason not to use it.

  • (Score: 0) by Anonymous Coward on Friday July 09 2021, @09:26PM

    by Anonymous Coward on Friday July 09 2021, @09:26PM (#1154460)

    i want my cut of github and MS's hide! yeehaaaawwww!