SoylentNews is people
posted by martyb on Friday July 09 2021, @12:52AM
from the we-violate-all-open-source-licenses-equally dept.

GitHub’s automatic coding tool rests on untested legal ground:

The Copilot tool has been trained on mountains of publicly available code

[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.

But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.

Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.

[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. A federal appeals court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors, and the Supreme Court declined to hear an appeal.

Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:

The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.

“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”

One of the main criticisms of Copilot is that it goes against the ethos of open source because it is a paid service. However, Microsoft would arguably justify this by saying the resources needed to train the AI are costly. Still, the training is problematic for some people, who argue that Copilot trains on snippets of their code and then charges users.

Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does using Copilot potentially bring your own code under that license?

One glorious day code will write itself without developers developers.

See Also:
CoPilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias


  • (Score: 2, Interesting) by Anonymous Coward on Friday July 09 2021, @01:13AM (8 children)

    by Anonymous Coward on Friday July 09 2021, @01:13AM (#1154142)

    Since it's been trained on open source code, I'm going to assume a lot of it isn't public domain, and that it's going to end up suggesting code that is covered by the GPL, or code that is "you can look but need a license to actually use in non-free products" and such.

    Should be "interesting" (as in "highly profitable") to fire it up and catch it offering non-public-domain code snippets, which can potentially make their way into proprietary code. Or even into open code where you forget to credit the author.

  • (Score: 1, Interesting) by Anonymous Coward on Friday July 09 2021, @02:24AM (4 children)

    by Anonymous Coward on Friday July 09 2021, @02:24AM (#1154163)

    If this gets to be something important, it will have to be GPLed.

    And I LIKE IT.

    • (Score: 0) by Anonymous Coward on Friday July 09 2021, @03:59AM (3 children)

      by Anonymous Coward on Friday July 09 2021, @03:59AM (#1154185)
      Well, Microsoft always claimed the GPL was viral.
      • (Score: 0) by Anonymous Coward on Friday July 09 2021, @04:03AM

        by Anonymous Coward on Friday July 09 2021, @04:03AM (#1154186)

        For once, MS wasn't wrong. It's by design.

        And it's FUCKING Brilliant.

      • (Score: -1, Redundant) by Anonymous Coward on Friday July 09 2021, @04:16AM

        by Anonymous Coward on Friday July 09 2021, @04:16AM (#1154187)

        MS wasn't wrong.

        And it's FUCKING BRILLIANT.

      • (Score: 0) by Anonymous Coward on Friday July 09 2021, @11:17AM

        by Anonymous Coward on Friday July 09 2021, @11:17AM (#1154245)

        Well, Microsoft always claimed the GPL was viral.

        What do you mean "claimed"? GPL *is* a viral license by design. That is its sole purpose and why it was designed that way. It's also why the FSF has not viewed the LGPL positively since day 1, though it was deemed necessary to allow non-free software to actually run on a free-software-based OS.

        The viral nature of the GPL is a detriment to it as a license. Microsoft legal was just shit scared that it would embed itself into something by accident (or maybe through malice by some disgruntled employee) and then some trolls would sue them, like SCO vs. IBM. I think they are more relaxed about it now.

        https://cloudblogs.microsoft.com/opensource/2018/03/19/microsoft-open-source-licensing-gplv3/ [microsoft.com]

  • (Score: 2) by HiThere on Friday July 09 2021, @03:17PM

    by HiThere (866) Subscriber Badge on Friday July 09 2021, @03:17PM (#1154318) Journal

    Just about NO code is public domain. Copyright laws have practically eliminated the existence of new public domain works. That's the reason for licenses such as Artistic, MIT, and BSD. And part of the reason for GPL.

    --
    Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
  • (Score: 2) by JoeMerchant on Friday July 09 2021, @03:47PM (1 child)

    by JoeMerchant (3937) on Friday July 09 2021, @03:47PM (#1154339)

    At what point is a code snippet identifiable as non-public-domain? I'm sure that:

            int i = 0;

    appears in lots of GPL code, but does that make it covered by the GPL license?

    How about a long paragraph of code that appears in both GPL code and MIT code? Does the MIT license take precedence only if it was published first?

    --
    🌻🌻 [google.com]
    • (Score: 0) by Anonymous Coward on Friday July 09 2021, @05:35PM

      by Anonymous Coward on Friday July 09 2021, @05:35PM (#1154384)

      At what point is a code snippet identifiable as non-public-domain? I'm sure that:

                      int i = 0;

      appears in lots of GPL code, but does that make it covered by the GPL license?

      In order for a work to be copyrightable, it has to meet a minimum level of creativity. This is for courts to decide. The bar is pretty low but something like "int i = 0;", by itself, is unlikely to be considered a creative work eligible for copyright protection.

      If a work is not protected by copyright then the terms of a copyright license like the GPL are irrelevant.

      How about a long paragraph of code that appears in both GPL code and MIT code? Does the MIT license take precedence only if it was published first?

      Like two people who independently, and unaware of each other's work, write exactly the same program in exactly the same way?

      The creativity requirement should in principle prevent this from ever happening. If the work is considered copyrightable, and in the absence of any other information, the person who published first would have a pretty convincing argument that the other party used their work.