
SoylentNews is people

posted by martyb on Friday July 09 2021, @12:52AM   Printer-friendly
from the we-violate-all-open-source-licenses-equally dept.

GitHub’s automatic coding tool rests on untested legal ground:

The Copilot tool has been trained on mountains of publicly available code

[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.

But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.

Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.

[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. A federal appeals court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors, and the Supreme Court declined to review that ruling.

Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:

The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.

“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”

One of the main criticisms regarding Copilot is that it goes against the ethos of open source because it is a paid service. However, Microsoft would arguably justify this by saying the resources needed to train the AI are costly. Still, the training is problematic for some people because they argue Copilot is using snippets of code to train and then charging users.

Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does that potentially bring your code under that license by using Copilot?

One glorious day code will write itself without developers developers.

See Also:
CoPilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias


This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 4, Informative) by darkfeline on Friday July 09 2021, @01:12AM (28 children)

    by darkfeline (1030) on Friday July 09 2021, @01:12AM (#1154141) Homepage

    4. License Grant to Us We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

    This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

    https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us [github.com]

    No untested legal ground here. By using GitHub, you agreed to their ToS and gave them permission to do this.

    --
    Join the SDF Public Access UNIX System today!
  • (Score: 2) by c0lo on Friday July 09 2021, @02:05AM (9 children)

    by c0lo (156) Subscriber Badge on Friday July 09 2021, @02:05AM (#1154155) Journal

    Doesn't mean a thing. If any provision of a contract is illegal, it doesn't matter if you agreed with it or not, it is still illegal.

    Yes, they can show your code to others if it's not hosted under a private repository. Others that see your code cannot use your code outside the license under which your code is released.

    --
    https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
    • (Score: 2) by darkfeline on Friday July 09 2021, @02:13AM (8 children)

      by darkfeline (1030) on Friday July 09 2021, @02:13AM (#1154157) Homepage

      > If any provision of a contract is illegal, it doesn't matter if you agreed with it or not, it is still illegal.

      Those provisions are not only not illegal, but standard for most online services that host user content.

      > Others that see your code cannot use your code outside the license under which your code is released.

      And that is relevant how? Github can use your code because you granted Github a license to do so. Whether non-Github entities can use your code is irrelevant to whether Github can use your code for their Copilot product.

      --
      Join the SDF Public Access UNIX System today!
      • (Score: 2) by c0lo on Friday July 09 2021, @02:32AM (2 children)

        by c0lo (156) Subscriber Badge on Friday July 09 2021, @02:32AM (#1154166) Journal

        Those provisions are not only not illegal, but standard for most online services that host user content.

        Until you trip over a corner case, and a lawsuit carves out an exception and creates a precedent. You can't implicitly assume ToSes are legal in all cases for all time.

        Github can use your code because you granted Github a license to do so.

        Not if someone sues and wins on the grounds that, in those particular circumstances, GitHub doing so facilitated copyright infringement. Or outright committed infringement by creating a derivative work that substantially uses yours (beyond what fair use provisions allow).

        The probability of this happening? Low indeed. But not impossible, especially if your work falls within a narrow special area where not much other code exists to train that AI.

        --
        https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
        • (Score: 2) by darkfeline on Friday July 09 2021, @09:52AM (1 child)

          by darkfeline (1030) on Friday July 09 2021, @09:52AM (#1154232) Homepage

          O. Limitation of Liability

          Short version: We will not be liable for damages or losses arising from your use or inability to use the service or otherwise arising under this agreement. Please read this section carefully; it limits our obligations to you.

          You understand and agree that we will not be liable to you or any third party for any loss of profits, use, goodwill, or data, or for any incidental, indirect, special, consequential or exemplary damages, however arising, that result from

          • the use, disclosure, or display of your User-Generated Content;
          • your use or inability to use the Service;
          • any modification, price change, suspension or discontinuance of the Service;
          • the Service generally or the software or systems that make the Service available;
          • unauthorized access to or alterations of your transmissions or data;
          • statements or conduct of any third party on the Service;
          • any other user interactions that you input or receive through your use of the Service; or
          • any other matter relating to the Service.

          Our liability is limited whether or not we have been informed of the possibility of such damages, and even if a remedy set forth in this Agreement is found to have failed of its essential purpose. We will have no liability for any failure or delay due to matters beyond our reasonable control.

          https://docs.github.com/en/github/site-policy/github-terms-of-service#o-limitation-of-liability [github.com]

          By using Github, you indemnified Github for any liability that may arise from "facilitated copyright infringement" through an AI black box. You would have to prove intent or gross negligence recognized by others in the nascent ML field. Sure, there's a minuscule chance that a court may find Github liable in the future, but now we are extremely far out from the "Untested Legal Ground" claim (disregarding the nitpick that any situation could be considered "Untested Legal Ground" due to the unique configuration of matter in the universe in that moment).

          --
          Join the SDF Public Access UNIX System today!
          • (Score: 2) by c0lo on Saturday July 10 2021, @04:56AM

            by c0lo (156) Subscriber Badge on Saturday July 10 2021, @04:56AM (#1154544) Journal

            you indemnified Github for any liability that may arise from "facilitated copyright infringement" through an AI black box

            I don't see where I'm giving up my copyright, especially if Github were to be instrumental in infringing it, no matter how they did it: by human operator or by running an algorithm. It is their AI that creates a derivative work from a copyrighted one; unless they receive an explicit license from the author to do it, there's no indemnification for them.

            Mind you, it's not only the GPLed software that they potentially infringe. MIT license says "you can do whatever you want as long as you reproduce this very license in your code" - if they strip the license in the process of AI-fycation (creating a derivative work), they are in trouble straight away.

            --
            https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
      • (Score: 4, Informative) by http on Friday July 09 2021, @03:31AM (4 children)

        by http (1920) on Friday July 09 2021, @03:31AM (#1154176)

        Try reading what you posted a second time, paying careful attention:

        This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service,

        Using hosted code to train an AI does not count towards providing the service of making a repository publicly available.

        --
        I browse at -1 when I have mod points. It's unsettling.
        • (Score: 2) by bradley13 on Friday July 09 2021, @05:34AM

          by bradley13 (3053) on Friday July 09 2021, @05:34AM (#1154199) Homepage Journal

          Sure it does. If they claim co-pilot is part of their service, then it's covered by the ToS.

          --
          Everyone is somebody else's weirdo.
        • (Score: 2) by darkfeline on Friday July 09 2021, @06:00AM (2 children)

          by darkfeline (1030) on Friday July 09 2021, @06:00AM (#1154206) Homepage
          Try reading the ToS a first time, paying careful attention:

          The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.

          [...]

          “GitHub,” “We,” and “Us” refer to GitHub, Inc., as well as our affiliates, directors, subsidiaries, contractors, licensors, officers, agents, and employees.

          https://docs.github.com/en/github/site-policy/github-terms-of-service#a-definitions [github.com]

          It's ironic how accurate the subject of this thread is.

          --
          Join the SDF Public Access UNIX System today!
          • (Score: 2) by PiMuNu on Friday July 09 2021, @09:31AM

            by PiMuNu (3823) on Friday July 09 2021, @09:31AM (#1154230)

            > It's ironic how accurate the subject of this thread is.

            I think rather it means you have to be *ultra* careful when reading ToS in order to understand what it really means. Which is exactly why everyone just clicks "Accept".

          • (Score: 1, Insightful) by Anonymous Coward on Sunday July 11 2021, @09:11AM

            by Anonymous Coward on Sunday July 11 2021, @09:11AM (#1154801)

            Say at some point in time you read the ToS, review the services offered at that time, and agree to the ToS based on those. Have you given GitHub permission to use your content for those services, or for any services they may think of later on? I don't think the latter is legal everywhere in the world. Perhaps it is in the US, but as far as I'm aware EU law is based on ideas of what is reasonable that don't include things like this: you're supposed to be able to oversee what you agree to, and an "anything we can think of in the future" clause, explicit or implicit, conflicts with that.

  • (Score: -1, Flamebait) by Anonymous Coward on Friday July 09 2021, @02:18AM

    by Anonymous Coward on Friday July 09 2021, @02:18AM (#1154159)

    Unenforceable ToS means shit.

    You are the sort that make people dislike autistics.

  • (Score: 0) by Anonymous Coward on Friday July 09 2021, @03:57AM (2 children)

    by Anonymous Coward on Friday July 09 2021, @03:57AM (#1154184)
    Actually, the license doesn't give them the right to take your code and copy-pasta it into a separate derivative work. Especially since such use is not necessary for providing the github service. Read it again.

    Though why anyone would use github, knowing it's going to be abused, is beyond me.

    • (Score: 2) by HiThere on Friday July 09 2021, @03:08PM (1 child)

      by HiThere (866) Subscriber Badge on Friday July 09 2021, @03:08PM (#1154310) Journal

      The question then appears to be "Can they choose to add new features and call it part of the same service?". Certainly it wasn't a part of the service when most people agreed to it, but if their "new AI application" is offered by the same organization to those capable of using the prior service, can they define it as a part of the same service?

      It's not as if people never used code repositories as examples of how to do things before. They've just automated that as a new feature of their service. Or is that stretching things beyond where a court would agree?

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
      • (Score: 3, Insightful) by JoeMerchant on Friday July 09 2021, @03:44PM

        by JoeMerchant (3937) on Friday July 09 2021, @03:44PM (#1154338)

        Depends on the court, of course.

        What I wonder is: if this goes on for 5, 20, or 100 years without being tested in court, at what point is it immune from contest? I mean, of course as the practice spreads over time and various service providers there will be fewer and fewer courts willing to find against it, but Mickey Mouse made a mockery of fair use for nearly 100 years before the political climate denied him (and the industry as a whole) another copyright extension.

        --
        🌻🌻 [google.com]
  • (Score: 0) by Anonymous Coward on Friday July 09 2021, @06:54AM (2 children)

    by Anonymous Coward on Friday July 09 2021, @06:54AM (#1154216)

    I think there is a slight distinction to make. GitHub, under the terms of their license that you agreed to when you signed up, is perfectly within their rights to use your software in this way. The problem is that the people and organizations that receive these suggestions may not have the legal right to actually use them.

    • (Score: 2) by DannyB on Friday July 09 2021, @03:02PM (1 child)

      by DannyB (5839) Subscriber Badge on Friday July 09 2021, @03:02PM (#1154306) Journal

      What if someone else puts your GPL code on GitHub? YOU did not authorize this use of your code under non GPL terms. YOU now have cause to sue someone.

      --
      If we tell conservatives that the climate is transitioning, they will work to stop it.
      • (Score: 0) by Anonymous Coward on Friday July 09 2021, @09:44PM

        by Anonymous Coward on Friday July 09 2021, @09:44PM (#1154463)

        Even if someone else puts your code on GitHub in violation of their license, GitHub is still able to have reasonable reliance on the warranties of that user in order to cover their own ass, similar to every other hosting service. The real interesting part is that your remedy is a DMCA notice. Once they receive such a notice, they have to remove the code from everywhere under their control. What that also includes is the AI data sets and the trained AI output. Essentially, they would have to retrain the entire AI in order to be sure they got it all. Only after the DMCA notice would GitHub get anywhere near liability for themselves.

  • (Score: 2, Insightful) by Anonymous Coward on Friday July 09 2021, @07:06AM (1 child)

    by Anonymous Coward on Friday July 09 2021, @07:06AM (#1154217)

    But I think the main point TFA is missing is that most likely the ones at risk are its users, who are copying into their code snippets of copyrighted code from other people, which might or might not require attribution or even redistribution, under a specific license, of the entire codebase they are used in. Considering this is a paid service, it is likely that it will be used for proprietary software that is quite unlikely to satisfy the terms of open source licenses.
    So github is turning their customers into copyright infringers, and might be sued for facilitating and making a profit off it (like piratebay or megaupload). They are just relying on the fact that its other users, those providing the content, are unlikely to have proof the infringement happened.

    • (Score: 2) by HiThere on Friday July 09 2021, @03:11PM

      by HiThere (866) Subscriber Badge on Friday July 09 2021, @03:11PM (#1154311) Journal

      Except that just as with music, individuals are generally not viable targets for an expensive suit. So a different approach will be taken.

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
  • (Score: 1, Insightful) by Anonymous Coward on Friday July 09 2021, @11:07AM (1 child)

    by Anonymous Coward on Friday July 09 2021, @11:07AM (#1154242)

    By using GitHub, you agreed to their ToS and gave them permission to do this.

    Yes, and so does the GPL license. And the GPL license then has provisions that the code cannot be re-licensed under terms not compatible with GPL.

    If their training regurgitates GPL code, then that code is still GPL. It doesn't matter who or what plagiarized it.

    The law is the law, no matter what TOS of some random company (or even the great Microsoft) you agree with.

    • (Score: 2) by HiThere on Friday July 09 2021, @03:13PM

      by HiThere (866) Subscriber Badge on Friday July 09 2021, @03:13PM (#1154315) Journal

      Additionally, if they offer GPL code, they are obliged to include the GPL license.

      Yes, they have the right to use and share that code, but they don't have the right to hide the license.

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
  • (Score: 4, Insightful) by DannyB on Friday July 09 2021, @03:01PM (6 children)

    by DannyB (5839) Subscriber Badge on Friday July 09 2021, @03:01PM (#1154304) Journal

    John writes a GPL program.

    Later, Jane, without John's knowledge, puts this program's source code on GitHub.

    John did not authorize Microsoft / GitHub to publish this code under any terms other than the GPL. This includes when snippets of this GPL code are inserted into proprietary closed source projects because the developer is using CoPilot to auto-suggest snippets.

    I think John has cause to sue someone. Jane, GitHub, Microsoft, a developer using CoPilot who now has GPL code locked up in their proprietary code.

    This also brings up another important point. If you work on a proprietary product (as I do) then you better be sure you have a proper license for every last bit of code that is in your project which you didn't write yourself. If you include a commercial library, have a proper commercial license and are in compliance with it. If you have open source in your project, make sure you are in full compliance with the license. (eg, Apache 2, BSD, MIT, etc)

    In the Java world there is an embarrassing amount of high quality third party libraries. The licenses on these are very commercial friendly. But ALWAYS review the license. Get management approval.

    The commercial friendly licensing is because the users of these are large commercial interests writing large commercial closed source Java programs. And many of the same corporations that consume this open source also sponsor various open source Java projects. Why does Red Hat spend significant resources developing a state of the art Garbage Collector (Shenandoah) for Java? Because their biggest customers are on Java. Why do IBM and Microsoft invest in Java? Same reason.

    CoPilot brings a whole new vector where unlicensed code can make its way into your source code base. This is similar to, but worse than, copy/pasting some code you googled from, say, Stack Overflow. If you don't have a license for it, then don't copy/paste it in. Understand it, and then do what it does in your own code.

    --
    If we tell conservatives that the climate is transitioning, they will work to stop it.
    • (Score: 0) by Anonymous Coward on Friday July 09 2021, @04:31PM

      by Anonymous Coward on Friday July 09 2021, @04:31PM (#1154360)

      But it's AI, as in "Intelligent": it's not copy-pasting code - it "learned" the code and is using its "knowledge" to synthesize new code on the spot!

    • (Score: 2) by darkfeline on Friday July 09 2021, @07:39PM (3 children)

      by darkfeline (1030) on Friday July 09 2021, @07:39PM (#1154421) Homepage

      John does not have cause to sue for monetary damages as likely he cannot demonstrate any monetary damages. He can demand that Jane cease and desist from unauthorized re-licensing of his code, if Jane is still doing so. He can demand Github remove the code, as the copyright holder did not agree to Github's ToS. Microsoft, as the owner of Github as a subsidiary, is not involved at all. I highly doubt John would be able to prove a developer somewhere got the exact same code that he wrote from CoPilot, and enough of it to constitute copyright infringement. If he could, then he could also demand that said developer cease and desist.

      For some reason, people seem to be assuming CoPilot is straight up copying sections of code fed into it. That's not how AI trained on broad datasets works, unless the code is generic enough that copyright would no longer be applicable in the first place (e.g. a function that adds two arguments together).

      --
      Join the SDF Public Access UNIX System today!
      • (Score: 0) by Anonymous Coward on Saturday July 10 2021, @04:35AM

        by Anonymous Coward on Saturday July 10 2021, @04:35AM (#1154541)

        John does not have cause to sue for monetary damages as likely he cannot demonstrate any monetary damages.

        In the United States, John may choose to seek statutory damages, which (if he is successful) entitle him to monetary relief of no less than $750 per work infringed and do not require him to demonstrate any actual damages whatsoever. John must have registered his copyright prior to the alleged infringement in order to be eligible for statutory relief.

        Other jurisdictions may have similar mechanisms.

      • (Score: 0) by Anonymous Coward on Saturday July 10 2021, @04:55AM

        by Anonymous Coward on Saturday July 10 2021, @04:55AM (#1154543)

        For some reason, people seem to be assuming CoPilot is straight up copying sections of code fed into it. That's not how AI trained on broad datasets works, unless the code is generic enough that copyright would no longer be applicable in the first place (e.g. a function that adds two arguments together).

        GitHub's own research [github.com] says "once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License."

        So yes it does sound like this tool is absolutely capable of straight up regurgitating significant amounts of nontrivial and copyrightable text, verbatim, that was part of its training set.

      • (Score: 0) by Anonymous Coward on Sunday July 11 2021, @09:29AM

        by Anonymous Coward on Sunday July 11 2021, @09:29AM (#1154802)

        If GitHub (Microsoft) argues there is no problem with copyright there is a simple solution to make everybody happy: train the AI on the huge proprietary code base Microsoft owns.

        It's supposed to be code developed by the very best developers in the world, according to what Microsofties told me during a conversion project to their platform I was part of once, so it must be excellent, and without any copyright problems there is no reason not to use it.

    • (Score: 0) by Anonymous Coward on Friday July 09 2021, @09:26PM

      by Anonymous Coward on Friday July 09 2021, @09:26PM (#1154460)

      i want my cut of github and MS's hide! yeehaaaawwww!