SoylentNews is people

posted by martyb on Friday July 09 2021, @12:52AM
from the we-violate-all-open-source-licenses-equally dept.

GitHub’s automatic coding tool rests on untested legal ground:

The Copilot tool has been trained on mountains of publicly available code

[...] When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.

But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms only function due to massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.

Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.

[...] And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. A federal appeals court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors, and the Supreme Court declined to review that ruling.

Microsoft’s GitHub Copilot Met with Backlash from Open Source Copyright Advocates:

The GitHub Copilot system runs on a new AI platform developed by OpenAI known as Codex. Copilot is designed to help programmers across a wide range of languages. That includes popular languages like JavaScript, Ruby, Go, Python, and TypeScript, but also many more.

“GitHub Copilot understands significantly more context than most code assistants. So, whether it’s in a docstring, comment, function name, or the code itself, GitHub Copilot uses the context you’ve provided and synthesizes code to match. Together with OpenAI, we’re designing GitHub Copilot to get smarter at producing safe and effective code as developers use it.”

One of the main criticisms regarding Copilot is that it goes against the ethos of open source because it is a paid service. However, Microsoft would arguably justify this by saying the resources needed to train the AI are costly. Still, the training is problematic for some people because they argue Copilot is using snippets of code to train and then charging users.

Is it fair use to auto-suggest snippets of code that are under an open source copyright license? Does that potentially bring your code under that license by using Copilot?

One glorious day code will write itself without developers developers.

See Also:
CoPilot on GitHub
Twitter: GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.
Hacker News: GitHub confirmed using all public code for training copilot regardless license
OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias


  • (Score: 2) by DannyB on Friday July 09 2021, @03:13PM (7 children)

    by DannyB (5839) Subscriber Badge on Friday July 09 2021, @03:13PM (#1154316) Journal

    There is nothing wrong with boilerplate per se.

    An IDE may insert common templates for you, such as a for or while loop. The entire structure is there. If I then rename the variable in the for loop, all references to that local variable within the loop are renamed before I even start filling in the body.

    for ( int i = 0; i < 30; i++ ) {
          . . . insert body here . . .
    }

    If I change "i" to "z" (by doing ctrl-shift-R to rename the variable), then all of the i's within the loop change to z's. Just within the template. It is an intelligent rename, not a stupid search/replace. It is based on the compiler's understanding of the scope of variable i. The compiler is deeply integrated with the editor. That's what makes an IDE powerful. The editor understands code on a conceptual level, not just as characters on a page. Unlike editors that use silly regex tricks to color code the source, it color codes the source based on the compiler's understanding of the code in the editor.
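To illustrate why a plain search/replace is the wrong tool here, the following is a minimal sketch. The class and method names are hypothetical, and this is not how any IDE implements rename: it only contrasts blind text replacement with a whole-identifier regex, which is still text-based and knows nothing about scope.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenameSketch {
    // Blind text replacement: every "i" changes, including the one inside "int".
    static String naiveRename(String code, String from, String to) {
        return code.replace(from, to);
    }

    // Whole-identifier replacement using regex word boundaries.
    // Still purely textual: it cannot tell two different variables named "i" apart.
    static String wordRename(String code, String from, String to) {
        return code.replaceAll("\\b" + Pattern.quote(from) + "\\b",
                               Matcher.quoteReplacement(to));
    }

    public static void main(String[] args) {
        String loop = "for ( int i = 0; i < 30; i++ ) { }";
        System.out.println(naiveRename(loop, "i", "z")); // for ( znt z = 0; z < 30; z++ ) { }
        System.out.println(wordRename(loop, "i", "z"));  // for ( int z = 0; z < 30; z++ ) { }
    }
}
```

The word-boundary version happens to work on this one line, but it would also rename an unrelated "i" in another scope, which is the gap a compiler-backed rename closes.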

    What's wrong with boilerplate? You're using boilerplate every time you write various simple structures. while loop. for loop. if-then-else. A single keystroke should generate the template so you can tab to the various sections of the structure and fill them in without extraneous typing.

    --
    Trump is a poor man's idea of a rich man, a weak man's idea of a strong man, and a stupid man's idea of a smart man.
  • (Score: 2) by HiThere on Friday July 09 2021, @03:25PM (4 children)

    by HiThere (866) Subscriber Badge on Friday July 09 2021, @03:25PM (#1154325) Journal

    Well, ideally there would be a less verbose way to write that, such that the code was both more compact and easier to read and understand. But simple boilerplate like that is already pretty close. Think what the same statements would be like in assembler...or even just rewrite it as a while loop.
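For comparison, here is a rough sketch of the same counting loop written both ways in Java. The class and method names are made up for illustration, and the loop body (summing the counter) is an assumption standing in for whatever the real body would be.

```java
class LoopForms {
    // The for loop keeps initialization, test, and increment in one header.
    static int sumWithFor(int limit) {
        int total = 0;
        for (int i = 0; i < limit; i++) {
            total += i;
        }
        return total;
    }

    // The while form spreads the same three pieces over three places.
    static int sumWithWhile(int limit) {
        int total = 0;
        int i = 0;              // initialization moves above the loop
        while (i < limit) {     // only the test stays in the header
            total += i;
            i++;                // the increment moves into the body
        }
        return total;
    }
}
```

Both methods return the same result for any limit; the while version is simply longer and leaves the counter in scope after the loop.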

    That said, every time you use a templated class you're using automated boilerplate. Every time you use an inline function you're using custom boilerplate. Etc.
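As a small illustration of that point, here is a sketch of a generic class in Java; Box and BoxDemo are hypothetical names. One caveat: C++ templates generate a separate copy of the code per type, whereas Java generics use type erasure, so in Java the "automated" part is the compiler checking types and inserting casts rather than duplicating the class.

```java
// One definition serves every element type, instead of a hand-written
// StringBox, IntegerBox, and so on.
class Box<T> {
    private final T value;

    Box(T value) {
        this.value = value;
    }

    T get() {
        return value;
    }
}

class BoxDemo {
    static String describe() {
        Box<String> name = new Box<>("loop");
        Box<Integer> count = new Box<>(30);
        // No casts needed at the call site; the compiler supplies them.
        return name.get() + ":" + count.get();
    }
}
```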

    But "non-standardized boilerplate" is annoying. I've been known to write a function to deal with it even at the cost of some efficiency.

    --
    Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
    • (Score: 2) by DannyB on Friday July 09 2021, @03:33PM (3 children)

      by DannyB (5839) Subscriber Badge on Friday July 09 2021, @03:33PM (#1154331) Journal

      A good IDE lets you create your own templates (i.e., boilerplate). If you have some construction that you frequently type, you can make it a template, complete with variables. It works the same. A keystroke generates the template at the point where you are typing. You can tab through the variables to name them differently as you wish, but renaming one variable renames it everywhere within that template -- until you start filling in the body, if it has a body.
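The idea of a template with variables can be sketched in a few lines of Java. This is only an illustration under stated assumptions: TemplateSketch and expand are hypothetical names, the placeholder names index and limit are invented, and a real IDE does this interactively in the editor rather than by string substitution.

```java
import java.util.Map;

class TemplateSketch {
    // Replaces each ${name} placeholder with its value from the map.
    // Naming a variable once in the map fills in every occurrence.
    static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String forLoop = "for ( int ${index} = 0; ${index} < ${limit}; ${index}++ ) { }";
        System.out.println(expand(forLoop, Map.of("index", "z", "limit", "30")));
        // for ( int z = 0; z < 30; z++ ) { }
    }
}
```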

      Some people don't like the noise and complexity of modern IDEs and prefer a simple text editor.

      Some people don't like the noise and complexity of a backhoe and prefer to dig a ditch using a shovel. And much more worser is that a backhoe requires a bit of learning to use. Best to stick to the shovel.

      • (Score: 2) by hendrikboom on Friday July 09 2021, @05:37PM (1 child)

        by hendrikboom (1125) Subscriber Badge on Friday July 09 2021, @05:37PM (#1154388) Homepage Journal

        And sometimes the language has features that allow the compiler to expand the boilerplate instead of the editor. Yes, complete with proper handling of bound variables.
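Java has no macro system, so the following is only an analogy and not necessarily the kind of feature the comment has in mind. It sketches one case where the Java compiler expands boilerplate itself: the enhanced for statement, which is defined in terms of an iterator loop. The class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

class ExpandedLoop {
    // What the programmer writes.
    static int totalLength(List<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return total;
    }

    // Roughly what the compiler generates. In the real expansion the
    // iterator variable has no source-level name, so it cannot clash
    // with anything the programmer declares; "it" here is a stand-in.
    static int totalLengthExpanded(List<String> words) {
        int total = 0;
        for (Iterator<String> it = words.iterator(); it.hasNext(); ) {
            String w = it.next();
            total += w.length();
        }
        return total;
    }
}
```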

        • (Score: 2) by DannyB on Friday July 09 2021, @08:43PM

          by DannyB (5839) Subscriber Badge on Friday July 09 2021, @08:43PM (#1154437) Journal

          The compiler and language may already have ways of hiding various scopes of variables. The problem is that if I want to change the name of variables A and B to be named X and Y, I don't want to have to go change every instance of them by hand. How would a compiler let you do that to your source code?

          In an IDE, I can click on a variable, ctrl-shift-R, then rename that variable, and the IDE precisely and exactly changes all occurrences of that variable and not any other identifiers that happen to have the same name but in other scopes or contexts (like a function, a variable, a class and a type all named A). And if that variable is visible in other parts of the project, other files, it changes them there too! It is not some dumb search/replace. It is based on the compiler's understanding of the scope and visibility of that identifier throughout the entire project. The compiler and editor are deeply integrated.
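A short Java sketch shows how one name can legitimately mean several different things in a single file, which is what a scope-aware rename has to sort out. SameName and its members are hypothetical and exist only for illustration.

```java
class SameName {
    static int a = 1;                 // a field named a

    static int a() {                  // a method named a
        return 10;
    }

    static int demo() {
        int total = 0;
        for (int a = 0; a < 3; a++) { // a local variable a, shadowing the field
            total += a;               // refers to the loop variable only
        }
        // After the loop, the field and the method are reachable again.
        return total + SameName.a + a();
    }
}
```

Renaming only the loop variable should touch three occurrences in the header and one in the body, and leave the field and the method alone; a textual replace of "a" cannot make that distinction.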

          I happen to use Eclipse. No matter what language I'm editing, the editor is integrated with the proper compiler or language server.

      • (Score: 2) by HiThere on Friday July 09 2021, @06:04PM

        by HiThere (866) Subscriber Badge on Friday July 09 2021, @06:04PM (#1154398) Journal

        If the IDE/editor ends up writing verbose text via a boiler-plate template, it's nearly as hard to read a month later as if you had written it by hand. And if you need to customize the variables...well, I've been known to miss some of those, or to do a replace that wasn't limited to the appropriate areas of text. Usually that causes an immediate error, but sometimes it's quite difficult to track down.

        This I feel to be the appropriate use-case for templated/generic classes. But, of course, those can't handle all the cases that a template substitution can. OTOH, custom macros are REALLY dangerous. Used appropriately, they're very useful. Overused, or used inappropriately, they render the program text nearly unreadable.

        Back in the day (Fortran IV days) I once was really attracted to macro templated code. (Look at Mortran https://en.wikipedia.org/wiki/Mortran or DYSTAL https://pubmed.ncbi.nlm.nih.gov/14284294/, though DYSTAL was really more of a library.) But they rendered the code unreadable by anyone else, and after a while unreadable by me.

  • (Score: 1) by shrewdsheep on Friday July 09 2021, @03:54PM (1 child)

    by shrewdsheep (5215) on Friday July 09 2021, @03:54PM (#1154342)

    To some degree, it is a matter of opinion, so nothing really to disagree about.

    As the topic seems to be somewhat sensitive, a definition would be appropriate as a starting point: if the information expressed in N lines of code can be given in N/K lines of code, I call the code boilerplate. For me, K is somewhere between 5 and 10. With this definition, your for loop would not count for me. My programming style centers on the "do not repeat yourself" principle and small units of code. For me, a function containing 20 lines of code is long (well, in high-level languages like R/python/perl; I do not manage that in C++). The examples given for the Copilot, I would have factored out into smaller functions in most cases.
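The definition above can be written down as a one-line check. This is only a sketch of the arithmetic: BoilerplateRule and isBoilerplate are hypothetical names, and the line counts in the comments are invented examples, not measurements.

```java
class BoilerplateRule {
    // Code of actualLines counts as boilerplate when the same information
    // fits in actualLines / k lines or fewer, i.e. actualLines >= k * compactLines.
    static boolean isBoilerplate(int actualLines, int compactLines, int k) {
        return actualLines >= k * compactLines;
    }
}
```

Under these assumptions, a 3-line for loop that could at best be squeezed into 1 line gives a ratio of 3, below K = 5, so it does not count. A hypothetical 50-line packaging skeleton that 5 lines of metadata could describe gives a ratio of 10 and counts for any K between 5 and 10.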

    The Copilot goes the wrong way around, IMO. Instead of suggesting boilerplate code, the boilerplate code should be avoided altogether. One very concrete example is packaging where many tools exist to create skeletons (be it R/python/perl/whatever) for you. This is the wrong way round. The information needed is just the code (being inline documented) and a single dictionary containing required meta-information. Basically adding 10-20 lines of description to an existing code directory should allow you to create a package. From this the entire packaging can proceed. The actual package is always temporary code.

    • (Score: 2) by DannyB on Friday July 09 2021, @08:50PM

      by DannyB (5839) Subscriber Badge on Friday July 09 2021, @08:50PM (#1154441) Journal

      I think everyone is in favor of making things simple and less verbose. As simple as possible, but not any simpler.

      Now how simple it should be depends on how big your projects are. Java is used for very large source code bases. Many diverse teams may write many different modules or libraries that end up in a single executable.

      I think most languages could benefit from some form of IDE assistance to help you type out repetitive templates of code. I used the for() loop for an example, because the typical pattern of a for loop is to have a single variable that is referenced three times. You should only have to type in that variable name once, not three times. When you change the first occurrence of the variable, it should change the others, keystroke by keystroke.

      The DRY principle is something I strongly embrace. But in a for() loop, you typically repeat the variable name three times. Or more times if you reference that variable within the body of the loop and not just the initialization, increment, and test of the loop construct. It sure is nice if I can change that variable name one time and have it change everywhere it is used within that loop construct.
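Some languages also offer constructs that name the loop variable only once in the loop control, independent of IDE support. The following Java sketch contrasts the classic header with IntStream.range; the class and method names are hypothetical, and summing squares is just a stand-in body.

```java
import java.util.stream.IntStream;

class NameOnce {
    // Classic form: "i" is typed three times in the header alone.
    static int sumSquaresClassic(int limit) {
        int total = 0;
        for (int i = 0; i < limit; i++) {
            total += i * i;
        }
        return total;
    }

    // Range form: the bounds are stated without naming a variable,
    // and "i" is introduced once, as the lambda parameter.
    static int sumSquaresRange(int limit) {
        return IntStream.range(0, limit).map(i -> i * i).sum();
    }
}
```

Either way, references to the variable inside the body still repeat the name, which is where an IDE rename remains useful.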

      I'm not against something like CoPilot, if it works well. But the copyright and license issues are a genuine concern.
