GitHub Copilot may steer Microsoft into a copyright lawsuit
GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim.
On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is working with Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open source code, and is the tool improperly emitting other people's copyrighted work – pulled from the training data – to suggest code snippets to users?
Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided.
That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.
Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code.
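For context on what such a routine looks like, the sketch below transposes a matrix stored in CSR (compressed sparse row) form – the standard counting-sort approach. It is a generic textbook illustration of the kind of algorithm at issue, not Davis's copyrighted code; the function name and plain-list representation are this sketch's own choices.

```python
def csr_transpose(n_rows, n_cols, row_ptr, col_idx, vals):
    """Transpose an n_rows x n_cols CSR matrix.

    Returns (t_ptr, t_col, t_val): the CSR arrays of the
    n_cols x n_rows transpose.
    """
    nnz = len(vals)
    # Count entries per column (= per row of the transpose).
    t_ptr = [0] * (n_cols + 1)
    for c in col_idx:
        t_ptr[c + 1] += 1
    # Prefix-sum the counts into row pointers for the transpose.
    for c in range(n_cols):
        t_ptr[c + 1] += t_ptr[c]
    # Scatter each entry into its slot in the transposed matrix.
    t_col = [0] * nnz
    t_val = [0] * nnz
    next_slot = t_ptr[:-1].copy()  # write cursor per transposed row
    for r in range(n_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            c = col_idx[k]
            dst = next_slot[c]
            t_col[dst] = r
            t_val[dst] = vals[k]
            next_slot[c] += 1
    return t_ptr, t_col, t_val
```

Even a short, well-known routine like this has enough distinctive structure (variable names, loop ordering, comments) that a verbatim regurgitation of a specific author's version is recognizable – which is how Davis spotted his own code in Copilot's output.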
Asked to comment, Davis said he would prefer to wait until he has heard back from GitHub and its parent Microsoft about his concerns.
In an email to The Register, Butterick indicated there's been a strong response to news of his investigation.
"Clearly, many developers have been worried about what Copilot means for open source," he wrote. "We're hearing lots of stories. Our experience with Copilot has been similar to what others have found – that it's not difficult to induce Copilot to emit verbatim code from identifiable open source repositories. As we expand our investigation, we expect to see more examples.
"But keep in mind that verbatim copying is just one of many issues presented by Copilot. For instance, a software author's copyright in their code can be violated without verbatim copying. Also, most open-source code is covered by a license, which imposes additional legal requirements. Has Copilot met these requirements? We're looking at all these issues."
Spokespeople for Microsoft and GitHub were unable to comment for this article. However, GitHub's documentation for Copilot warns that the output may contain "undesirable patterns" and puts the onus of intellectual property infringement on the user of Copilot. That is to say, if you use Copilot to auto-complete code for you and you get sued, you were warned. That warning suggests GitHub anticipated that Copilot could reproduce copyrighted code.
[...] "Obviously, it's ironic that GitHub, a company that built its reputation and market value on its deep ties to the open source community, would release a product that monetizes open source in a way that damages the community. On the other hand, considering Microsoft's long history of antagonism toward open source, maybe it's not so surprising. When Microsoft bought GitHub in 2018, a lot of open source developers – me included – hoped for the best. Apparently that hope was misplaced."
(Score: 4, Touché) by tekk on Saturday October 22 2022, @04:14PM
Come on, it's pretty obvious isn't it?
All Copilot has to do is only scrape repos with approved licenses, then automatically generate and check in a multi-megabyte file containing attribution information for every single GitHub repo it used in its training data :^)