SoylentNews is people

posted by martyb on Wednesday March 21 2018, @04:35PM
from the rigid-coding-guidelines++ dept.

Anonymous coders can be identified using stylometry and machine learning techniques applied to executable binaries:

Source code stylometry – analyzing the syntax of source code for clues about the author – is an established technique used in digital forensics. As the US Army Research Laboratory (ARL) puts it, "Stylometry research has proven that anonymous code contributors can be de-anonymized to reveal the original author, provided the author has published code before."

The technique can help identify virus makers as well as unmask the creators of anti-censorship tools and other outlawed programs. It has the potential to pierce the privacy that many programmers assume they have.

Source code is designed to be human-readable, but binaries – typically produced by compiling or assembling source code – have fewer characteristics that may suggest authorship. Toolchains can be instructed to strip out variable names, function names and other symbols and metadata – which may say something about the author – and alter the structure of code through optimization.

Nonetheless, the researchers – Aylin Caliskan, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt and Arvind Narayanan – building on work described in a 2011 paper, demonstrate that binary files can be analyzed using machine-learning and stylometric techniques.

If you want to remain an anonymous coder, you'd better not contribute anything under your own name publicly:

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries (arXiv:1512.08546 [cs.CR])

We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 and 83% with 600 candidate programmers. We present an executable binary authorship attribution approach, for the first time, that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.


  • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @04:47PM

    by Anonymous Coward on Wednesday March 21 2018, @04:47PM (#656187)

    And when they arrest the Dread Coder Roberts, we'll finally have our pure anarcho-capitalist utopia, in a trailer park in the Florida panhandle.

    Good times!

  • (Score: 2) by meustrus on Wednesday March 21 2018, @05:11PM (8 children)

    by meustrus (4961) on Wednesday March 21 2018, @05:11PM (#656206)

    It's not that hard to adopt a different coding style. You just need to do it consciously.

    Remember all of the stupid holy wars where you have to pick a side day to day? Pick the other side. Use tabs instead of spaces. Adopt Polish notation. Make your code more or less functional.

    Your mileage may vary - I haven't tried this - but I figure that if you end up with a strong urge to "fix" the code, that code is probably not able to be traced back to you. At least until you start releasing code in your new alternative style.

    --
    If there isn't at least one reference or primary source, it's not +1 Informative. Maybe the underused +1 Interesting?
    • (Score: 2) by Snotnose on Wednesday March 21 2018, @05:52PM (5 children)

      by Snotnose (1623) on Wednesday March 21 2018, @05:52PM (#656239)

      Tabs vs spaces goes away during compilation. Strip out symbols and CamelCase goes away. Polish notation goes away. Indentation goes away.

      I'm curious how they do it. In C I use jump tables of function pointers more than most people, but that's just because I know the syntax and write state machines fairly often. So that may be my identifier, assuming they're looking at state machines.

      --
      My ducks are not in a row. I don't know where some of them are, and I'm pretty sure one of them is a turkey.
      • (Score: 2, Insightful) by Anonymous Coward on Wednesday March 21 2018, @07:06PM (2 children)

        by Anonymous Coward on Wednesday March 21 2018, @07:06PM (#656289)

        I'm curious how they do it.

        TFS says "machine learning". In other words, no-one really knows how, but it looks for some kind of identifying patterns.

        • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @11:24PM (1 child)

          by Anonymous Coward on Wednesday March 21 2018, @11:24PM (#656384)

          TFS says "machine learning". In other words, no-one really knows how, but it looks for some kind of identifying patterns.

          The bonus is, it's enough to _look like_ it looks for some kind of identifying patterns, to nontechnical paying customers (high bosses in charge of censorship). Given the inherently unverifiable nature of this kind of "proof", any number of false positives is perfectly acceptable. The higher the better, probably - the easier to convict whoever is expedient.

          • (Score: 2) by Wootery on Thursday March 22 2018, @09:50AM

            by Wootery (2341) on Thursday March 22 2018, @09:50AM (#656534)

            Given the inherently unverifiable nature of this kind of "proof"

            What? It's just a classifier, no? It can be tested like any classifier.
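
Testing a classifier, as suggested above, usually means holding back labelled samples and scoring the predictions against them. A minimal Python sketch, assuming synthetic data and a nearest-centroid classifier; neither comes from the paper, and every name below is made up for illustration:

```python
import random


def make_samples(n_authors, per_author, dims, noise, rng):
    """Synthetic data: each hypothetical author is a 'style' centre plus noise."""
    samples = []
    for author in range(n_authors):
        centre = [rng.random() for _ in range(dims)]
        for _ in range(per_author):
            vec = [c + rng.gauss(0, noise) for c in centre]
            samples.append((vec, author))
    return samples


def centroids(train):
    """Average the training vectors of each author."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(cents, vec):
    """Return the author whose centroid is closest (squared Euclidean distance)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(cents[lab], vec)))


def held_out_accuracy(samples, test_fraction, rng):
    """Shuffle, hold back a test split, train on the rest, score the test split."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    test, train = shuffled[:cut], shuffled[cut:]
    cents = centroids(train)
    correct = sum(predict(cents, vec) == label for vec, label in test)
    return correct / len(test)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_authors = 5
samples = make_samples(n_authors, per_author=20, dims=8, noise=0.01, rng=rng)
accuracy = held_out_accuracy(samples, test_fraction=0.25, rng=rng)
chance = 1 / n_authors  # baseline: guessing an author at random
print(f"held-out accuracy {accuracy:.2f} vs chance {chance:.2f}")
```

The number that matters is the gap between held-out accuracy and the chance baseline, and how it holds up when the data set changes. False positives can be counted the same way.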

      • (Score: 2) by pipedwho on Thursday March 22 2018, @03:34AM

        by pipedwho (2032) on Thursday March 22 2018, @03:34AM (#656467)

        That's also how I code in C. After a run through their ML algorithm, they might come up with proof that you are me.

      • (Score: 2) by Wootery on Thursday March 22 2018, @04:44PM

        by Wootery (2341) on Thursday March 22 2018, @04:44PM (#656676)

        goes away

        Indeed. The 'before' and 'after' C programs would still be alpha equivalent [c2.com] and would presumably generate identical object code, as you say. (Well, renaming functions and globals might cause strings in the object files to change, I suppose.)
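
The same effect can be seen in a much smaller setting. The sketch below uses CPython bytecode rather than C object code, so it is only an analogy: two functions that differ only in their local names are compared instruction for instruction. The function names are invented for the example.

```python
def sum_then_double(first, second):
    total = first + second
    return total * 2


def f(a, b):
    t = a + b
    return t * 2


# Local variables are addressed by slot index, so renaming them leaves the
# instruction bytes untouched.
same_code = sum_then_double.__code__.co_code == f.__code__.co_code

# The names themselves are kept in a separate table, much like a symbol table.
same_names = sum_then_double.__code__.co_varnames == f.__code__.co_varnames

print("identical instructions:", same_code)
print("identical name table:", same_names)
```

On CPython the instruction bytes should compare equal while the name tables differ, which mirrors the point about strings in object files: the names survive only as metadata that can be stripped.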

    • (Score: 3, Funny) by Anonymous Coward on Wednesday March 21 2018, @09:53PM (1 child)

      by Anonymous Coward on Wednesday March 21 2018, @09:53PM (#656357)

      I copy all my code from Stack Overflow. I guess about 20 random programmers would be incorrectly identified as the authors. XD

      • (Score: 0) by Anonymous Coward on Thursday March 22 2018, @08:43AM

        by Anonymous Coward on Thursday March 22 2018, @08:43AM (#656522)

        So, what have you been doing since we fired you?

  • (Score: 0, Offtopic) by cocaine overdose on Wednesday March 21 2018, @05:14PM (6 children)

    > We advance the state of executable binary authorship attribution.

    Hey hoh, it's absolutely nothing. If you give me 500 repos and 500 bins, it becomes trivial to match them up. Postdocs can service my balls and phallus. Rossbaum, unlike our femme-dominated "research team," did real work with real implications. His methods would allow any actor with sufficient resources, to comb through the entirety of all publish compiled code and group them under likely authors. These girls (and one HAPA) just link binaries to source codes. How is this fucking useful? Rossbutt could do bin-to-bin comparisons, and any downie with a hex editor can tell which bin goes to which source code. Surprise surprise, after you tear apart the fluff they tried to hide their "methods" under, it turns out to be dog shit unimportant.

    Publish your burpsuite plugins under different compilers and optimizations.

    And unnulled.virginity can sniff my stinky protein farts. One step up from Hack Forums, twenty steps down from Kickass, and about three hundred and fifty thousand steps down from not being 13 year old script kiddie arabs with pozzed-up blog pages about their "exploits." The real biz is done over XMPP. Start with exploit.im then go tell all your #anonsec middle school buddies about it.

    • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @05:22PM (1 child)

      by Anonymous Coward on Wednesday March 21 2018, @05:22PM (#656218)

      U not 4unny br0.

      my hax0r be s0 l33t!

      U dont no.

      that is all.

    • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @06:04PM (3 children)

      by Anonymous Coward on Wednesday March 21 2018, @06:04PM (#656251)

      Didn't read lol

      • (Score: 3, Touché) by cocaine overdose on Wednesday March 21 2018, @06:27PM (2 children)

        Made you look.
        • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @11:44PM (1 child)

          by Anonymous Coward on Wednesday March 21 2018, @11:44PM (#656389)

          The spoiler tag never works for me, can you spoil the spoiler?

  • (Score: -1, Offtopic) by Anonymous Coward on Wednesday March 21 2018, @05:14PM

    by Anonymous Coward on Wednesday March 21 2018, @05:14PM (#656211)

    this is a good example why i don't pay federal taxes.

  • (Score: 3, Interesting) by captain normal on Wednesday March 21 2018, @05:28PM

    by captain normal (2205) on Wednesday March 21 2018, @05:28PM (#656223)

    Does this mean we'll soon know who Satoshi Nakamoto is?

    --
    When life isn't going right, go left.
  • (Score: 3, Insightful) by Virindi on Wednesday March 21 2018, @05:31PM (2 children)

    by Virindi (3484) on Wednesday March 21 2018, @05:31PM (#656226)

    The technique can help...unmask the creators of anti-censorship tools and other outlawed programs.

    Except computer code is the very essence of free speech. It is a mathematical description of a way to complete a task. Governments have no right to "outlaw" certain types of programs, only ways of using them.

    Hearing people talk about "outlawed code" as though it is only natural that telling others how to do certain tasks be a crime bothers me. Such an attitude should not be normalized in society.

    • (Score: 4, Informative) by khallow on Wednesday March 21 2018, @05:54PM (1 child)

      by khallow (3766) Subscriber Badge on Wednesday March 21 2018, @05:54PM (#656242) Journal

      Governments have no right to "outlaw" certain types of programs

      They merely have the power to do so.

      • (Score: 2) by fyngyrz on Wednesday March 21 2018, @08:49PM

        by fyngyrz (6567) on Wednesday March 21 2018, @08:49PM (#656331) Journal

        Governments have no right to "outlaw" certain types of programs

        They merely have the power to do so.

        "Rights" without power behind them are merely ideas, not rights.

        The only right worthy of the name is one that others will exert force to make real.

        The US government arm of such force is one that, theoretically, proceeds from authorization from the constitution on down, through power enabled by constitutionally compliant legislation. Unfortunately, as the USG rarely bothers with the authorization step these days, particularly as it relates to any supposed rights of the citizens, it's down to nothing but exertion of power with few, or no, checks.

        So, again: a "right" without power behind it, as far as the USG is concerned, is no right at all.

        We're not to a place where our citizens are willing, or able, to assert power to back any rights they might want, either. Although I do think I am sensing a bit more hot air being spewed lately.

  • (Score: 3, Informative) by jimtheowl on Wednesday March 21 2018, @06:16PM (4 children)

    by jimtheowl (5929) on Wednesday March 21 2018, @06:16PM (#656259)
    Direct link: https://arxiv.org/pdf/1512.08546.pdf [arxiv.org]

    It is relatively easy to distinguish differences in a small set. Increase that set by an order of magnitude or two, and it becomes much more difficult to distinguish specific elements.
    I believe this is why they work from such a low sample number (100 programmers); it makes the results much more impressive.

    Good to be reminded of Radare http://rada.re/r/ [rada.re] though, and overall, quite an interesting paper.
    • (Score: 2) by bob_super on Wednesday March 21 2018, @06:34PM (2 children)

      by bob_super (1357) on Wednesday March 21 2018, @06:34PM (#656272)

      The stupid really shows up when you realize that a lot of the most "popular" nasties flowing around the web are just derivative works from a flaw's proof of concept, or a particularly good prior worm.
      Run this against modern attacks, and you'll find a really high probability that the NSA wrote them.

      • (Score: 2) by jimtheowl on Wednesday March 21 2018, @07:06PM (1 child)

        by jimtheowl (5929) on Wednesday March 21 2018, @07:06PM (#656290)
        I don't know about nasties, but there seems to be an ever increasing number of modern 'programmers' who are only good for browsing the net and gluing pieces of existing code together. I'm all for reuse, but there is something about being able to write original code when required.

        I wonder if their method would catch that.
        • (Score: 5, Funny) by Anonymous Coward on Wednesday March 21 2018, @07:10PM

          by Anonymous Coward on Wednesday March 21 2018, @07:10PM (#656291)

          "After running the analyzer, we found out that 80% of the world's code was written by some guy named Dennis Ritchie."

    • (Score: 2) by c0lo on Thursday March 22 2018, @02:24AM

      by c0lo (156) Subscriber Badge on Thursday March 22 2018, @02:24AM (#656448) Journal

      From the same source:

      We emphasize that research challenges remain before programmer de-anonymization from executable binaries is fully ready for practical use. For example, programs may be authored by multiple programmers and may have gone through encryption.

      --
      https://www.youtube.com/watch?v=aoFiw2jMy-0 https://soylentnews.org/~MichaelDavidCrawford
  • (Score: 4, Interesting) by DannyB on Wednesday March 21 2018, @06:24PM (4 children)

    by DannyB (5839) Subscriber Badge on Wednesday March 21 2018, @06:24PM (#656264) Journal

    If you have a black box that can identify which of programmers A, B, or C wrote code X, with certain probabilities, then it seems that you could use that black box to teach another box how to apply code transformations upon X (written by A) to cause X to have a higher probability match with either B or C instead of A.

    Some transformations may be re-ordering of instructions with no dependencies. De-optimizing, and letting a compiler's optimizer re-apply standard optimizations in a standard way. Identifying how to break apart large functions into smaller ones. Or combining smaller ones into larger. Changing parameter ordering to a different style.

    What I'm getting at is to, in some way, teach another machine to understand what the first machine is recognizing. That is, what transformations work to alter the match probabilities. If both A's version of a program and B's version of a program are correct, then there may exist some transformation applications between A's code and B's code.

    Now some questions of style may be what algorithm was used. A used a heap sort function, while B used a quick sort function. So recognition of common algorithms may also be required. It gets down to what are the characteristics that make A's code distinguishable from B's code. When talking about binary, it is clearly not spelling, spacing, indentation, curly brace style, etc.

    Maybe more or less frequent use of intermediate variables. But I would think that transformation to SSA and optimizations applied would eliminate or obscure much of that from the source.

    What are the identifiable features that the machine learning is distinguishing? It's not like you can ask it. And it's not clearly "coded" somewhere in the trained result. But can another machine learn by randomly applying transformations to working code, to get the first machine to indicate different probability matches to each of a pool of potential authors?

    Assuming the answer to that is Yes. Imagine the possibilities! (Sort of like pr0n deep fakes.)

    --
    To transfer files: right-click on file, pick Copy. Unplug mouse, plug mouse into other computer. Right-click, paste.
    • (Score: 0) by Anonymous Coward on Wednesday March 21 2018, @09:52PM

      by Anonymous Coward on Wednesday March 21 2018, @09:52PM (#656356)

      Kinda reminds me of JStylo & Anonymouth (author detection & camouflaging for text).

      https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth [drexel.edu]

      No idea how well or badly it works but I like the idea.

    • (Score: 0) by Anonymous Coward on Thursday March 22 2018, @03:03AM (2 children)

      by Anonymous Coward on Thursday March 22 2018, @03:03AM (#656458)

      The answer is yes. You'd use a genetic algorithm (GA) to mess with the binary. Or you can decompile the binary, run a GA on the code, recompile, then retest. It'll be really slow without optimizing the GA, but there are many ways to optimize it.

      The current possibilities are zero because no one currently does anything with the information. This tech is only useful to the police and people looking for false flag operations. What good would it be to have Photoshop whose binaries look like they were written by someone else? No one cares. You can break hashes by modifying the binary randomly, no need to make it mimic something else. There isn't any security software which whitelists software based on developer fingerprints.

      If the police state gets super bad then such software may become useful, but by then I'd expect advanced obfuscation tech to be better.
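
The idea in this sub-thread, a second program that only queries the black box and keeps whichever transformations move the verdict, can be sketched in a few lines. This is a toy under heavy assumptions: the "binary" is reduced to three invented style metrics, the black box is a distance-to-profile scorer standing in for a real classifier, and the search is plain greedy hill climbing rather than the genetic algorithm mentioned above.

```python
# Hypothetical style profiles the stand-in classifier has "learned".
AUTHOR_PROFILES = {
    "A": {"avg_func_len": 40, "params_per_func": 2, "jump_tables": 0},
    "B": {"avg_func_len": 12, "params_per_func": 4, "jump_tables": 3},
}


def black_box_scores(features):
    """Stand-in for the attribution classifier: higher score = better match."""
    return {
        author: -sum((features[k] - profile[k]) ** 2 for k in profile)
        for author, profile in AUTHOR_PROFILES.items()
    }


def attributed_author(features):
    scores = black_box_scores(features)
    return max(scores, key=scores.get)


def neighbours(features):
    """Pretend semantics-preserving transformations, each nudging one metric by 1
    (for example, splitting a function lowers the average function length)."""
    for key in features:
        for step in (-1, 1):
            changed = dict(features)
            changed[key] = max(0, changed[key] + step)
            yield changed


def disguise(features, target, max_steps=1000):
    """Greedy search that only ever looks at the black box's scores."""
    current = dict(features)
    for _ in range(max_steps):
        best = max(neighbours(current), key=lambda f: black_box_scores(f)[target])
        if black_box_scores(best)[target] <= black_box_scores(current)[target]:
            break  # no single transformation helps any more
        current = best
    return current


original = {"avg_func_len": 38, "params_per_func": 2, "jump_tables": 0}
disguised = disguise(original, target="B")
print(attributed_author(original), "->", attributed_author(disguised))
```

A real attack would need transformations that actually preserve program behaviour, and a far rougher score landscape, so this shows only the shape of the loop. The same loop pointed at someone else's profile is the false-evidence scenario raised in the reply below.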

      • (Score: 2) by maxwell demon on Thursday March 22 2018, @12:35PM (1 child)

        by maxwell demon (1608) on Thursday March 22 2018, @12:35PM (#656566) Journal

        Note that if such programmer identification is used in forensics, such a code transformation program could also be used to create false evidence against someone: Analyze code written by the target, then take some malware and optimize it to be "recognized" as the target's work by the algorithm. Spread the malware a little bit, then run the analysis on it (with the well-known result) and arrest the target who has been "identified" as the author.

        --
        The Tao of math: The numbers you can count are not the real numbers.
        • (Score: 2) by DannyB on Thursday March 22 2018, @02:29PM

          by DannyB (5839) Subscriber Badge on Thursday March 22 2018, @02:29PM (#656605) Journal

          Exactly what I had in mind when I said: Imagine the possibilities!

          --
          To transfer files: right-click on file, pick Copy. Unplug mouse, plug mouse into other computer. Right-click, paste.
  • (Score: 5, Insightful) by bradley13 on Wednesday March 21 2018, @07:16PM (2 children)

    by bradley13 (3053) on Wednesday March 21 2018, @07:16PM (#656293) Homepage Journal

    I used to do research in exactly this type of machine learning. FWIW, the field has not advanced very much in the past 25 years or so. There are two keys to success:

    1. Defining the features that can be used for the learning. This is done manually, although subsequently identifying those features in the data can (often, but not always) be automated.

    2. A bit cynical: having "friendly" data. On one data set, you may have great performance. Pick a different data set, one that seems like it ought to work just as well, and your performance may be much worse.

    In this case, the features are "low-level features extractable from disassemblers, with additional string and symbol information". This is the area where the authors may be able to claim some progress. They are looking for things like symbols and strings, for library calls, control graphs of functions, and the structure of abstract-syntax-trees of the decompiled code.

    When they stripped the symbol tables, this reduced their classification accuracy by 24%, so that was an important, but not critical, source of information.

    Thinking about the feature sets, and how you program: What are your favorite library functions? Do you use recursion? Write short functions or long ones? With lots of parameters or few? What language features do you use that would show up in binary code: Abstract classes? Interfaces? There are lots of individual bits and pieces that, together, may well lead to a personal "fingerprint".

    I give the authors a lot of credit for section (D) in the paper, where they try their methods on real repositories cloned from GitHub. After lots of waffling and excuses (see my point above about "friendly" data), their accuracy turns out to be 65% on a pool of 90 authors. This should be compared with the 96% accuracy claimed in the abstract for a pool of 100 authors. The 65% is their real-world result.

    tl;dr: It's an interesting paper. However, we don't have to worry just yet. A 65% real-world accuracy from a pool of 100 isn't really dangerous to anonymity. There are a lot more than 100 programmers in the world, and accuracy decreases with pool size.

    --
    Everyone is somebody else's weirdo.
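
The "fingerprint" idea in the comment above can be illustrated with a toy: count the features seen in each author's known programs, then attribute an unknown program to the closest profile. This is a minimal sketch, not the paper's pipeline. The feature tokens (`call:qsort`, `cfg:loop` and so on) and the author names are invented, and cosine similarity stands in for whatever classifier the researchers actually trained.

```python
from collections import Counter
import math


def feature_vector(tokens):
    """Turn a bag of feature tokens into relative frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_profiles(labelled):
    """Merge every known sample of an author into one frequency profile."""
    merged = {}
    for author, tokens in labelled:
        merged.setdefault(author, []).extend(tokens)
    return {author: feature_vector(tokens) for author, tokens in merged.items()}


def attribute(profiles, tokens):
    """Pick the author whose profile is most similar to the unknown sample."""
    vec = feature_vector(tokens)
    return max(profiles, key=lambda author: cosine(profiles[author], vec))


# Hypothetical features as a disassembler or decompiler pass might emit them.
known = [
    ("alice", ["call:qsort", "cfg:recursion", "call:printf", "call:qsort"]),
    ("bob", ["call:memcpy", "cfg:loop", "cfg:loop", "call:printf"]),
]
profiles = build_profiles(known)
unknown = ["call:memcpy", "cfg:loop", "call:printf"]
print(attribute(profiles, unknown))
```

With two authors and a handful of tokens the match is trivial. The pool-size point made in the comment is exactly that this separation gets harder as the number of candidate profiles grows.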
    • (Score: 2) by Common Joe on Wednesday March 21 2018, @08:10PM

      by Common Joe (33) <common.joe.0101NO@SPAMgmail.com> on Wednesday March 21 2018, @08:10PM (#656310) Journal

      This sounds an awful lot like graphology -- sounds good on the surface, but really doesn't do the job. But, of course, I don't really know if "they" are trying to frighten us or they found something interesting.

      With that said, is it worth some people risking their lives to write code? I suppose they'll have to make those decisions.

      Not trying to refute anything you said or come to any conclusions. Merely thinking out loud.

    • (Score: 2) by maxwell demon on Thursday March 22 2018, @12:42PM

      by maxwell demon (1608) on Thursday March 22 2018, @12:42PM (#656567) Journal

      I'd also expect that the better optimizers get, the harder it will become to analyze coding patterns from the compiled source. Did the author write a large function, or did he write many small functions that got inlined? Did the author write a loop, or did he write a tail-recursive function and the compiler did tail-recursion elimination? Maybe the function as written wasn't even tail-recursive, but another optimization step by chance changed it into a tail-recursive function?

      --
      The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 4, Funny) by Anonymous Coward on Wednesday March 21 2018, @07:45PM (2 children)

    by Anonymous Coward on Wednesday March 21 2018, @07:45PM (#656302)

    This is why I always use anonymous functions.

    • (Score: 2) by LoRdTAW on Thursday March 22 2018, @02:46AM (1 child)

      by LoRdTAW (3755) on Thursday March 22 2018, @02:46AM (#656454) Journal

      Hipster! I bet you're a dirty JS coder too.

      • (Score: 1, Funny) by Anonymous Coward on Thursday March 22 2018, @09:52AM

        by Anonymous Coward on Thursday March 22 2018, @09:52AM (#656535)

        How can I be called a "hipster" if my cosplay name is Lambda Closure (spoken with a French accent)?

  • (Score: 1, Interesting) by Anonymous Coward on Thursday March 22 2018, @10:45AM

    by Anonymous Coward on Thursday March 22 2018, @10:45AM (#656546)

    "unmask the creators of anti-censorship tools and other outlawed programs"

    This is clearly written by people in a country where freedom of speech is illegal.
