

posted by martyb on Wednesday March 21 2018, @04:35PM   Printer-friendly
from the rigid-coding-guidelines++ dept.

Anonymous coders can be identified using stylometry and machine learning techniques applied to executable binaries:

Source code stylometry – analyzing the syntax of source code for clues about the author – is an established technique used in digital forensics. As the US Army Research Laboratory (ARL) puts it, "Stylometry research has proven that anonymous code contributors can be de-anonymized to reveal the original author, provided the author has published code before."

The technique can help identify virus makers as well as unmask the creators of anti-censorship tools and other outlawed programs. It has the potential to pierce the privacy that many programmers assume they have.

Source code is designed to be human-readable, but binaries – typically produced by compiling or assembling source code – have fewer characteristics that may suggest authorship. Toolchains can be instructed to strip out variable names, function names and other symbols and metadata – which may say something about the author – and alter the structure of code through optimization.

Nonetheless, the researchers – Aylin Caliskan, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt and Arvind Narayanan – building on work described in a 2011 paper, demonstrate that binary files can be analyzed using machine-learning and stylometric techniques.

If you want to remain an anonymous coder, you'd better not contribute anything under your own name publicly:

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries (arXiv:1512.08546 [cs.CR])

We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 and 83% with 600 candidate programmers. We present an executable binary authorship attribution approach, for the first time, that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.


Original Submission

 
  • (Score: 5, Insightful) by bradley13 on Wednesday March 21 2018, @07:16PM (2 children)

    by bradley13 (3053) on Wednesday March 21 2018, @07:16PM (#656293) Homepage Journal

    I used to do research in exactly this type of machine learning. FWIW, the field has not advanced very much in the past 25 years or so. There are two keys to success:

    1. Defining the features that can be used for the learning. This is done manually, although subsequently identifying those features in the data can (often, but not always) be automated.

    2. A bit cynical: having "friendly" data. On one data set, you may have great performance. Pick a different data set, one that seems like it ought to work just as well, and your performance may be much worse.

    In this case, the features are "low-level features extractable from disassemblers, with additional string and symbol information". This is the area where the authors may be able to claim some progress. They are looking at things like symbols and strings, library calls, control-flow graphs of functions, and the structure of the abstract syntax trees of the decompiled code.
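    As a rough illustration of the kind of low-level features involved, here is a toy sketch (my own, not from the paper) that turns a list of disassembled instruction mnemonics into bigram counts, one of the simplest stylometric feature families. A real pipeline would obtain the mnemonics from a disassembler such as objdump, and would add string, symbol, and control-flow-graph features on top:

```python
from collections import Counter

def mnemonic_bigrams(mnemonics):
    """Count adjacent instruction-mnemonic pairs (bigrams).

    Toy stand-in for the 'low-level disassembler features' the paper
    describes; a real extractor would also pull strings, symbols,
    and control-flow-graph features out of the binary.
    """
    return Counter(zip(mnemonics, mnemonics[1:]))

# Hypothetical disassembly of one small function (mnemonics only):
disasm = ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"]
features = mnemonic_bigrams(disasm)
```

    Feature vectors like this, built per binary, are what the classifier is trained on.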

    When they stripped the symbol tables, this reduced their classification accuracy by 24%, so that was an important, but not critical, source of information.

    Thinking about the feature sets, and how you program: What are your favorite library functions? Do you use recursion? Do you write short functions or long ones? With lots of parameters or few? Which language features do you use that would show up in binary code: abstract classes? Interfaces? There are lots of individual bits and pieces that, together, may well add up to a personal "fingerprint".
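    Those habits only matter once they are combined into a classifier. A minimal sketch of attribution by nearest neighbor over such feature counts (my own toy example, with hypothetical author profiles; the paper itself uses a random-forest classifier over far richer features):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse feature Counters."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(sample, profiles):
    """Return the known author whose profile is most similar to the sample."""
    return max(profiles, key=lambda author: cosine(sample, profiles[author]))

# Hypothetical stylistic profiles: feature -> count, per known author.
profiles = {
    "alice": Counter({"recursion": 9, "short_funcs": 7, "memcpy": 1}),
    "bob":   Counter({"loops": 8, "long_funcs": 6, "sprintf": 5}),
}
unknown = Counter({"recursion": 4, "short_funcs": 3})
guess = attribute(unknown, profiles)
```

    The point isn't the particular similarity measure; it's that any stable combination of such habits narrows down the author pool.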

    I give the authors a lot of credit for section (D) in the paper, where they try their methods on real repositories cloned from GitHub. After lots of waffling and excuses (see my point above about "friendly" data), their accuracy turns out to be 65% on a pool of 90 authors. Compare that with the 96% accuracy claimed in the abstract for a pool of 100 authors. The 65% is their real-world result.

    tl;dr: It's an interesting paper. However, we don't have to worry just yet. A 65% real-world accuracy from a pool of 90 isn't really dangerous to anonymity. There are a lot more than 90 programmers in the world, and accuracy decreases with pool size.
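    For a sense of scale, the chance baseline for attribution shrinks as the candidate pool grows, which is the effect being described. A quick back-of-the-envelope computation (mine, using the figures quoted above):

```python
# Random-guess baseline for attributing one binary among N candidates.
def chance_baseline(n_candidates):
    return 1.0 / n_candidates

# Figures quoted above: 65% accuracy on a 90-author GitHub pool.
baseline_90 = chance_baseline(90)   # roughly 1.1% by pure guessing
real_world = 0.65                   # reported real-world accuracy
lift = real_world / baseline_90     # how far above chance the method is
```

    So 65% is far above chance for 90 candidates, but as the pool approaches "every programmer in the world", the baseline collapses and the classifier's absolute accuracy falls with it.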

    --
    Everyone is somebody else's weirdo.
  • (Score: 2) by Common Joe on Wednesday March 21 2018, @08:10PM

    by Common Joe (33) <common.joe.0101NO@SPAMgmail.com> on Wednesday March 21 2018, @08:10PM (#656310) Journal

    This sounds an awful lot like graphology -- sounds good on the surface, but really doesn't do the job. But, of course, I don't really know whether "they" are trying to frighten us or whether they actually found something interesting.

    With that said, is it worth some people risking their lives to write code? I suppose they'll have to make those decisions.

    Not trying to refute anything you said or come to any conclusions. Merely thinking out loud.

  • (Score: 2) by maxwell demon on Thursday March 22 2018, @12:42PM

    by maxwell demon (1608) on Thursday March 22 2018, @12:42PM (#656567) Journal

    I'd also expect that the better optimizers get, the harder it will become to recover coding patterns from the compiled binary. Did the author write a large function, or many small functions that got inlined? Did the author write a loop, or a tail-recursive function on which the compiler performed tail-call elimination? Maybe the function as written wasn't even tail-recursive, but another optimization step happened to transform it into one?

    --
    The Tao of math: The numbers you can count are not the real numbers.