
SoylentNews is people

posted by chromas on Monday August 13 2018, @02:22PM

Wired is reporting on a presentation given at Def Con 26 by Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, entitled Even Anonymous Coders Leave Fingerprints. Stylistic expression is uniquely identifiable, not anonymous, and that includes code. There are privacy implications for many developers, because as few as 50 metrics are needed to distinguish one coder from another.

The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.
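
To illustrate the distinction between formatting and structure, here is a minimal sketch using Python's standard `ast` module. It is not the researchers' feature set or pipeline; the three snippets and the helper `node_type_counts` are made up for illustration. The first two snippets differ only in layout and produce the same node-type counts, while the third computes the same result with a different structure.

```python
import ast
from collections import Counter

# Hypothetical snippets: A and B share logic but differ in whitespace/layout.
SNIPPET_A = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
SNIPPET_B = "def total( xs ):\n\n  s=0\n  for x in xs: s+=x\n  return s\n"
# C returns the same value but is structured as a call to the built-in sum().
SNIPPET_C = "def total(xs):\n    return sum(xs)\n"


def node_type_counts(source):
    """Count AST node types; whitespace and line breaks never reach the tree."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


features_a = node_type_counts(SNIPPET_A)
features_b = node_type_counts(SNIPPET_B)
features_c = node_type_counts(SNIPPET_C)

print(features_a == features_b)  # layout differs, structure does not
print(features_a == features_c)  # structure differs
```

Node-type counts are only one crude structural feature; they stand in here for whatever richer tree features a real classifier would use.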


Original Submission

  • (Score: 2) by takyon on Monday August 13 2018, @02:47PM (5 children)

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday August 13 2018, @02:47PM (#720993) Journal

    Anonymous Coders Could be Identified Even from Compiled Code [soylentnews.org]

    Using the Internet will be your eventual death sentence.

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 2) by fyngyrz on Monday August 13 2018, @06:51PM (4 children)

      by fyngyrz (6567) on Monday August 13 2018, @06:51PM (#721096) Journal

      In other news, 99% of all C code positively identified by researchers as having been written by Kernighan & Ritchie.


      if (closedStyle) {
          braceForResults();
      }

      • (Score: 2) by legont on Tuesday August 14 2018, @01:13AM (3 children)

        by legont (4179) on Tuesday August 14 2018, @01:13AM (#721200)

        Hate it. Should be:

        if( closedStyle ) {
            braceForResults();
        }

        That's so I can search for " closedStyle" and find them all

        --
        "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
        • (Score: 2) by fyngyrz on Tuesday August 14 2018, @01:46AM (2 children)

          by fyngyrz (6567) on Tuesday August 14 2018, @01:46AM (#721203) Journal

          That's so I can search for " closedStyle" and find them all

          Sounds like you need a better editor. Or search algorithm. :)
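
One possible reading of the "better search" suggestion is to match the identifier on word boundaries, so that spacing inside the parentheses stops mattering. A minimal sketch follows; the sample lines and the helper `find_identifier` are hypothetical and not taken from anyone's editor.

```python
import re

# Hypothetical source lines covering both spacing styles from this thread,
# plus two near-misses that a plain substring search would also hit.
LINES = [
    "if (closedStyle) {",
    "if( closedStyle ) {",
    "bool unclosedStyle = false;",
    "closedStyleCount++;",
]


def find_identifier(name, lines):
    """Return the lines where `name` appears as a whole identifier."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [line for line in lines if pattern.search(line)]


matches = find_identifier("closedStyle", LINES)
print(matches)
```

The word-boundary match finds both `if` lines regardless of spacing and skips the two longer identifiers.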

          • (Score: 2) by legont on Tuesday August 14 2018, @03:59AM (1 child)

            by legont (4179) on Tuesday August 14 2018, @03:59AM (#721229)

            Nah, style got to support simplicity. If choosing between style and algorithm, algorithm is to die first.;)

            --
            "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
            • (Score: 2) by fyngyrz on Tuesday August 14 2018, @04:13PM

              by fyngyrz (6567) on Tuesday August 14 2018, @04:13PM (#721408) Journal

              The way I see it, "simple" means not having to type a certain way so your search will actually work. Then the algorithm supports whatever you do, rather than you supporting the algorithm.

              But what do I know. :)

  • (Score: 4, Interesting) by Hyperturtle on Monday August 13 2018, @02:55PM (1 child)

    by Hyperturtle (2824) on Monday August 13 2018, @02:55PM (#720996)

    As a network engineer, I have known this for years.

    For the devices with CLIs, and often, the ticketing systems in the organization with the hardware -- you can get a feel for who wrote what and what to expect when you go to look at that engineer's results.

    I expect it's no different with programming or technical writing or master's thesis statements. It also means that it becomes easy to tell who ran the wizard, copied from the internet, or shamelessly took someone else's work and used it as their own without so much as removing inadvertent metadata because the copier didn't understand what was going on.

    This presents good and bad things to any individual -- if you are a fraud, it is easier to spot without necessarily having those checking really understand the work. And if you are not a fraud, you are more easily identified because of it.

    And if you use the wizard and then copy the wizard's configs to make it look like you know what you are doing, that is also easy to spot... it's like that kids' song "one of these things is not like the others, one of these things is not the same..." who can spot the generic wizard auto-script hidden in the 'custom configuration'?

    That's another good way to identify who claims to be an expert but isn't, or who leverages tools available to them without unnecessarily reinventing the wheel.

    (and for those of us saving time and money, try to remember to remove the references to example.com before you blame the network... some network engineers CAN sniff the traffic and see that it doesn't work because the default domains in the example were not changed to reflect the business requirements!!)

    Anyway the takeaway for me is that it's always been possible to determine who is writing what--given enough time and examples. Eventually, their style, or lack of one--comes through. This helps immensely in determining who is really writing their code (as opposed to everyone's favorite that outsourced his job and is just collecting the checks), who is struggling, who's a wizard and who's not, etc...

    If this is alarming, then try to take the proper opsec to make sure you are harder to identify. Soylent also makes a great practice ground for your opsec training... Given enough examples, I am sure we can identify anyone that writes prolifically and then tries to pass as an anonymous coward... try to find the hidden Hyperturtle or Runaway or whoever! (Not that we would ever do that...)

     

    • (Score: 4, Insightful) by Runaway1956 on Monday August 13 2018, @03:34PM

      by Runaway1956 (2926) Subscriber Badge on Monday August 13 2018, @03:34PM (#721009) Journal

      I think there is a nuance here, that you may or may not be seeing. Sure, an expert in any given field can spot subtle differences between the work of his peers, or the work of his subordinates. He has the knowledge and experience, he can perform whatever task at hand in a dozen different ways. He KNOWS his field, and can quickly come to know the people in his field.

      These people seem to be promising a new tool to managers and law enforcement, that will enable non-experts to determine who has done what, and how they did it. Plug and play script kiddies using an AI to figure out who the "good guy" and the "bad guy" is.

      When you tell your supervisor that "Bob didn't write this, it's over his head, and none of the writing matches his work", that is treated like an opinion, and weighted according to a purely subjective point of view. If the computer tells them the same thing, well, "THAT'S SCIENCE!!" Expect to see this introduced into a court of law as evidence, one day soon. Even before that, expect to see it in the hands of an HR drone, justifying one person getting a raise, and another person being fired.

      On a sidenote - my handwriting is very distinctive. It's ugly, it's large, and I write with the same brand, style, and size of bold black pen all the time. To boot, I sign or initial pretty much every piece of paper I touch. Recently, one of the managers who sees my handwriting all the time and should know better, accused first me, then a couple other people of hanging a red tag on a piece of equipment. No signature, written with a cheap blue pen, in small, precise cursive letters. It almost, but didn't quite, convince me that it was a woman's handwriting.

      Oftentimes, the very people who should know, have the fewest clues to work with. My immediate supervisor had to tell the wannabe-manager that none of the people he accused could have done it. One of the persons accused is not even literate in English!! (From all accounts, he's very good in Spanish, but I couldn't verify that with my limited vocabulary!)

  • (Score: 1, Insightful) by Anonymous Coward on Monday August 13 2018, @03:15PM (2 children)

    by Anonymous Coward on Monday August 13 2018, @03:15PM (#721003)

    Copy all your code from StackOverflow.

    • (Score: 2) by takyon on Monday August 13 2018, @03:38PM

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday August 13 2018, @03:38PM (#721011) Journal

      Parent AC gave the only workable solution.

      Thank G_d for StackOverflow! Suck it, MDC! [soylentnews.org]

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 0) by Anonymous Coward on Tuesday August 14 2018, @02:36PM

      by Anonymous Coward on Tuesday August 14 2018, @02:36PM (#721376)

      I thought the solution would be to wear rubber gloves while coding. :-)

  • (Score: 0) by Anonymous Coward on Monday August 13 2018, @03:31PM (2 children)

    by Anonymous Coward on Monday August 13 2018, @03:31PM (#721008)

    It's a catchy phrase, but fingerprints are obviously more uniquely identifiable. If you want to go all CSI/Forensics, a much closer analogy is tool marks... or bullet ballistics.

    • (Score: 2) by Runaway1956 on Monday August 13 2018, @03:35PM (1 child)

      by Runaway1956 (2926) Subscriber Badge on Monday August 13 2018, @03:35PM (#721010) Journal

      Go easy on that ballistics nonsense. You'll end up triggering someone!

      • (Score: 0) by Anonymous Coward on Monday August 13 2018, @03:46PM

        by Anonymous Coward on Monday August 13 2018, @03:46PM (#721015)

        Don't worry you get triggered enough for everyone.

        *STAND DOWN SJW HIT SQUAD!*

  • (Score: 2) by ikanreed on Monday August 13 2018, @04:02PM (6 children)

    by ikanreed (3164) Subscriber Badge on Monday August 13 2018, @04:02PM (#721021) Journal

    1. Go to random subroutines, and put a comment at the top consisting of this text: //Who the fuck writes this garbage? I would never have done anything this fucking stupid
    2. Include several if(!condition){//handle this later I'm sure it won't come up}
    3. Follow zero project-wide indentation and code-style rules.

    • (Score: 2) by takyon on Monday August 13 2018, @04:34PM (3 children)

      by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday August 13 2018, @04:34PM (#721032) Journal

      All or part of that could make your code easier to identify.

      Some real solutions are to copy or "steal" code, have other parts of your code written, tidied, or obfuscated by computers (not you) if possible, don't share code if you can't take the previous steps, or never post code that can be linked to your real name or identity, so that your code (written however you like it) can only be linked from one Anon (you) to another (also you).

      --
      [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 4, Funny) by ikanreed on Monday August 13 2018, @04:42PM (2 children)

        by ikanreed (3164) Subscriber Badge on Monday August 13 2018, @04:42PM (#721036) Journal

        I was trying to joke about what it seems like every coder does. I knew when I was posting it it was a kinda limp joke. Didn't realize it was so flaccid as to be unrecognizable.

        • (Score: 2) by takyon on Monday August 13 2018, @04:59PM (1 child)

          by takyon (881) <takyonNO@SPAMsoylentnews.org> on Monday August 13 2018, @04:59PM (#721042) Journal

          The problem is that somebody is going to end up reading this [ic.ac.uk] and consider it a legit strategy for writing anonymous code.

          --
          [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
          • (Score: 0) by Anonymous Coward on Tuesday August 14 2018, @02:47PM

            by Anonymous Coward on Tuesday August 14 2018, @02:47PM (#721383)

            omg I love that!!!

    • (Score: 0) by Anonymous Coward on Monday August 13 2018, @05:57PM

      by Anonymous Coward on Monday August 13 2018, @05:57PM (#721070)

      Those things don't go into the compiled code.

    • (Score: 0) by Anonymous Coward on Tuesday August 14 2018, @05:57PM

      by Anonymous Coward on Tuesday August 14 2018, @05:57PM (#721451)

      That's not fair, I at least put in:

      if (badcondition) { throw new Exception("How did that happen? "); }

  • (Score: 2) by looorg on Monday August 13 2018, @04:31PM

    by looorg (578) on Monday August 13 2018, @04:31PM (#721029)

    Not to sound all grumpy, but text analysis has come to source code ... who could have guessed. Nothing said it had to be "written" text as words and/or sentences. People putting any word to paper (or screen) apply themselves somehow to their work, no matter if it's written text or source code.

  • (Score: 2) by Snotnose on Monday August 13 2018, @05:08PM (3 children)

    by Snotnose (1623) on Monday August 13 2018, @05:08PM (#721047)

    As a C programmer I used ? quite a bit. Well, not quite a bit but probably 2-3 times more often than the next guy. Some people hate it. "It's too complicated, I don't understand it". Not my problem, learn the language.

    Now I'm doing Java and OO. There are a lot of subtleties in those libraries, and as I figure them out my code changes, sometimes radically. I doubt you could match last year's Java (C with Java syntax) with today's Java (Java with OO as a baseline) to identify me as the author.

    --
    When the dust settled America realized it was saved by a porn star.
    • (Score: 2) by maxwell demon on Monday August 13 2018, @06:34PM

      by maxwell demon (1608) on Monday August 13 2018, @06:34PM (#721084) Journal

      So in other words, as long as you are not new to the language it's easy to identify you as the author: It's when nobody else understands your code. ;-)

      --
      The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 0) by Anonymous Coward on Monday August 13 2018, @06:57PM

      by Anonymous Coward on Monday August 13 2018, @06:57PM (#721100)

      I once worked on a contract where we had to add features to an existing C coded program, but we were explicitly prohibited from modifying any existing code unless absolutely necessary to the new features. The existing code had been written by two people years ago -- one a very experienced C programmer, and the other a history major graduate who was still learning to code. (don't ask; nobody could explain that one to me, either)

      Who wrote which code, given the extremes of experience, was so clear that it was funny. You could even tell when the history guy wrote which part, as his learning curve was evident. The hardest part of the project was keeping myself from cleaning up his code. The client was adamant about that though, so... *shrug*

      I can't imagine being able to automate the detection of something like that, to be honest. But then there's a lot of things I don't know about.

      Oh, and ? is absolutely a perfectly cromulent operator, indeed. Personally I only used it when the operands were pretty simple, though. No need to deliberately obfuscate code for the next person working on it -- and plenty of times, the next person working on it is you, long after you forgot what the heck it was you were doing.

    • (Score: 2) by KritonK on Monday August 13 2018, @08:17PM

      by KritonK (465) on Monday August 13 2018, @08:17PM (#721121)

      I thought I'd mention that ? works in Java, as well. The , operator works as well, at least in for statements.

  • (Score: 2) by stretch611 on Monday August 13 2018, @06:49PM

    by stretch611 (6199) on Monday August 13 2018, @06:49PM (#721094)

    If you are in a small company or team environment, this will be obvious. People on your team will pick up on your habits in coding style. The more obvious the style, the sooner and easier it will be to pick up. However, the people that work with you will learn far quicker about your ability based on how you interact with them, what questions you ask, what suggestions and ideas you offer, and how many times they see you browsing over to copy code from various public websites. And even if you work remotely, they will learn how smart you are at coding and how often you google for code... After all, a person who is clueless in every meeting is not going to write good code without leaning on the rest of the team and asking for constant help.

    So in a non-anonymous team environment you will not be able to hide your style or lack thereof.

    When you write code, you do tend to write it based on previous experience; the more you did something in the past, the more likely you are to do it in the future. It affects all aspects of programming. Do you write functions 125 lines long, or do you create smaller functions no larger than one screen of code? How and where do you declare variables? Do you abuse globals? Do you always set a default or never, and are your defaults zero, empty strings, nulls, or only initial values? Do you actually write comments, and are they actually useful? How about variable names? Two-letter variables or full words? Do you use camelCase, lack all capitalization, use underscores between words, or only capitalize the first letter? Even how you organize functions into libraries can be a sign of distinction between coders, and so can using included source code.
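
As a rough illustration of how one of these habits, identifier naming, could be turned into something countable, here is a minimal sketch. The style patterns are an incomplete, illustrative set, and the names `STYLES`, `classify`, and `style_profile` are hypothetical; a real tool would extract identifiers with a parser rather than take them from a hand-written list.

```python
import re
from collections import Counter

# Regexes for a few common identifier styles (illustrative, not exhaustive).
STYLES = {
    "camelCase": re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$"),
    "snake_case": re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$"),
    "PascalCase": re.compile(r"^(?:[A-Z][a-z0-9]+)+$"),
    "lowercase": re.compile(r"^[a-z][a-z0-9]*$"),
}


def classify(identifier):
    """Return the first style whose pattern matches, or 'other'."""
    for style, pattern in STYLES.items():
        if pattern.match(identifier):
            return style
    return "other"


def style_profile(identifiers):
    """Tally naming styles across a collection of identifiers."""
    return Counter(classify(name) for name in identifiers)


# Hand-picked sample identifiers, some borrowed from snippets in this thread.
sample = ["closedStyle", "braceForResults", "bad_condition", "tmp", "i", "HttpClient"]
profile = style_profile(sample)
print(profile)
```

A profile like this is only one weak signal; the point is that each habit in the list above can, in principle, be reduced to a count and compared across samples.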

    But, here is the real problem with identifying code... If you are in a company, your team will likely be able to determine it was you with very little effort at all. If you are on a huge public project on the internet, people probably will not spend the time and effort to look at your contributions... especially if they work. (of course if your stuff causes problems constantly, other people will be constantly looking at it to figure out how to fix it and your sloppy crap will be easy to spot) If you truly keep it anonymous, the cost to trace source code back to you based on full analysis of your coding habits will rarely be worth the effort. If it is worth the effort, any idiot should realize that source control requiring non-anonymous logins should be a requirement on the project to begin with.

    --
    Now with 5 covid vaccine shots/boosters altering my DNA :P
  • (Score: 2) by Bot on Monday August 13 2018, @07:20PM

    by Bot (3902) on Monday August 13 2018, @07:20PM (#721109) Journal

    - see here, the cracker worked all night, had installed a firmware backdoor, had booted the cracked workstation onto the corporate network, had deployed his payload. The change was of course logged so we could give a look at it. The code was obfuscated, and we have a hunch that it was compiled, then reverse engineered and recompiled on a different tool, all of it of course with aggressive optimization flags. A real mess.
    - so he got away with it?
    - no we got an ID at the police station by noon.
    - wow, did they see through all that obfuscation?
    - kinda. The guy left his fingerprints on the keyboard.

    --
    Account abandoned.
  • (Score: 2) by pipedwho on Monday August 13 2018, @09:49PM

    by pipedwho (2032) on Monday August 13 2018, @09:49PM (#721136)

    The problem with this sort of technique is that the reliability of the match drops quickly as the search space grows relative to the number and quality of markers being used.

    So comparing a sample set of 100 coders may yield excellent results at 99% accuracy, while the match at 10000 coders is likely to result in 100 matches that are indistinguishable from each other with any semblance of probability. Increasing the search space makes this worse.

    And that assumes you have a reliable sample set to use as a reference. With online proliferation of information copy/paste and reference material/examples, the search space cannot be easily categorised in the same way DNA can be used to narrow down the search to family members cross referenced in other ways. Additionally, at higher search quantities the reliability drops to a point that a malfeasant intentionally doing a few things they normally avoid doing would likely skew them out of the match, or require the matching algorithms to be even less accurate (and therefore harvesting an even larger set of false positives to wade through).
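
The 100-versus-10,000 arithmetic above can be sketched as a base-rate calculation. This is a simplification under stated assumptions, not a model of any real classifier: each wrong candidate is assumed to be falsely flagged independently with probability 1%, the true author is assumed to be in the pool and always flagged, and the function name `expected_false_matches` is hypothetical.

```python
def expected_false_matches(candidates, false_positive_rate):
    """Expected number of wrong candidates that also 'match' a code sample."""
    return (candidates - 1) * false_positive_rate


for pool in (100, 10_000, 1_000_000):
    wrong = expected_false_matches(pool, 0.01)
    # Chance that a flagged candidate is the true author, assuming the true
    # author is in the pool and is always flagged (an assumption).
    precision = 1 / (1 + wrong)
    print(f"{pool:>9} candidates: ~{wrong:.0f} false matches, precision {precision:.1%}")
```

Under these assumptions a pool of 10,000 yields roughly 100 false matches alongside the real author, which is the situation described above, and the count grows linearly with the pool.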
