SoylentNews is people

posted by janrinok on Tuesday June 24 2014, @05:15PM
from the never-underestimate-the-power-of-a-geek dept.

In the latest gaffe to demonstrate the privacy perils of anonymized data, New York City officials have inadvertently revealed the detailed comings and goings of individual taxi drivers over more than 173 million trips. City officials released the data in response to a public records request and specifically obscured the drivers' hack license numbers and medallion numbers. Rather than including those numbers in plaintext, the 20 gigabyte file contained one-way cryptographic hashes using the MD5 algorithm.

It turns out there's a significant flaw in the approach. Because both the medallion and hack numbers are structured in predictable patterns, it was trivial to run all possible iterations through the same MD5 algorithm and then compare the output to the data contained in the 20GB file. Software developer Vijay Pandurangan did just that and in less than two hours he had completely de-anonymized all 173 million entries.
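
The attack described above can be sketched in a few lines. This is a minimal illustration, not the code Pandurangan used: it assumes a single medallion pattern of digit, letter, digit, digit (inferred only from the "9Y99" example quoted in the comments), and the names `md5_hex` and `build_lookup` are hypothetical.

```python
import hashlib
import itertools
import string


def md5_hex(text):
    """Return the hex MD5 digest of an ASCII identifier."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every identifier matching the ASSUMED pattern digit-letter-digit-digit.

    The real medallion and hack number formats may differ; the point is only
    that a small, structured input space can be enumerated exhaustively.
    """
    table = {}
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table


# Build the table once, then reverse any unsalted hash with a dict lookup.
lookup = build_lookup()
hashed = md5_hex("9Y99")      # stands in for a value found in the released file
recovered = lookup[hashed]    # maps the hash back to the original identifier
```

Under this assumed pattern there are only 10 × 26 × 10 × 10 = 26,000 candidates, so the whole table is built almost instantly. Larger numeric ranges take longer but remain small by hashing standards.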

Looks like a low salt diet isn't all it's cracked up to be!

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Interesting) by VLM on Tuesday June 24 2014, @05:38PM

    by VLM (445) on Tuesday June 24 2014, @05:38PM (#59498)

    Not living there,

    "medallion number 9Y99 or hack number 5296319,"

    still look pretty anonymous to me.

    There may be a website where you can look up medallions and hacks. No idea what a hack is, but I think a medallion is a limited issue license to operate a cab. The cabbies want no competition to keep fares high, so they purchased a limitation to competition from the politicians.

    • (Score: 0) by Anonymous Coward on Tuesday June 24 2014, @05:52PM

      by Anonymous Coward on Tuesday June 24 2014, @05:52PM (#59504)

      A hack is a driver.

      Looks like nobody made a salt table to really make it anonymous, with real random numbers instead of MD5 hashes computed from known data.

      An MD5 hash can look like a random number, but it is not one, especially if you use a known input and no salt.

      I didn't look, but there is probably a driver -> medallion number database out there. That way people could complain...
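
One way to read the "salt table with real random numbers" idea is a private mapping from each real identifier to a random token. The sketch below is an assumption about what such a table could look like, not something the city used; `pseudonymize` and the sample values are hypothetical.

```python
import secrets


def pseudonymize(identifiers):
    """Map each distinct identifier to a random 128-bit token.

    The mapping is the secret: tokens are not derived from the input, so
    they cannot be recomputed by enumerating possible identifiers.
    """
    mapping = {}
    for ident in identifiers:
        if ident not in mapping:
            mapping[ident] = secrets.token_hex(16)
    return mapping


# Hypothetical column of identifiers as they might appear in trip records.
trips = ["9Y99", "5296319", "9Y99"]
table = pseudonymize(trips)            # kept private by the data publisher
anonymized = [table[t] for t in trips]  # what would be released instead
```

The same identifier still maps to the same token, so trips by one cab remain linkable to each other. That is why the passenger-privacy concerns raised elsewhere in this thread still apply. Note also that a single published salt would not help much here, because the input space is small enough to enumerate again with the salt included.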

      • (Score: 2) by frojack on Tuesday June 24 2014, @06:53PM

        by frojack (1554) on Tuesday June 24 2014, @06:53PM (#59534) Journal

        There is also the problem of encryption (by whatever means) of fixed-length fields, where the length and structure of the field are known. This case is a prime example of this.

        This is a common and well known problem.

        The story seems more concerned about protecting the drivers than the paying passengers.

        --
        No, you are mistaken. I've always had this sig.
    • (Score: 1, Interesting) by Anonymous Coward on Tuesday June 24 2014, @05:58PM

      by Anonymous Coward on Tuesday June 24 2014, @05:58PM (#59507)

      Let me read the fine article for you: There's a ton of resources on NYC Taxi and Limousine commission, including a mapping from licence number to driver name, and a way to look up owners of medallions.

      That's from the article on Medium linked to in the Ars article. I guess it was omitted to protect the innocent and to get people to focus on what really matters.

  • (Score: 2, Interesting) by Anonymous Coward on Tuesday June 24 2014, @06:24PM

    by Anonymous Coward on Tuesday June 24 2014, @06:24PM (#59519)

    The privacy of the taxi drivers is more of an operational thing - they aren't likely to be doing much of anything while they are working but work. I'm not saying they don't deserve privacy, I'm just saying that in the scheme of things taxi driver opsec is much less of an issue than the privacy of the passengers. And even if perfectly anonymized, this data can tell you a whole lot about passengers when cross-referenced with other "meta-data" like phone, credit-card, rental, employment, etc. records.

    For example, daily taxi rides from an apartment building to an office building indicate that someone who lives in the building works at that office. Then there is a trip from the apartment building to an abortion clinic followed by a week of no taxi rides to that office building. If you know which of your neighbors work at that office because you saw their paystub in the mail or made small talk about work in the lobby, you now know they probably had an abortion too.

  • (Score: 2) by FatPhil on Tuesday June 24 2014, @06:38PM

    by FatPhil (863) <pc-soylentNO@SPAMasdf.fi> on Tuesday June 24 2014, @06:38PM (#59526) Homepage
    Rather than using MD5, they could have used this function:

    Blank(n) = ""

    Why did they feel the need to give out identifying information in the first place, in any form?
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 2, Interesting) by Anonymous Coward on Tuesday June 24 2014, @06:49PM

    by Anonymous Coward on Tuesday June 24 2014, @06:49PM (#59532)

    Why do we assume it's a mistake? Because they said so? Now agencies have free rein to track every cab, since it's all public they need no warrant, etc., to find out where cab X was on day Y at time Z. Sounds like someone's job just got a hell of a lot easier thanks to this "mistake".

  • (Score: 0) by Anonymous Coward on Tuesday June 24 2014, @06:52PM

    by Anonymous Coward on Tuesday June 24 2014, @06:52PM (#59533)

    Or rather, how many people were fired?

    • (Score: 2, Funny) by Horse With Stripes on Tuesday June 24 2014, @10:16PM

      by Horse With Stripes (577) on Tuesday June 24 2014, @10:16PM (#59596)

      We're not sure who they were because their SSNs and Employee IDs were hashed via MD5 and there's no possible way to ever break that type of super anonymization. And I mean never.

      • (Score: 0) by Anonymous Coward on Wednesday June 25 2014, @01:13PM

        by Anonymous Coward on Wednesday June 25 2014, @01:13PM (#59845)

        No, the people who were tasked with encrypting SSNs and employee IDs knew that doing an MD5 hash is not secure. That's why they resorted to triple-rot13 instead. No one will crack that.

  • (Score: 4, Insightful) by kaganar on Tuesday June 24 2014, @07:03PM

    by kaganar (605) on Tuesday June 24 2014, @07:03PM (#59538)

    I'm a "Software Engineer" and I'm the first to admit that software engineering isn't an engineering discipline. If a structural engineer designs a bridge that fails catastrophically and he's determined to be the cause he'll likely lose his license. If a software engineer designs software that fails catastrophically, well... what license?

    • (Score: 0) by Anonymous Coward on Tuesday June 24 2014, @07:23PM

      by Anonymous Coward on Tuesday June 24 2014, @07:23PM (#59544)

      That's why software was remarketed as a "service", to avoid fines. Bad products must be recalled and companies that produce them can be fined. Bad services do not need to be recalled and are much harder to fine (due to flukes in legal lingo). That's why "software as a service" took off, and is still taking off: so corporations can avoid existing laws on creating bad products.

    • (Score: 4, Insightful) by Anonymous Coward on Tuesday June 24 2014, @07:27PM

      by Anonymous Coward on Tuesday June 24 2014, @07:27PM (#59545)

      I'm the first to admit that software engineering isn't an engineering discipline.

      It is an engineering discipline. You are wrong. What is an engineer? Here's wikipedia's definition, which seems appropriate:

      An engineer is a professional practitioner of engineering, concerned with applying scientific knowledge, mathematics, and ingenuity to develop solutions for technical, societal and commercial problems. Engineers design materials, structures, and systems while considering the limitations imposed by practicality, regulation, safety, and cost.

      That's exactly what software engineers do (I hope!).

      Just because a software engineer can get away building crap while a structural engineer cannot, does not mean the software engineer is not an engineer. It just means they're held to different standards.

    • (Score: 1) by crAckZ on Tuesday June 24 2014, @07:48PM

      by crAckZ (3501) on Tuesday June 24 2014, @07:48PM (#59556) Journal

      They can't do that....who would give us our next version of Windows?

    • (Score: 3, Insightful) by Hairyfeet on Wednesday June 25 2014, @06:01AM

      by Hairyfeet (75) <{bassbeast1968} {at} {gmail.com}> on Wednesday June 25 2014, @06:01AM (#59703) Journal

      You can't hold software engineers to the same standard because there are too many variables. A bridge engineer KNOWS what strength the steel is, what amount of stress concrete can hold, etc., so if they follow the plans you'll come out with the exact same bridge every time. With computers you are dealing with everything from circuits made at the nm scale (ever see how much errata the average CPU has?) to cosmic rays being able to turn 1s into 0s; it's just too unpredictable.

      I mean sure you have real time systems designed to not have any variation...and those typically are using chips about as powerful as the 386, or the 486 if you are lucky (because simpler designs leave less to go wrong), and are using an RTOS that is only doing a single task at a time and is VERY simplistic compared to what we consider a modern OS.

      To hold software engineering to the same standard would mean sending the entire computing landscape back to the late 1980s and I really can't see too many people giving up their nice Windows desktops and iPads to go back to late DOS era computing. If you wanna see what we'd be dealing with look at instances where it just can't fail, like air traffic control and military weapon systems, and you'll see that compared to a modern multitasking OS and software it's VERY primitive, locked down, and single task heavy. Personally I like being able to type this while converting a video and burning a DVD, thanks ever so.

      --
      ACs are never seen so don't bother. Always ready to show SJWs for the racists they are.
  • (Score: 2) by Adrian Harvey on Tuesday June 24 2014, @09:34PM

    by Adrian Harvey (222) on Tuesday June 24 2014, @09:34PM (#59586)

    I'm not sure that this data should be public at all! Yeah, I know, too late now... But with the GPS co-ordinates of all pick-ups and drop-offs, it wouldn't be too hard to de-anonymise huge swathes of the rest of the data too, and to track individuals. You could see who travels when, and to where. It might be ok if you truncated the GPS values such that they covered a few blocks, but even so, maintaining privacy for individual travellers will get harder over time as analysis techniques get better...
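
The truncation idea mentioned above could look something like the sketch below. The cell size of 0.005 degrees (roughly half a kilometre of latitude) is an arbitrary assumption chosen for illustration, and `coarsen` is a hypothetical helper, not part of the released dataset or any tool used on it.

```python
import math

# ASSUMED grid size in degrees; pick whatever "a few blocks" means locally.
CELL_DEG = 0.005


def coarsen(lat, lon, cell=CELL_DEG):
    """Snap a coordinate to the south-west corner of its grid cell.

    Every point inside the same cell is reported as the same location,
    so the exact pick-up or drop-off address is no longer recoverable.
    """
    snapped_lat = round(math.floor(lat / cell) * cell, 6)
    snapped_lon = round(math.floor(lon / cell) * cell, 6)
    return (snapped_lat, snapped_lon)


# Two hypothetical nearby points end up in the same cell.
a = coarsen(40.7484, -73.9857)
b = coarsen(40.7480, -73.9860)
same_cell = a == b
```

Coarsening trades precision for privacy: larger cells hide more, but they also make the data less useful for analysis, and sparse areas may still need bigger cells than dense ones.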

    • (Score: 3, Interesting) by MrGuy on Tuesday June 24 2014, @11:47PM

      by MrGuy (1007) on Tuesday June 24 2014, @11:47PM (#59620)

      with the GPS co-ordinates of all pick-ups and drop-offs, it wouldn't be too hard to de-anonymise huge swathes of the rest of the data too, and to track individuals

      In a large number of places, this would be true. But we're talking about New York City, the most densely populated city in the country. Even if we could assume that everyone who hailed a cab lived near one end of the cab journey or the other (a strong assumption), and even if you could resolve the GPS coordinates to an absolutely precise location, you'd still have dozens (at least) of people who could plausibly live near that location (especially when you factor in people who hail cabs walking to the corner to have a better chance to snag one).

      It would be VERY hard to de-anonymize "huge swathes" of the rest of the data. Or, at least, to do so in a meaningful way.

      Could you come to the conclusion "someone who lives within a short distance of this place probably works in one of these buildings?" Sure. Could you get from there to "this trip was probably taken by person X, who lives near point A and works in point B?" Maybe (though you'd need to have an awful lot of "who works where precisely" data first, which is a dataset I'm way more concerned about than this).

      But even if you did, you're bootstrapping something you ALREADY KNOW (where a person lives and works) to prove something else you ALREADY KNOW (that person travels from home to work) - maybe, maybe I can suspect that these 5 cab trips from point A to point B were in fact taken by John, who lives at point A and works at point B. But so what? What insight are you gaining by retroactively figuring out who the traveler was? At best, maybe you learn when John leaves for work in the morning.

      What you CAN'T do is extend what you know (lives in A, works in B) to track any OTHER trips John takes, which is the usual fear about someone "de-anonymizing" data. You can't leverage "I've identified it was probably John who took this trip from A to B" to ALSO know "it was that same John who traveled from C to D." You CAN do that with the cabs (the same cab that took John to work later picked up someone who was probably Sally going home from work). But you don't get some "signature" about John you can use to track JOHN around the city.

      What exactly is the thing you're afraid could be "de-anonymized" in "huge swathes" of this data?

      • (Score: 2) by Adrian Harvey on Wednesday June 25 2014, @11:38AM

        by Adrian Harvey (222) on Wednesday June 25 2014, @11:38AM (#59794)

        I guess I had assumed that the taxi system from which this data came covered an area much larger than just the high-rises. I agree that if the pickup/drop off is near an apartment block then that pretty much covers my requirements for privacy. Not being overly familiar with New York however, I can't even tell from visualisations of this data (such as this one [vice.com] ) whether the spread of drop-off locations extends out of high rise territory or not. Now that I've read a bit more I might be concerned about data from the green taxis instead ;-). I understand they focus on the areas the yellow cabs don't...