SoylentNews is people

posted by Fnord666 on Thursday April 05 2018, @08:27PM
from the digital-fingerprints dept.

Zero-width characters are invisible, ‘non-printing’ characters that are not displayed by the majority of applications. F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l? (Hint: paste the sentence into Diff Checker to see the locations of the characters!). These characters can be used to ‘fingerprint’ text for certain users.

Well, the original reason isn’t too exciting. A few years ago I was a member of a team that participated in competitive tournaments across a variety of video games. This team had a private message board, used to post important announcements amongst other things. Eventually these announcements would appear elsewhere on the web, posted to mock the team and, more significantly, making the message board useless for sharing confidential information and tactics.

The security of the site seemed pretty tight so the theory was that a logged-in user was simply copying the announcement and posting it elsewhere. I created a script that allowed the team to invisibly fingerprint each announcement with the username of the user it is being displayed to.
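A minimal sketch of how such a fingerprint could work, written in Python rather than the article's JavaScript. It is not the author's script: the bit mapping (U+200B for 0, U+200C for 1), the function names, and the sample announcement are all assumptions made for illustration.

```python
# Hypothetical sketch: hide a username in text as zero-width characters.
ZERO = "\u200b"  # ZERO WIDTH SPACE stands for bit 0 (assumed mapping)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER stands for bit 1 (assumed mapping)


def encode_username(username):
    """Turn the UTF-8 bytes of the username into a run of invisible characters."""
    bits = "".join(f"{byte:08b}" for byte in username.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def fingerprint(text, username):
    """Insert the invisible payload after the first character of the text."""
    if not text:
        return text
    return text[0] + encode_username(username) + text[1:]


def extract_username(text):
    """Recover the username from a copied text that still carries the payload."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def strip_fingerprint(text):
    """Remove the two marker characters, leaving only the visible text."""
    return "".join(ch for ch in text if ch not in (ZERO, ONE))


if __name__ == "__main__":
    marked = fingerprint("Practice moved to Friday.", "NewNic")
    print(extract_username(marked))  # NewNic
```

The payload survives copy and paste only as long as every step in between preserves the invisible characters, and removing them takes a single filter, as `strip_fingerprint` shows.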

I saw a lot of interest in zero-width characters from a recent post by Zach Aysan so I thought I’d publish this method here along with an interactive demo to share with everyone. The code examples have been updated to use modern JavaScript but the overall logic is the same.



 
This discussion has been archived. No new comments can be posted.
  • (Score: 2) by NewNic on Thursday April 05 2018, @08:39PM (14 children)

    by NewNic (6420) on Thursday April 05 2018, @08:39PM (#663093) Journal

    I tried using the webpage https://umpox.github.io/zero-width-detection/ [github.io] and it did not show my username. When I pasted it into Diff Checker, there were no zero width characters.

    I tried using Firefox and Chrome under Linux.

    --
    lib·er·tar·i·an·ism ˌlibərˈterēənizəm/ noun: Magical thinking that useful idiots mistake for serious political theory
  • (Score: 1, Funny) by Anonymous Coward on Thursday April 05 2018, @08:51PM

    by Anonymous Coward on Thursday April 05 2018, @08:51PM (#663100)

    You need to be wearing your captain crunch secret decoder ring.

  • (Score: 0) by Anonymous Coward on Thursday April 05 2018, @09:35PM

    by Anonymous Coward on Thursday April 05 2018, @09:35PM (#663116)

    Worked for me pasting into hexdump -C both in MSYS2/Windows and GNU/Linux.

  • (Score: 1) by speederaser on Thursday April 05 2018, @11:06PM (4 children)

    by speederaser (4049) on Thursday April 05 2018, @11:06PM (#663168)

    In Firefox and PaleMoon you can see them by right-clicking -> view page source. The inserted string shows as: "& # 8 2 0 3 ;" without the double quotes or the actual spaces inserted.

    • (Score: 2) by maxwell demon on Friday April 06 2018, @06:41AM (3 children)

      by maxwell demon (1608) on Friday April 06 2018, @06:41AM (#663301) Journal

      The inserted string shows as: "& # 8 2 0 3 ;" without the double quotes or the actual spaces inserted.

      So in other words, it shows as “​” — why not just write that?

      But then, this is not because of Firefox/Palemoon, but because that's what the site delivers to the browser. If the site had decided to deliver actual Unicode characters instead of HTML entities, then your browser would not show entities in the source.
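The entity and the raw character are two spellings of the same code point, which a few lines of Python can confirm. This is only an illustration using the standard `html` module; the variable names are made up here.

```python
import html

ENTITY = "&#8203;"  # the numeric character reference seen in the page source

decoded = html.unescape(ENTITY)        # what the browser renders
print(len(decoded))                    # 1
print(f"U+{ord(decoded):04X}")         # U+200B
print(decoded.encode("utf-8").hex())   # e2808b

# To show the entity literally in an HTML comment, escape its ampersand.
literal = html.escape(ENTITY)
print(literal)                         # &amp;#8203;
```

The bytes e2 80 8b are the same three bytes that od prints in octal as 342 200 213.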

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 1) by speederaser on Friday April 06 2018, @04:10PM (2 children)

        by speederaser (4049) on Friday April 06 2018, @04:10PM (#663466)

        So in other words, it shows as “​” — why not just write that?

        Because "preview" made it invisible when I did it that way, even when I used "Plain Old Text". Just like it appears in preview on this post.

        • (Score: 2) by maxwell demon on Friday April 06 2018, @04:53PM

          by maxwell demon (1608) on Friday April 06 2018, @04:53PM (#663482) Journal

          Hint: &amp;

          --
          The Tao of math: The numbers you can count are not the real numbers.
        • (Score: 2) by Osamabobama on Friday April 06 2018, @09:55PM

          by Osamabobama (5842) on Friday April 06 2018, @09:55PM (#663555)

          My favorite is when the html sorcery is correct, so preview looks fine, but the text-entry window version of the comment also gets changed, so hitting Submit (or Preview, again) will post something else.

          For instance, if you want to show <i>html tags</i> in the post, you format your comment with escape characters so the correct tag shows up in the preview. However, the comment text also strips out the escape characters, so pressing Submit will then post the comment with the tags interpreted, rather than displayed.

          Each cycle is slightly different:

          1. &lt;i&gt;html tags&lt;/i&gt;
          2. <i>html tags</i>
          3. html tags
          --
          Appended to the end of comments you post. Max: 120 chars.
  • (Score: 3, Interesting) by FatPhil on Friday April 06 2018, @06:29AM (5 children)

    by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Friday April 06 2018, @06:29AM (#663295) Homepage
    Well, firstly, that diffchecker webpage is retarded - it did absolutely nothing when I enabled JS for the not-obviously-spammy domains, and is clearly the wrong tool for the job. Anyone who thinks that the best way of analysing a stream of data for embedded invisible characters is by diffing it with something is using a hammer on a screw. The obvious tool for the job is of course od(1).

    $ echo -n 'F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l?' | od -c
    0000000 F 342 200 213 o r e x a m 342 200 213 p l
    0000020 e , I 342 200 231 v e i n s 342 200 213
    0000040 e r t e d 1 0 z e 342 200 213 r o
    0000060 - w i d t h s p a 342 200 213 c e s
    0000100 i n 342 200 213 t o t h i 342 200 213 s
    0000120 s e n t e n c e , c 342 200 213 a
    0000140 n y o u t e l 342 200 213 342 200 213 l
    0000160 ?
    0000161

    Or if you just want to count them:

    $ echo -n 'F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l?' | tr -d '[[:print:]]' | wc -c
    33

    All of which means the author can't count.

    However, you're misunderstanding him, he never claimed to have embedded your username in that sentence, only to have embedded some non-displaying characters.
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Interesting) by maxwell demon on Friday April 06 2018, @06:55AM (1 child)

      by maxwell demon (1608) on Friday April 06 2018, @06:55AM (#663309) Journal

      Actually, you can just look at it with less. Then you even get the Unicode code numbers in a readable form:

      F<U+200B>or exam<U+200B>ple, I’ve ins<U+200B>erted 10 ze<U+200B>ro-width spa
      <U+200B>ces in<U+200B>to thi<U+200B>s sentence, c<U+200B>an you tel<U+200B>
      <U+200B>l?

      (Note that the Unicode code points are shown inverted, so you can distinguish them from an ASCII character sequence of the same form).
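A rough Python equivalent of that display, for anyone without less at hand. It is a sketch and not how less decides what to escape: it assumes that marking every character in Unicode general category Cf (format characters, which includes U+200B) is good enough, and `reveal` is a made-up name.

```python
import unicodedata


def reveal(text):
    """Replace invisible format characters with a visible <U+XXXX> marker."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":  # format characters such as U+200B
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(reveal("F\u200bor exam\u200bple, I\u2019ve ins\u200berted"))
    # F<U+200B>or exam<U+200B>ple, I’ve ins<U+200B>erted
```

The typographic apostrophe (U+2019) is in category Pf, so it passes through unchanged.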

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 3, Interesting) by FatPhil on Friday April 06 2018, @08:00AM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Friday April 06 2018, @08:00AM (#663328) Homepage
        Good point. TMTOWTDI is good. But is this the Unix way? Personally, I don't believe that's less's job, it should be a pager with scrollback, and very little more - I don't even see a switch to turn it off, unless that's what -r is for, and in that case, it's terribly documented (non-ASCII utf-8 isn't control characters). And don't get me started on cat -t!

        At least less implemented the escaping functionality correctly, locale aware - when you unset LANG you'll get:
        F<E2><80><8B>or exam<E2><80><8B>ple, I<E2><80><99>ve ins<E2><80><8B>erted 10 ze<E2><80><8B>ro-width spa<E2><80><8B>ces in<E2><80><8B>to thi<E2><80><8B>s sentence, c<E2><80><8B>an you tel<E2><80><8B><E2><80><8B>l?

        Which turns unicode into moar garbage, which I think is fitting.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 2) by maxwell demon on Friday April 06 2018, @08:19AM (1 child)

      by maxwell demon (1608) on Friday April 06 2018, @08:19AM (#663330) Journal

      wc -c does not count characters, but bytes. With a multi-byte encoding (like UTF-8) the two are not the same. From the wc man page:

      DESCRIPTION
      [...]

             -c, --bytes
                    print the byte counts

             -m, --chars
                    print the character counts

      With the correct option, you get:

      $ echo -n 'F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l?' | tr -d '[[:print:]]' | wc -m
      11

      Well, it's still one too many, right? Well, no:

      $ echo 'F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l?' | tr -d '[[:print:]]'
      ’​​​​​​​​

      So tr considers that apostrophe as non-printable. It clearly is not a zero-width space, so there remain 10 zero-width spaces. Why is that? Well, let's look at it:

      $ echo -n '’' | xxd
      0000000: e280 99                                  ...

      This is actually the following Unicode character:

      U+2019 RIGHT SINGLE QUOTATION MARK
      UTF-8: 0xE2 0x80 0x99

      Conclusions:

      1. The author can count.
      2. You don't know your tools.
      3. tr doesn't correctly classify Unicode characters.
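The three numbers in this subthread (33, 11 and 10) can be reproduced in Python. This is a sketch: the sentence is rebuilt here with explicit escapes, and the `ord(ch) > 127` filter only roughly mirrors what `tr -d '[[:print:]]'` left behind, on the assumption that tr was working byte by byte and so treated every byte of a multi-byte character as non-printable.

```python
# The sample sentence, with each invisible character written as an escape.
SENTENCE = (
    "F\u200bor exam\u200bple, I\u2019ve ins\u200berted 10 ze\u200bro-width "
    "spa\u200bces in\u200bto thi\u200bs sentence, c\u200ban you tel\u200b\u200bl?"
)

# Everything outside ASCII: ten zero-width spaces plus one apostrophe.
non_ascii = [ch for ch in SENTENCE if ord(ch) > 127]

byte_count = sum(len(ch.encode("utf-8")) for ch in non_ascii)  # like wc -c
char_count = len(non_ascii)                                    # like wc -m
zwsp_count = SENTENCE.count("\u200b")

print(byte_count, char_count, zwsp_count)  # 33 11 10

# Python classifies by Unicode category, so U+2019 counts as printable
# and only the zero-width spaces are reported.
unprintable = [ch for ch in SENTENCE if not ch.isprintable()]
print(len(unprintable))  # 10
```

Each of the 11 characters takes three bytes in UTF-8, which is why 33 bytes and 11 characters describe the same data.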
      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by FatPhil on Friday April 06 2018, @08:49AM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Friday April 06 2018, @08:49AM (#663342) Homepage
        But:
        1) The author uses specific quotation marks as apostrophes. That's as big a mistake as miscounting would have been.
        2) Once I'd od'd it, which indeed is what I did first, I saw they were all 3-byte, so 33 bytes tells me exactly the same information as 11 characters. I would also have been interested in knowing about non-utf8 byte sequences, even if they would invalidate the stream (but Postel's law...)
        3) That's a weird one, I presume a standard library is used, and that should get things right (as the unicode consortium provide an explicit list of all the classes). Someone who gives a fuck about unicode should file a bug report. (So not me.)
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 0) by Anonymous Coward on Friday April 06 2018, @03:06PM

      by Anonymous Coward on Friday April 06 2018, @03:06PM (#663446)

      did anyone else think their username would show up?

      from an example from another site that was posted here that was described as a means of hiding spaces and not revealing usernames of special forum that used different but similar techniques to identify their leaker via controlled circumstances of logged in users of that site and not logged in users of some other site the author probably hasn't visited?

      i want to know if the writing is a bad example of instruction or if the expectation was not widespread

  • (Score: 2) by Rivenaleem on Friday April 06 2018, @01:02PM

    by Rivenaleem (3400) on Friday April 06 2018, @01:02PM (#663404)

    I tried using the webpage https://umpox.github.io/zero-width-detection/ [github.io] [github.io] and it did not show my username. When I pasted NewNic it into Diff Checker, there were no zero width characters.

    I tried using Firefox and Chrome under Linux.

    I dunno, it worked fine for me.