posted by Fnord666 on Thursday April 05 2018, @08:27PM
from the digital-fingerprints dept.

Zero-width characters are invisible, ‘non-printing’ characters that are not displayed by the majority of applications. F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l? (Hint: paste the sentence into Diff Checker to see the locations of the characters!). These characters can be used to ‘fingerprint’ text for certain users.

Well, the original reason isn’t too exciting. A few years ago I was a member of a team that participated in competitive tournaments across a variety of video games. This team had a private message board, used to post important announcements amongst other things. Eventually these announcements would appear elsewhere on the web, posted to mock the team and, more significantly, rendering the message board useless for sharing confidential information and tactics.

The security of the site seemed pretty tight, so the theory was that a logged-in user was simply copying the announcement and posting it elsewhere. I created a script that allowed the team to invisibly fingerprint each announcement with the username of the user it was being displayed to.

I saw a lot of interest in zero-width characters from a recent post by Zach Aysan, so I thought I’d publish this method here, along with an interactive demo, to share with everyone. The code examples have been updated to use modern JavaScript, but the overall logic is the same.
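
The gist of the technique, as a minimal sketch in modern JavaScript (not the original script; the helper names are mine, and it assumes plain ASCII usernames): each character of the username becomes eight bits, each bit becomes a zero-width space (0) or a zero-width non-joiner (1), and the resulting invisible string is spliced into the announcement text.

    const ZWSP = '\u200b'; // zero-width space      -> bit 0
    const ZWNJ = '\u200c'; // zero-width non-joiner -> bit 1

    // Encode an ASCII username as a run of invisible characters.
    function encodeUsername(username) {
      return [...username]
        .map(ch => ch.codePointAt(0).toString(2).padStart(8, '0'))
        .join('')
        .replace(/0/g, ZWSP)
        .replace(/1/g, ZWNJ);
    }

    // Recover the username from a fingerprinted copy of the text.
    function decodeUsername(text) {
      const bits = [...text]
        .filter(ch => ch === ZWSP || ch === ZWNJ)
        .map(ch => (ch === ZWNJ ? '1' : '0'))
        .join('');
      return (bits.match(/.{8}/g) || [])
        .map(byte => String.fromCodePoint(parseInt(byte, 2)))
        .join('');
    }

    // Tuck the fingerprint in after the first word of the announcement.
    function fingerprint(announcement, username) {
      const [first, ...rest] = announcement.split(' ');
      return [first + encodeUsername(username), ...rest].join(' ');
    }

For example, fingerprint('Scrims start at 8pm tonight', 'alice') looks unchanged on screen, but running the leaked copy through decodeUsername() returns 'alice'.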


  • (Score: 3, Interesting) by coolgopher on Thursday April 05 2018, @11:24PM (9 children)

    by coolgopher (1157) on Thursday April 05 2018, @11:24PM (#663173)

    Okay, so we all know that Unicode went off the deep end quite some time ago, but I'm left scratching my head as to why any script would need a zero-width character for *anything*?
    I could've understood it if these were leftovers of ASCII control characters, but the zero-width space used here is code point 0x200b. So why? Other than watermarking purposes?

  • (Score: 4, Informative) by takyon on Thursday April 05 2018, @11:50PM (8 children)

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Thursday April 05 2018, @11:50PM (#663184) Journal

    The Zero Width Joiner [emojipedia.org] is used to make emoji combos [emojipedia.org] like the pirate flag [emojipedia.org] or Woman Scientist: Medium-Dark Skin Tone [emojipedia.org]. Zero Width Space [wikipedia.org] can be used to break up long character sequences (for word wrap) without actually adding spaces.
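
    For example, the pirate flag is literally just a black flag joined to a skull and crossbones (rough sketch, code points from memory):

        // Black flag + Zero Width Joiner + skull and crossbones (+ emoji variation selector)
        const pirateFlag = '\u{1F3F4}\u200D\u2620\uFE0F';
        console.log([...pirateFlag].map(c => c.codePointAt(0).toString(16)));
        // -> [ '1f3f4', '200d', '2620', 'fe0f' ]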

    The Zero Width Non-Joiner [wikipedia.org] is used to break up the default behavior of some languages, like Arabic.

    If there are more, I'm not sure what they're for.

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 4, Insightful) by coolgopher on Friday April 06 2018, @02:45AM (7 children)

      by coolgopher (1157) on Friday April 06 2018, @02:45AM (#663248)

      So once again we've managed to mix in *presentation* information into our *content* information.

      I guess that's the nature of the beast? Even the paragraph break here is as much presentation as content. It still gives me a gut-feeling of bad design though.

      • (Score: 2) by FatPhil on Friday April 06 2018, @07:10AM (2 children)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Friday April 06 2018, @07:10AM (#663311) Homepage
        > So once again we've managed to mix in *presentation* information into our *content* information.

        Which reminds me of another way this could be performed. Why use invisible characters when you can simply use lookalike characters? Define a set of characters you think look identical (there are probably 8 ASCII letters which have Cyrillic lookalikes), and every time you encounter one of these, encode a bit. Bonus points for also using "fi" vs. the "fi" ligature, which is one of my pet peeves about PDFs: normal English words get mangled into Unicode monstrosities that propagate through copy/paste. Don't impose your kerning on me (which, to rub shit into the wound, I also find butt-ugly); I just want the text, you know, the *portable* stuff.
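
        A rough sketch of the idea (the Cyrillic pairs are just ones I think pass for their Latin twins; untested):

            // Latin letters with near-identical Cyrillic twins; every occurrence
            // of one of these letters in the text can smuggle out one bit.
            const LOOKALIKES = { a: 'а', e: 'е', o: 'о', c: 'с', p: 'р', x: 'х', y: 'у' };

            // Bit '1' swaps in the Cyrillic twin, bit '0' keeps the Latin original.
            function embedBits(text, bits) {
              let i = 0;
              return [...text]
                .map(ch => (ch in LOOKALIKES && i < bits.length && bits[i++] === '1')
                  ? LOOKALIKES[ch]
                  : ch)
                .join('');
            }

        Decoding is just scanning the leaked copy for which of those letters came back Cyrillic.
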
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 2) by coolgopher on Friday April 06 2018, @07:56AM (1 child)

          by coolgopher (1157) on Friday April 06 2018, @07:56AM (#663325)

          Doesn't the P in PDF stand for "Painful"?

          • (Score: 2) by FatPhil on Friday April 06 2018, @08:41AM

            by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Friday April 06 2018, @08:41AM (#663339) Homepage
            Painful Document Fucker works for me!
            --
            Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 2) by takyon on Friday April 06 2018, @07:57AM

        by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday April 06 2018, @07:57AM (#663326) Journal

        Some of the languages in Unicode are too complex to represent the way we do with English, so they need special control characters.

        Joiners for emoji are pretty much a no-brainer, massively increasing what can be represented while taking up fewer code points.

        I really don't see a problem.

        --
        [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
      • (Score: 3, Interesting) by darkfeline on Saturday April 07 2018, @03:33AM (2 children)

        by darkfeline (1030) on Saturday April 07 2018, @03:33AM (#663646) Homepage

        Uh, not really?

        Disregarding emoticon bullshit, the use cases for ZWS and ZWNJ cited by GP add semantic information to the content that could not be added post hoc by the rendering layer without guessing or some additional encoding, markup or formatting language.

        I mean, technically all non-ASCII and non-printable characters could be considered *presentation*, as you say, rather than *content*, by encoding everything as UTF-8 and then into base64. But I don't think most cultures appreciate being told that their language must be encoded as ASCII gibberish at the *content* level and rendered at the *presentation* level to be decipherable, and in any case you're back to square one, where you have to standardize an additional markup/formatting/encoding language on top of Unicode and UTF-N.

        --
        Join the SDF Public Access UNIX System today!
        • (Score: 2) by coolgopher on Saturday April 07 2018, @10:41AM (1 child)

          by coolgopher (1157) on Saturday April 07 2018, @10:41AM (#663718)

          Sure, we've already seen how pleasant punycode is with the IDNs.

          We're still trying to come to grips with how to deal with content and presentation sanely, though. On one end of the spectrum you have a fully-rendered image, conveying both exactly as the originator provided (subject to scale, colour profile, etc.), and on the other you have, well, what do you have? A bunch of standalone symbols in a well-defined interchange format which can be strung together to form meaning? A bunch of symbols together with how-to-string-them-together information? The image end of the spectrum is seriously painful to machine-process; the other direction a lot less so, until you want to do it correctly, at which point you inevitably discover that you're dealing with a flawed model.

          Going even more meta, all of this is already a lossy encoding of the intended meaning of the originator (as recorded speech would be). How much lossiness in each encoding layer (idea -> speech/mental-speech -> text & presentation -> text encoding) is acceptable? How much can we compensate for with good design?

          I really don't have good answers - as I wrote above, the mix of content and presentation seems to be an innate property. It still *feels* like it should be possible to design a better model though.

          Do we really need to mix in language/script-specific rules/mechanics into languages/scripts where they don't belong?

          • (Score: 2) by darkfeline on Saturday April 07 2018, @09:27PM

            by darkfeline (1030) on Saturday April 07 2018, @09:27PM (#663811) Homepage

            Yes, deciding what counts and doesn't count as a script character to be added to Unicode is difficult, and the Unicode Consortium hasn't been doing the best job, but I don't think it's debatable that some number of non-printable or "control" characters will need to be included. Put simply, in the general case a language could contain all manner of idiosyncratic rules for its written script, and Unicode should be capable of representing them faithfully.

            --
            Join the SDF Public Access UNIX System today!