Zero-width characters are invisible, ‘non-printing’ characters that are not displayed by the majority of applications. For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell? (Hint: paste the sentence into Diff Checker to see the locations of the characters!). These characters can be used to ‘fingerprint’ text for certain users.
Well, the original reason isn’t too exciting. A few years ago I was a member of a team that competed in tournaments across a variety of video games. The team had a private message board, used to post important announcements, amongst other things. Eventually these announcements would appear elsewhere on the web, posted to mock the team and, more significantly, making the message board useless for sharing confidential information and tactics.
The security of the site seemed pretty tight, so the theory was that a logged-in user was simply copying each announcement and posting it elsewhere. I created a script that allowed the team to invisibly fingerprint each announcement with the username of the user it was being displayed to.
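A minimal Python sketch of that kind of fingerprinting follows. It is not the author's script: the function names, the choice of U+200B and U+200C as the two "bits", and the insertion point after the first word are all assumptions made for illustration.

```python
# Hypothetical sketch: names and the bit-to-character mapping are assumptions,
# not the author's actual implementation.
ZERO = "\u200b"   # ZERO WIDTH SPACE stands for bit 0
ONE = "\u200c"    # ZERO WIDTH NON-JOINER stands for bit 1
ZERO_WIDTH = {ZERO, ONE}


def username_to_zero_width(username: str) -> str:
    """Encode the UTF-8 bytes of a username as a run of invisible characters."""
    bits = "".join(f"{byte:08b}" for byte in username.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def fingerprint(text: str, username: str) -> str:
    """Hide the encoded username directly after the first word of the text."""
    head, sep, tail = text.partition(" ")
    return head + username_to_zero_width(username) + sep + tail


def extract_username(text: str) -> str:
    """Recover the username from the zero-width characters found in the text."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in ZERO_WIDTH)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


announcement = "Practice is moved to Friday."
marked = fingerprint(announcement, "alice")
print(len(announcement), len(marked))   # the marked copy is longer but looks the same
print(extract_username(marked))
```

Under these assumptions, a leaked copy that still carries the invisible characters can be passed to `extract_username` to see which account it was served to. A copy that has been retyped or stripped of non-printing characters carries no fingerprint.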
I saw a lot of interest in zero-width characters following a recent post by Zach Aysan, so I thought I’d publish this method here, along with an interactive demo, to share with everyone. The code examples have been updated to use modern JavaScript, but the overall logic is the same.
(Score: 2) by NewNic on Thursday April 05 2018, @08:39PM (14 children)
I tried using the webpage https://umpox.github.io/zero-width-detection/ and it did not show my username. When I pasted it into Diff Checker, there were no zero-width characters.
I tried using Firefox and Chrome under Linux.
lib·er·tar·i·an·ism ˌlibərˈterēənizəm/ noun: Magical thinking that useful idiots mistake for serious political theory
(Score: 1, Funny) by Anonymous Coward on Thursday April 05 2018, @08:51PM
You need to be wearing your captain crunch secret decoder ring.
(Score: 0) by Anonymous Coward on Thursday April 05 2018, @09:35PM
Worked for me pasting into hexdump -C both in MSYS2/Windows and GNU/Linux.
(Score: 1) by speederaser on Thursday April 05 2018, @11:06PM (4 children)
In Firefox and PaleMoon you can see them by right-clicking -> view page source. The inserted string shows as: "& # 8 2 0 3 ;" without the double quotes or the actual spaces inserted.
(Score: 2) by maxwell demon on Friday April 06 2018, @06:41AM (3 children)
So in other words, it shows as “&#8203;”. Why not just write that?
But then, this is not because of Firefox/Palemoon, but because that's what the site delivers to the browser. If the site had decided to deliver actual Unicode characters instead of HTML entities, then your browser would not show entities in the source.
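The relationship between the entity in the page source and the character the browser renders can be checked with Python's standard `html` module. This is only an illustrative sketch; the variable names are made up.

```python
import html

entity = "&#8203;"              # what the page source contains
char = html.unescape(entity)    # what the browser turns it into

print(repr(char))               # repr() makes the invisible character visible
print(ord(char) == 8203 == 0x200B)

# To display the entity literally in an HTML comment, escape the ampersand.
literal = html.escape(entity)
print(literal)
```

Decimal 8203 is hexadecimal 200B, so the entity and U+200B ZERO WIDTH SPACE are the same character written two ways.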
The Tao of math: The numbers you can count are not the real numbers.
(Score: 1) by speederaser on Friday April 06 2018, @04:10PM (2 children)
Because "preview" made it invisible when I did it that way, even when I used "Plain Old Text". Just like it appears in preview on this post.
(Score: 2) by maxwell demon on Friday April 06 2018, @04:53PM
Hint: &amp;
The Tao of math: The numbers you can count are not the real numbers.
(Score: 2) by Osamabobama on Friday April 06 2018, @09:55PM
My favorite is when the html sorcery is correct, so preview looks fine, but the text-entry window version of the comment also gets changed, so hitting Submit (or Preview, again) will post something else.
For instance, if you want to show <i>html tags</i> in the post, you format your comment with escape characters so the correct tag shows up in the preview. However, the comment text also strips out the escape characters, so pressing Submit will then post the comment with the tags interpreted, rather than displayed.
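A rough Python model of that behaviour, assuming (hypothetically) that each Preview replaces the contents of the text box with the decoded text; the real site code is not shown in this thread.

```python
import html

wanted = "<i>html tags</i>"        # what should be displayed literally
typed = html.escape(wanted)        # what the commenter types into the box

# Assumed model of the bug: Preview puts the decoded text back into the box.
after_preview = html.unescape(typed)

print(typed)
print(after_preview)
print(after_preview == wanted)     # True means Submit would now post real tags
```

In this model the preview looks right, but the text that would be submitted is one level of escaping short.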
Each cycle is slightly different:
Appended to the end of comments you post. Max: 120 chars.
(Score: 3, Interesting) by FatPhil on Friday April 06 2018, @06:29AM (5 children)
$ echo -n 'For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell?' | od -c
0000000 F 342 200 213 o r e x a m 342 200 213 p l
0000020 e , I 342 200 231 v e i n s 342 200 213
0000040 e r t e d 1 0 z e 342 200 213 r o
0000060 - w i d t h s p a 342 200 213 c e s
0000100 i n 342 200 213 t o t h i 342 200 213 s
0000120 s e n t e n c e , c 342 200 213 a
0000140 n y o u t e l 342 200 213 342 200 213 l
0000160 ?
0000161
Or if you just want to count them:
$ echo -n 'For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell?' | tr -d '[[:print:]]' | wc -c
33
All of which means the author can't count.
However, you're misunderstanding him: he never claimed to have embedded your username in that sentence, only to have embedded some non-displaying characters.
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 3, Interesting) by maxwell demon on Friday April 06 2018, @06:55AM (1 child)
Actually, you can just look at it with less. Then you even get the Unicode code numbers in a readable form:
(Note that the Unicode code points are shown inverted, so you can distinguish them from an ASCII character sequence of the same form).
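For anyone without less to hand, a similar view can be approximated in a few lines of Python. This is a sketch only: the `<U+XXXX>` marker format and the function name `reveal` are choices made here, and the sample is just the first few words of the sentence.

```python
def reveal(text: str) -> str:
    """Replace every non-printable character with a visible <U+XXXX> marker."""
    # Note: str.isprintable() also treats newlines and tabs as non-printable.
    return "".join(
        ch if ch.isprintable() else f"<U+{ord(ch):04X}>" for ch in text
    )


sample = "F\u200bor exam\u200bple, I\u2019ve ins\u200berted"
print(reveal(sample))
```

The typographic apostrophe is left alone because Python classes it as printable; only the zero-width spaces are replaced.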
The Tao of math: The numbers you can count are not the real numbers.
(Score: 3, Interesting) by FatPhil on Friday April 06 2018, @08:00AM
At least less implemented the escaping functionality correctly, locale aware - when you unset LANG you'll get:
F<E2><80><8B>or exam<E2><80><8B>ple, I<E2><80><99>ve ins<E2><80><8B>erted 10 ze<E2><80><8B>ro-width spa<E2><80><8B>ces in<E2><80><8B>to thi<E2><80><8B>s sentence, c<E2><80><8B>an you tel<E2><80><8B><E2><80><8B>l?
Which turns unicode into moar garbage, which I think is fitting.
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 2) by maxwell demon on Friday April 06 2018, @08:19AM (1 child)
wc -c does not count characters, but bytes. With multi-byte character sets (like Unicode), the two are not the same. From the wc man page:
With the correct option, you get:
Well, it's still one too many, right? Well, no:
So tr considers that apostrophe as non-printable. It clearly is not a zero-width space, so there remain 10 zero-width spaces. Why is that? Well, let's look at it:
This is actually the following Unicode character:
Conclusions:
The Tao of math: The numbers you can count are not the real numbers.
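The byte-versus-character distinction discussed above can be checked with a short Python sketch. The sentence is rebuilt with explicit escapes, following the od dump earlier in the thread, so that the invisible characters survive copying; the variable names are illustrative.

```python
import unicodedata

# The example sentence, with each invisible character written as an escape.
sentence = (
    "F\u200bor exam\u200bple, I\u2019ve ins\u200berted 10 "
    "ze\u200bro-width spa\u200bces in\u200bto thi\u200bs "
    "sentence, c\u200ban you tel\u200b\u200bl?"
)

zwsp_count = sentence.count("\u200b")
non_ascii = [ch for ch in sentence if ord(ch) > 127]
non_ascii_bytes = sum(len(ch.encode("utf-8")) for ch in non_ascii)

print(len(sentence), len(sentence.encode("utf-8")))   # characters vs bytes
print(zwsp_count, len(non_ascii), non_ascii_bytes)

# Name and general category of each distinct non-ASCII character.
for ch in sorted(set(non_ascii)):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

This reports 10 zero-width spaces and 11 non-ASCII characters taking 33 bytes in UTF-8, which matches both counts in the thread: the eleventh character is the apostrophe, U+2019 RIGHT SINGLE QUOTATION MARK.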
(Score: 2) by FatPhil on Friday April 06 2018, @08:49AM
1) The author uses specific quotation marks as apostrophes. That's as big a mistake as miscounting would have been.
2) Once I'd od'd it, which indeed is what I did first, I saw they were all 3-byte, so 33 bytes tells me exactly the same information as 11 characters. I would also have been interested in knowing about non-utf8 byte sequences, even if they would invalidate the stream (but Postel's law...)
3) That's a weird one, I presume a standard library is used, and that should get things right (as the unicode consortium provide an explicit list of all the classes). Someone who gives a fuck about unicode should file a bug report. (So not me.)
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 0) by Anonymous Coward on Friday April 06 2018, @03:06PM
did anyone else think their username would show up?
from an example on another site that was posted here, and that was described as a means of hiding spaces, not of revealing usernames? the forum in the story used a different but similar technique to identify its leaker under controlled circumstances, among logged-in users of that site, not logged-in users of some other site the author probably hasn't visited.
i want to know if the writing is a bad example of instruction or if the expectation was not widespread
(Score: 2) by Rivenaleem on Friday April 06 2018, @01:02PM
I dunno, it worked fine for me.