SoylentNews Comments | Be Careful What You Copy: Invisibly Inserting Usernames Into Text

Be Careful What You Copy: Invisibly Inserting Usernames Into Text

posted by Fnord666 on Thursday April 05 2018, @08:27PM

from the digital-fingerprints dept.

Arthur T Knackerbracket has found the following story:

Zero-width characters are invisible, ‘non-printing’ characters that are not displayed by the majority of applications. For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell? (Hint: paste the sentence into Diff Checker to see the locations of the characters!). These characters can be used to ‘fingerprint’ text for certain users.
Well, the original reason isn’t too exciting. A few years ago I was a member of a team that participated in competitive tournaments across a variety of video games. This team had a private message board, used to post important announcements amongst other things. Eventually these announcements would appear elsewhere on the web, posted to mock the team and, more significantly, making the message board useless for sharing confidential information and tactics.
The security of the site seemed pretty tight so the theory was that a logged-in user was simply copying the announcement and posting it elsewhere. I created a script that allowed the team to invisibly fingerprint each announcement with the username of the user it is being displayed to.
I saw a lot of interest in zero-width characters from a recent post by Zach Aysan so I thought I’d publish this method here along with an interactive demo to share with everyone. The code examples have been updated to use modern JavaScript but the overall logic is the same.
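The article's own JavaScript is not reproduced in this summary. As a rough illustration of the idea, the Python sketch below turns a username into bits and hides them as zero-width characters after the first word of the text. The choice of U+200B and U+200C for 0 and 1, the function names, and the insertion point are all assumptions made for this example, not details taken from the original script.

```python
# Illustrative mapping only: the original script may use other code points.
ZERO = "\u200b"  # ZERO WIDTH SPACE stands for bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER stands for bit 1


def username_to_mark(username: str) -> str:
    """Turn a username into invisible characters, 8 per UTF-8 byte."""
    bits = "".join(f"{byte:08b}" for byte in username.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def fingerprint(text: str, username: str) -> str:
    """Hide the mark right after the first word of the text."""
    head, sep, tail = text.partition(" ")
    return head + username_to_mark(username) + sep + tail


def extract_username(text: str) -> str:
    """Rebuild the username from whatever zero-width marks are in the text."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")
```

A marked announcement looks unchanged on screen, but calling `extract_username` on a leaked copy gives back the name of the account it was rendered for, provided the copy still contains the invisible characters.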


Can also be done with standard ASCII (Score: 5, Interesting)

by pipedwho (2032) on Thursday April 05 2018, @10:24PM (#663139)

Back before unicode we'd watermark text in all sorts of ways. When there's lots of text, it's much easier to hide the watermark.

We started doing things like adding an extra space or two before EOL, combinations of spaces and tabs, extra spaces after punctuation, and occasionally subbing in look-alike characters (1 or I for l, 0 for O, etc.).

To avoid detection, we escalated up to intentional misspellings at certain word offsets, along with word/phrase substitutions and insertions.

Since we knew the original text, we could determine where these changes happened, so we didn't need a system that could decode a message blind, without the source text. This is a method of differential steganography, where a message is encoded into the difference between the transmitted text and a known/shared source input.
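A minimal sketch of the differential idea, assuming the simplest mark from the list above: one optional trailing space per line. The function names are made up for illustration. Because the decoder holds the original, it only has to diff line by line; it never needs to parse the leaked text on its own.

```python
from typing import List


def watermark_lines(original: str, bits: List[int]) -> str:
    """Append one trailing space to line i when bits[i] is 1."""
    lines = original.split("\n")
    if len(bits) > len(lines):
        raise ValueError("not enough lines to carry every bit")
    marked = [
        line + " " if i < len(bits) and bits[i] else line
        for i, line in enumerate(lines)
    ]
    return "\n".join(marked)


def recover_bits(original: str, leaked: str, count: int) -> List[int]:
    """Compare the leaked copy with the known original, one line at a time."""
    pairs = zip(original.split("\n"), leaked.split("\n"))
    return [int(leak != orig) for orig, leak in pairs][:count]
```

This assumes the leaked copy keeps the original line breaks; a reflowed paragraph would erase this particular channel.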

When the transmitted message is very short, regular ASCII and text transformation doesn't leave a lot of bandwidth for hiding/encoding more than a few bits. However, using a few bits each over many messages, repeat offenders will soon provide sufficient bandwidth to encode a longer message such as a username or IP address, eventually identifying themselves over a longer period of time.
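To make the "few bits over many messages" point concrete, here is a sketch under invented assumptions: accounts have numeric IDs that fit in 16 bits, each message carries 2 of those bits, and message number n carries slice n of the ID. None of these numbers come from the comment. Every leaked message narrows down the set of accounts that could have produced it.

```python
from typing import Dict, List


def chunk_for_message(user_id: int, message_index: int,
                      bits_per_message: int = 2, width: int = 16) -> str:
    """Slice of the ID's bit string that this message would carry."""
    bits = format(user_id, f"0{width}b")
    start = (message_index * bits_per_message) % width
    return bits[start:start + bits_per_message]


def consistent_users(user_ids: List[int], observed: Dict[int, str],
                     bits_per_message: int = 2, width: int = 16) -> List[int]:
    """Accounts whose marks match every chunk recovered from leaked messages."""
    return [
        uid for uid in user_ids
        if all(
            chunk_for_message(uid, index, bits_per_message, width) == chunk
            for index, chunk in observed.items()
        )
    ]
```

With four accounts, one leaked message may rule out nobody, while a handful of leaks from the same account is usually enough to leave a single candidate.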

Thanks to Unicode, the bandwidth for this sort of sub-channel messaging is greatly increased. Easily detectable methods like zero size characters or other non-local language substitutions can be programmatically stripped/removed. However, that may not be an issue if the leaker is unaware of the watermark and is using some simple tool to scrape and copy.
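For the stripping side, one possible approach (an assumption of this sketch, not something the comment specifies) is to drop every character in Unicode general category Cf, which covers the zero-width space, zero-width joiners, and the byte order mark. Python's standard `unicodedata` module exposes that category.

```python
import unicodedata
from typing import List, Tuple


def find_format_chars(text: str) -> List[Tuple[int, str]]:
    """List (position, code point) for every invisible format character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text: str) -> str:
    """Return the text with all category-Cf characters removed."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Category Cf also contains characters with legitimate uses, such as soft hyphens and the joiners needed by some scripts and emoji sequences, so a blanket strip like this can alter ordinary text.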


Re:Can also be done with standard ASCII (Score: 3, Interesting)

by maxwell demon (1608) on Thursday April 05 2018, @10:36PM (#663148)

Of course you can combine several ways. For example, one user may be aware of the trailing space method and simply strip all trailing spaces away, but may be unaware of zero-width spaces. While another user might strip zero-width spaces away, but leave trailing spaces intact. If the information is encoded in both ways, then it will survive for both users.
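A small sketch of that redundancy, with one bit per line carried twice: once as a zero-width space after the first character and once as a trailing space. The layout and names are hypothetical, and the decoder assumes unmarked lines never end in a space.

```python
from typing import List

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def mark(lines: List[str], bits: List[int]) -> List[str]:
    """Write each 1 bit into both channels of its line."""
    marked = []
    for line, bit in zip(lines, bits):
        if bit:
            line = line[:1] + ZWSP + line[1:] + " "
        marked.append(line)
    return marked


def decode(lines: List[str]) -> List[int]:
    """A bit reads as 1 if either channel survived on that line."""
    return [int(ZWSP in line or line.endswith(" ")) for line in lines]
```

A user who trims trailing whitespace still leaks the zero-width marks, and a user who removes zero-width characters still leaks the trailing spaces; only someone who cleans both channels gets an unmarked copy.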

--
The Tao of math: The numbers you can count are not the real numbers.

- Re:Can also be done with standard ASCII (Score: 0)

  by Anonymous Coward on Thursday April 05 2018, @10:44PM (#663153)
  
  "If the information is encoded in both ways, then it will survive for both users."
  Instead of encoding the same information in two different ways, it is much more reliable to use an error-correcting code, using a variety of methods to store each code symbol (extra spaces, zero-width spaces, misspelled words, formatting changes, or what-have-you).
  
  - Re:Can also be done with standard ASCII (Score: 2)

    by pipedwho (2032) on Friday April 06 2018, @02:43AM (#663247)
    
    ECC is a great idea. It helps with multiple small messages where you're not sure which ones will get retransmitted. ECC does a better job with known erasures than just random errors, so it works well with this sort of system.
    Also, using multiples/duplication and interleaving along with ECC greatly improves recovery where a large share of the data (i.e. > 50%) is 'erased'. Especially useful when a whole section is stripped (e.g. all the zero-width characters, or all the spaces, etc.).
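To illustrate the erasure point, the sketch below uses a plain repetition code, the simplest error-correcting code, swapped in here for illustration only; a real system would more likely use a stronger code such as Reed-Solomon. The message is repeated whole, so each copy can ride on a different channel (say zero-width characters, trailing spaces, and spellings). A stripped channel shows up as `None`, a known erasure, and is simply left out of the vote.

```python
from typing import List, Optional


def encode(bits: List[int], copies: int = 3) -> List[int]:
    """Repeat the whole message; copy k of bit i sits at k * len(bits) + i."""
    return list(bits) * copies


def decode(received: List[Optional[int]], length: int) -> List[Optional[int]]:
    """Majority vote per bit over surviving copies; None marks an erasure."""
    message: List[Optional[int]] = []
    for i in range(length):
        votes = [s for s in received[i::length] if s is not None]
        ones = sum(votes)
        zeros = len(votes) - ones
        # No surviving copies, or a tie, means the bit cannot be decided.
        message.append(None if ones == zeros else int(ones > zeros))
    return message
```

With three copies, two whole channels can be erased (two thirds of the symbols) and the message still decodes, whereas the same code only tolerates one silently flipped copy per bit. That gap is the reason known erasures are easier to handle than random errors.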
    
    - Re:Can also be done with standard ASCII (Score: 1)

      by anubi (2828) on Friday April 06 2018, @06:42AM (#663302)
      
      Dump off to hardcopy, proofread, then OCR back to plaintext?
      
      --
      "Prove all things; hold fast that which is good." [KJV: I Thessalonians 5:21]
      
      - Re:Can also be done with standard ASCII (Score: 0)

        by Anonymous Coward on Friday April 06 2018, @12:01PM (#663388)
        
        This is what I thought too. I'd spellcheck after OCR. Also, perhaps screenshots would do instead of hardcopy.
        
      - Re:Can also be done with standard ASCII (Score: 2)

        by pipedwho (2032) on Friday April 06 2018, @09:02PM (#663547)
        
        Doesn’t get rid of phrase and word substitutions, which can be done manually (or to some degree automatically) on an A/B basis throughout the document. Then the system that allows access and stores both copies (or diff streams) finds the diff sections and gives you an A or B to encode a bit. The only way around this is to get two or more versions of the document and diff them yourself to remodulate those elements. Even retyping the document ‘in your own words’ may leak some inserted encoding (such as a modulated ‘fact’ like a percentage changed below its error bounds (eg 49.5 changed to 48.6), or a false but unimportant name/fact added to a list, etc).
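A sketch of the A/B scheme with a made-up sentence and made-up synonym pairs: each slot in the template has two interchangeable wordings, and the wording served to a reader encodes one bit. The decoder only looks for whole words, so the mark survives printing, OCR, and whitespace cleanup.

```python
import re
from typing import List, Optional, Sequence, Tuple

# Hypothetical document and A/B wordings, for illustration only.
TEMPLATE = "We {0} at dawn, carry a {1} banner and move {2}."
VARIANTS = [("begin", "start"), ("large", "big"), ("quickly", "rapidly")]


def render(template: str, variants: Sequence[Tuple[str, str]],
           bits: Sequence[int]) -> str:
    """Pick wording A for a 0 bit and wording B for a 1 bit in each slot."""
    words = [pair[bit] for pair, bit in zip(variants, bits)]
    return template.format(*words)


def decode(text: str, variants: Sequence[Tuple[str, str]]) -> List[Optional[int]]:
    """Read one bit per slot; None when neither or both wordings appear."""
    bits: List[Optional[int]] = []
    for a, b in variants:
        has_a = re.search(rf"\b{re.escape(a)}\b", text) is not None
        has_b = re.search(rf"\b{re.escape(b)}\b", text) is not None
        bits.append(None if has_a == has_b else int(has_b))
    return bits
```

Someone holding two differently marked copies can diff them, find the slots, and scramble the wording, which is the countermeasure described above.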
        
        
        Re:Can also be done with standard ASCII (Score: 2)

        by Osamabobama (5842) on Friday April 06 2018, @10:43PM (#663571)
        
        Some possible encoding can be stripped off with a round trip through a series of translators. That would leave numbers and names the same, though, presumably. If you have to try too hard to remove the data that may implicate you as the leaker, it may not be possible to leak useful information without becoming known.
        The logical end point of maximum actual risk with minimum detectable risk is where you are the only one with the document, but there is nothing encoded in the text. If you are going to play cat-and-mouse games, it's better to be the cat...
        
        --
        Appended to the end of comments you post. Max: 120 chars.
        
        
        Re:Can also be done with standard ASCII (Score: 2)

        by pipedwho (2032) on Saturday April 07 2018, @12:17AM (#663593)
        
        Or obtain the leaked info through someone else’s account, or a side channel without identifiable access. But, you’re right, it’s always better to be the cat.
        
