posted by martyb on Sunday February 23 2020, @10:30AM

Helsinki-based software developer Henri Sivonen has written a pair of blog posts about UTF-8: why it should be used, and how to inform the user agent when it is used.

The first blog post explains problems that can arise when UTF-8 is used without explicitly stating so. Here is a short selection from Why Supporting Unlabeled UTF-8 in HTML on the Web Would Be Problematic:

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I'm writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup.

Legacy Content Won't Be Opting Out

First of all, there is the "Support Existing Content" design principle. Browsers can't just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can't realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn't realistic to get all legacy pages to opt out.

The second blog post explains how to tell the user agent explicitly that the current document uses UTF-8. Here is a selection from Always Use UTF-8 & Always Label Your HTML Saying So:

To avoid having to deal with escapes (other than for <, &, and "), to avoid data loss in form submission, to avoid XSS when serving user-provided content, and to comply with the HTML Standard, always encode your HTML as UTF-8. Furthermore, in order to let browsers know that the document is UTF-8-encoded, always label it as such. To label your document, you need to do at least one of the following:

  • Put <meta charset=utf-8> as the first thing after the <head> start tag (i.e. as the first child of head).

    The meta tag, including its ending > character, needs to be within the first 1024 bytes of the file. Putting it right after <head> is the easiest way to get this right. Do not put comments before <meta charset=utf-8>.

  • Configure your server to send the header Content-Type: text/html; charset=utf-8 on the HTTP layer.

  • Start the document with the UTF-8 BOM, i.e. the bytes 0xEF, 0xBB, and 0xBF.

Doing more than one of these is OK.
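As a rough illustration of the labeling rules quoted above, here is a small Python check (the helper name and logic are mine, not from the blog post): a document counts as labeled if it starts with the UTF-8 BOM, or if it declares <meta charset=utf-8> entirely within its first 1024 bytes. The HTTP Content-Type option is checked server-side and is not visible in the file itself, so it is out of scope here.

```python
# Illustrative check for the two in-file labeling options quoted
# above: a leading UTF-8 BOM (bytes 0xEF 0xBB 0xBF), or a
# <meta charset=utf-8> whose closing '>' falls within the first
# 1024 bytes of the document.

import re

UTF8_BOM = b"\xef\xbb\xbf"
# Loose pattern for <meta charset=utf-8>, with or without quotes.
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?utf-8["\']?\s*/?>',
                          re.IGNORECASE)

def is_labeled_utf8(html_bytes: bytes) -> bool:
    if html_bytes.startswith(UTF8_BOM):
        return True
    # Searching only the first 1024 bytes guarantees the whole tag,
    # including its ending '>', sits inside that window.
    return META_CHARSET.search(html_bytes[:1024]) is not None

labeled = b'<!DOCTYPE html><html><head><meta charset=utf-8><title>ok</title></head><body></body></html>'
unlabeled = b'<!DOCTYPE html><html><head><title>no label</title></head><body></body></html>'
print(is_labeled_utf8(labeled))               # True
print(is_labeled_utf8(unlabeled))             # False
print(is_labeled_utf8(UTF8_BOM + unlabeled))  # True: the BOM counts as a label
```

A real browser performs this scan on the byte stream before parsing, which is why the 1024-byte limit exists: the encoding must be known before much of the document has been read.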

NB: SoylentNews announced UTF-8 support on 2014-08-18: Site Update: Slashcode 14.08 - Now With UTF-8 Support (And Other News), just 6 months after the site was launched! One of our developers volunteered to do the implementation for Slashdot (the code for this site is a fork of the code that underlies Slashdot). The offer was declined. A quick check of Slashdot before posting this story still fails to show Unicode/UTF-8 support.

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte (2018)
Announcing UTF-8 Support on SoylentNews (2014)


  • (Score: 2) by darkfeline (1030) on Sunday February 23 2020, @12:20PM (#961373) (2 children)

    As someone who's written lexers/parsers, I have no idea WTF you're talking about. Character encoding comes way before lexing or parsing. By the time your lexer gets to it, all of the interesting Unicode language characters are already decoded. So yes, UTF-8 encoded text is faster to "parse" than expanding &-entities, because you don't have to parse it at all.

    And if you're talking about UTF-8 decoding being slower than parsing & entities, that is also completely wrong. The only state you need to decode UTF-8 is checking the first few bits of the byte to see how many extra bytes to read and glom their bits together to get the code point. A clever hacker could probably do it with a one-liner of bit ops. Meanwhile, parsing a freaking &-entity requires, you know, an actual parser, which is way more complicated than a few bit ops. strtol is pretty damn expensive compared to bit ops, but you should have already known that?
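The bit-level scheme this comment describes can indeed be sketched in a few lines of Python (the code is my illustration, not the commenter's): the leading byte's high bits say how many continuation bytes follow, and the 6-bit payloads are shifted together. Note that a conforming decoder must additionally reject overlong encodings, surrogates, out-of-range code points, and truncated or malformed input, all of which this sketch skips.

```python
# Minimal UTF-8 decode using only the bit tests described in the
# comment: inspect the leading byte to learn how many continuation
# bytes follow, then glom their low 6 bits together into a code
# point. Illustrative only -- no validation of overlong sequences,
# surrogates, or malformed input.

def decode_utf8(data: bytes) -> list[int]:
    code_points = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # 0xxxxxxx: ASCII, no extra bytes
            cp, extra = b, 0
        elif b >> 5 == 0b110:         # 110xxxxx: 1 continuation byte
            cp, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:        # 1110xxxx: 2 continuation bytes
            cp, extra = b & 0x0F, 2
        else:                         # 11110xxx: 3 continuation bytes
            cp, extra = b & 0x07, 3
        for j in range(1, extra + 1):
            cp = (cp << 6) | (data[i + j] & 0x3F)  # take 6 payload bits
        code_points.append(cp)
        i += extra + 1
    return code_points

print(decode_utf8("aé€😀".encode("utf-8")))
# [97, 233, 8364, 128512]
```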

  • (Score: 4, Informative) by FatPhil (863) on Sunday February 23 2020, @12:59PM (#961382)

    > Character encoding comes way before lexing or parsing.

    You clearly have no idea what you're talking about. "Encoding" is on the way out. "Decoding" is what lexers do with stuff on the way in.

    I'm defining "lexing" to be what you do on the input stream, namely that of octets, to create tokens. This is the C language version of what a 'character' is - a byte. If you're taking the more human definition of input characters to mean already-decoded into internal representation, so be it, but you're interposing an additional layer, which will affect efficiency. Note that you don't even have "characters" once you've done your UTF-8 decoding, you have 'code points'. Your 'code points' are no more 'characters' than my 'bytes' are according to the UC, so you can't pretend to be taking a superior stance.
  • (Score: 2) by FatPhil (863) on Sunday February 23 2020, @04:40PM (#961442)

    > parsing a freaking &-entity requires, you know, an actual parser

    That's not true either - that's a classic lexing job with a pretty simple grammar as long as you're prepared to be strict, and say a big "fuck you" to HTML5 which lets the entities not self-terminate, which is retarded. However, you shouldn't be interpreting such things at this layer anyway - the *token* you get is the whole string, you don't look inside it at that point. There's a chance you might not ever even need to look inside that string (for example if it's a property of a style that's never invoked).
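For what it's worth, the strict, self-terminating grammar the comment has in mind can be lexed with a short sketch (my own illustration, not the commenter's code): an entity token is & followed by a name or a #-number and a mandatory ;, and anything that doesn't match is passed through as plain text. The token is the whole string; nothing is interpreted at this layer.

```python
# Illustrative strict entity lexer: entities must self-terminate
# with ';' (the lenient HTML5 forms the comment objects to are
# simply treated as plain text). Yields (kind, text) tokens and
# never interprets an entity -- that can happen later, or never.

import re

# &name; | &#1234; | &#x1F600;
ENTITY = re.compile(r'&(?:[A-Za-z][A-Za-z0-9]*|#(?:[0-9]+|[xX][0-9A-Fa-f]+));')

def lex(text: str) -> list[tuple[str, str]]:
    tokens, pos = [], 0
    for m in ENTITY.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        tokens.append(("entity", m.group()))  # whole string kept as the token
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

print(lex("a &amp; b &#38; c &amp d"))
# [('text', 'a '), ('entity', '&amp;'), ('text', ' b '),
#  ('entity', '&#38;'), ('text', ' c &amp d')]
```

Note how the unterminated `&amp d` at the end falls through as plain text rather than being recovered the way an HTML5 parser would.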