SoylentNews is people

Always Use UTF-8 and Always Label Your HTML to Say So

Accepted submission by canopic jug at 2020-02-22 06:43:23
Software
Helsinki-based software developer Henri Sivonen has written a pair of blog posts explaining why web pages must explicitly declare UTF-8 encoding [hsivonen.fi] and why simply having web browsers assume UTF-8 would not work [hsivonen.fi].

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I’m writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup [soylentnews.org].

Legacy Content Won’t Be Opting Out

First of all, there is the “Support Existing Content [w3.org]” design principle. Browsers can’t just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can’t realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.
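
To illustrate the point about unlabeled legacy content, here is a minimal Python sketch of what happens when legacy bytes are read as UTF-8. The sample word, the choice of windows-1252 as the legacy encoding, and the helper name `decode_as` are assumptions for illustration only; they do not come from Sivonen's posts, and real browsers apply more elaborate detection rules.

```python
# Hypothetical body text of an unlabeled legacy page, stored as windows-1252.
# (The text and the encoding are illustrative assumptions.)
legacy_bytes = "caf\u00e9".encode("cp1252")  # the single byte 0xE9 stands for "é"


def decode_as(data: bytes, encoding: str) -> str:
    """Decode leniently: malformed input becomes U+FFFD instead of raising."""
    return data.decode(encoding, errors="replace")


# Decoded with the encoding it was written in, the text survives intact.
as_legacy = decode_as(legacy_bytes, "cp1252")

# Decoded under a blanket UTF-8 assumption, 0xE9 is a truncated multi-byte
# sequence, so the accented letter is lost.
as_utf8 = decode_as(legacy_bytes, "utf-8")

print(repr(as_legacy))
print(repr(as_utf8))
```

Since the legacy page carries no label and nobody maintains it, there is nothing in the document that could tell the browser to take the first path instead of the second.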

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn’t realistic to get all legacy pages to opt out.
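
As a rough sketch of what opting in to all three looks like for newly authored content, the Python snippet below builds a small page carrying the doctype, the charset declaration, and the viewport declaration, then serializes it as UTF-8. The template, the `lang="en"` attribute, and the function name `build_page` are hypothetical choices made for this example, not something prescribed by the quoted text.

```python
import html

# Skeleton with the three opt-ins: No-Quirks Mode, UTF-8, and small-screen layout.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{title}</title>
</head>
<body>
<p>{body}</p>
</body>
</html>
"""


def build_page(title: str, body: str) -> bytes:
    """Return the page as bytes, encoded to match its own charset label."""
    text = PAGE_TEMPLATE.format(title=html.escape(title), body=html.escape(body))
    # The label only helps if the bytes really are UTF-8, so encode accordingly.
    return text.encode("utf-8")


page = build_page("Example", "na\u00efve caf\u00e9")
print(page.decode("utf-8"))
```

The charset declaration is placed first inside the head so that it falls early in the byte stream, where browsers look for it before committing to an encoding.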

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte [soylentnews.org] (2018)
Announcing UTF-8 Support on SoylentNews [soylentnews.org] (2014)

