SoylentNews is people

Always Use UTF-8 and Always Label Your HTML to Say So

Accepted submission by canopic jug at 2020-02-22 06:43:23
Software
Helsinki-based software developer Henri Sivonen has written a pair of blog posts explaining why web pages must explicitly declare their UTF-8 encoding [hsivonen.fi] and why having web browsers simply assume UTF-8 would not work [hsivonen.fi].

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I’m writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup [soylentnews.org].

Legacy Content Won’t Be Opting Out

First of all, there is the “Support Existing Content [w3.org]” design principle. Browsers can’t just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can’t realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.
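To make the ambiguity concrete, here is a minimal sketch in Python. It is not from the original post: the sample string, the choice of windows-1252 as the legacy encoding, and both function names are assumptions picked for illustration. It shows what a reader would see if a browser guessed the wrong default in either direction.

```python
# Illustrative sample text; not taken from the original post.
SAMPLE = "Hyvää päivää"

# Assumption: windows-1252 stands in for "a legacy encoding".
legacy_bytes = SAMPLE.encode("windows-1252")  # what an old, unlabeled page might contain
utf8_bytes = SAMPLE.encode("utf-8")           # what a newly authored page should contain


def decode_as_utf8_default(data: bytes) -> str:
    """Hypothetical browser that assumes UTF-8; invalid bytes become U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_as_legacy_default(data: bytes) -> str:
    """Hypothetical browser that assumes windows-1252."""
    return data.decode("windows-1252")


if __name__ == "__main__":
    # Legacy page under a UTF-8 default: the non-ASCII letters are lost.
    print(decode_as_utf8_default(legacy_bytes))
    # UTF-8 page under a legacy default: classic mojibake.
    print(decode_as_legacy_default(utf8_bytes))
    # UTF-8 page read as UTF-8: round-trips cleanly.
    print(decode_as_utf8_default(utf8_bytes))
```

Neither default is safe for unlabeled bytes, which is why the label has to come from the document itself.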

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn’t realistic to get all legacy pages to opt out.
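As a rough sketch of what opting in to all three looks like, the Python snippet below builds a minimal page and returns it as UTF-8 bytes. The template layout, the `render_page` helper, and the sample strings are hypothetical; only the three opt-in lines come from the text above.

```python
import html

# Minimal template carrying the three opt-ins named above:
# the doctype, the charset label, and the viewport declaration.
MINIMAL_PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{title}</title>
</head>
<body>
<p>{body}</p>
</body>
</html>
"""


def render_page(title: str, body: str) -> bytes:
    """Hypothetical helper: fill the template and encode it as UTF-8.

    The bytes on disk must actually be UTF-8 for the label to be true.
    """
    text = MINIMAL_PAGE.format(title=html.escape(title), body=html.escape(body))
    return text.encode("utf-8")


if __name__ == "__main__":
    page = render_page("Näyte", "Hyvää päivää")
    print(page.decode("utf-8"))
```

The charset label is placed first inside the head so that it appears as early in the byte stream as possible.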

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte [soylentnews.org] (2018)
Announcing UTF-8 Support on SoylentNews [soylentnews.org] (2014)

