Always Use UTF-8 and Always Label Your HTML to Say So

Accepted submission by canopic jug at 2020-02-22 06:43:23

Helsinki-based software developer Henri Sivonen has written a pair of blog posts about why web pages must specify UTF-8 encoding explicitly and why just having web browsers assume UTF-8 would not work.

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I’m writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup.

Legacy Content Won’t Be Opting Out

First of all, there is the “Support Existing Content” design principle. Browsers can’t just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can’t realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.
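
To make the point about unlabeled legacy content concrete, here is a minimal Python sketch of what goes wrong when a decoder simply assumes UTF-8. The sample text and the choice of windows-1252 as the legacy encoding are assumptions made for illustration; they do not come from the blog post, and a real browser's decoding path is considerably more involved.

```python
# Hypothetical sample: non-ASCII text saved in a legacy encoding.
# windows-1252 is assumed here only as a typical example.
text = "Hyvää päivää"
legacy_bytes = text.encode("windows-1252")

# Strict UTF-8 decoding rejects these bytes outright.
try:
    legacy_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Lenient decoding substitutes U+FFFD replacement characters,
# which is the garbled output a reader would see.
lossy = legacy_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the page was actually written in
# recovers the original text.
correct = legacy_bytes.decode("windows-1252")

print(decoded_ok)
print(lossy)
print(correct)
```

Since the bytes themselves carry no label, a decoder that defaults to UTF-8 has no way to fall back to the right legacy encoding for such a page.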

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn’t realistic to get all legacy pages to opt out.
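
As a rough illustration of the three opt-ins named above, the following Python sketch holds a minimal page template and checks that each declaration is present. The template, the variable names, and the checks are hypothetical. The 1024-byte window reflects the HTML Standard's requirement that the encoding declaration appear within the first 1024 bytes of the document; that detail is not stated in the quoted post.

```python
# Hypothetical minimal page carrying all three opt-ins.
MINIMAL_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Example</title>
</head>
<body>
<p>Hyvää päivää</p>
</body>
</html>
"""

# Serialize as UTF-8 so the bytes match the label.
page_bytes = MINIMAL_PAGE.encode("utf-8")

# The encoding declaration should sit within the first 1024 bytes.
head = page_bytes[:1024]

has_doctype = page_bytes.lower().startswith(b"<!doctype html>")  # No-Quirks Mode
has_charset = b'<meta charset="utf-8">' in head                  # UTF-8 label
has_viewport = b'name="viewport"' in page_bytes                  # small screens

print(has_doctype, has_charset, has_viewport)
```

These are simple substring checks and not a real HTML parser, so they only recognize the exact spellings used in the template.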

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte (2018)
Announcing UTF-8 Support on SoylentNews (2014)
