SoylentNews
SoylentNews is people
https://soylentnews.org/

Title    Always Use UTF-8 and Always Label Your HTML to Say So
Date    Sunday February 23 2020, @10:30AM
Author    martyb
https://soylentnews.org/article.pl?sid=20/02/22/1758201

canopic jug submitted a story which was the inspiration for:

Helsinki-based software developer Henri Sivonen has written a pair of blog posts about UTF-8: why it should be used, and how to inform the user agent when it is used.

The first blog post explains problems that can arise when UTF-8 is used without explicitly stating so. Here is a short selection from Why Supporting Unlabeled UTF-8 in HTML on the Web Would Be Problematic:

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I'm writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup.

Legacy Content Won't Be Opting Out

First of all, there is the "Support Existing Content" design principle. Browsers can't just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can't realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.
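As an illustration of the quoted point (not taken from the blog post itself), the short Python sketch below shows what happens when the same bytes are decoded under different assumptions. The helper name `decode_page` and the sample word are hypothetical, and windows-1252 stands in for "a legacy encoding" purely as an example.

```python
def decode_page(raw: bytes, assumed_encoding: str) -> str:
    """Decode page bytes the way a browser would after guessing an encoding.

    Malformed sequences become U+FFFD instead of raising, which is roughly
    what a user sees as a replacement character on the page.
    """
    return raw.decode(assumed_encoding, errors="replace")


if __name__ == "__main__":
    # Hypothetical unlabeled legacy page, saved as windows-1252 years ago.
    legacy = "café".encode("cp1252")      # b'caf\xe9'
    # Hypothetical newly authored page, saved as UTF-8 but also unlabeled.
    modern = "café".encode("utf-8")       # b'caf\xc3\xa9'

    # Legacy default: the old page is fine, the new page turns into mojibake.
    print(decode_page(legacy, "cp1252"))  # café
    print(decode_page(modern, "cp1252"))  # cafÃ©

    # UTF-8 default: the new page is fine, the old page is damaged.
    print(decode_page(modern, "utf-8"))   # café
    print(ascii(decode_page(legacy, "utf-8")))  # 'caf\ufffd'
```

Whichever default is chosen, one of the two unlabeled pages breaks, which is why the quoted argument ends up requiring new content to opt in rather than old content to opt out.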

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn't realistic to get all legacy pages to opt out.
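For reference, a minimal page carrying all three opt-ins named above might look like the string in this Python sketch. The page content and the helper `missing_opt_ins` are hypothetical, and the check is a naive substring test rather than a real HTML parse.

```python
# A minimal document with the three opt-ins: doctype, charset, viewport.
MINIMAL_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Example</title>
</head>
<body><p>Grüße, 世界</p></body>
</html>
"""


def missing_opt_ins(html: str) -> list:
    """Return the names of opt-ins that a naive substring check cannot find."""
    lowered = html.lower()
    checks = {
        "doctype": lowered.lstrip().startswith("<!doctype html>"),
        "charset": '<meta charset="utf-8">' in lowered
        or "<meta charset=utf-8>" in lowered,
        "viewport": 'name="viewport"' in lowered,
    }
    return [name for name, found in checks.items() if not found]


if __name__ == "__main__":
    print(missing_opt_ins(MINIMAL_PAGE))                  # []
    print(missing_opt_ins("<html><head></head></html>"))  # all three missing
```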

The second blog post explains how to tell the user agent explicitly that the current document is encoded as UTF-8. From Always Use UTF-8 & Always Label Your HTML Saying So:

To avoid having to deal with escapes (other than for <, &, and "), to avoid data loss in form submission, to avoid XSS when serving user-provided content, and to comply with the HTML Standard, always encode your HTML as UTF-8. Furthermore, in order to let browsers know that the document is UTF-8-encoded, always label it as such. To label your document, you need to do at least one of the following:

Doing more than one of these is OK.
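As a rough sketch of what a label can look like in practice, the Python below checks a response for any of three labeling mechanisms that browsers recognize: a `charset=utf-8` parameter on the HTTP `Content-Type` header, a UTF-8 byte order mark at the start of the document, or a `<meta charset="utf-8">` tag within the first 1024 bytes. Treating these three as the relevant options is an assumption of this example, the function name `is_labeled_utf8` is hypothetical, and the matching is deliberately simplistic compared with a real browser's encoding sniffing.

```python
import codecs


def is_labeled_utf8(body: bytes, content_type: str = "") -> bool:
    """Rough check for a UTF-8 label; not a substitute for a real sniffer."""
    # 1. HTTP layer: e.g. "Content-Type: text/html; charset=utf-8".
    #    (Quoted parameter values are not handled in this sketch.)
    if "charset=utf-8" in content_type.lower().replace(" ", ""):
        return True
    # 2. Byte order mark: the document starts with the bytes EF BB BF.
    if body.startswith(codecs.BOM_UTF8):
        return True
    # 3. Meta tag: the HTML Standard requires the declaration to fall
    #    within the first 1024 bytes of the document.
    head = body[:1024].lower()
    return b'<meta charset="utf-8"' in head or b"<meta charset=utf-8" in head


if __name__ == "__main__":
    page = b'<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>'
    print(is_labeled_utf8(page))                                    # True
    print(is_labeled_utf8(b"<p>hi</p>"))                            # False
    print(is_labeled_utf8(b"<p>hi</p>", "text/html; charset=UTF-8"))  # True
```

Any one of the three is enough for the check to pass, which matches the remark that doing more than one is OK.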

NB: SoylentNews announced UTF-8 support on 2014-08-18 in Site Update: Slashcode 14.08 - Now With UTF-8 Support (And Other News), just 6 months after the site was launched! One of our developers volunteered to do the same implementation for Slashdot (the code for this site is a fork of the code that underlies Slashdot). The offer was declined. A quick check of Slashdot before posting this story still fails to show Unicode/UTF-8 support.

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte (2018)
Announcing UTF-8 Support on SoylentNews (2014)


Original Submission

Links

  1. "canopic jug" - https://soylentnews.org/~canopic+jug/
  2. "Why Supporting Unlabeled UTF-8 in HTML on the Web Would Be Problematic" - https://hsivonen.fi/utf-8-detection/
  3. "another writeup" - https://hsivonen.fi/label-utf-8/
  4. "Support Existing Content" - https://www.w3.org/TR/html-design-principles/#support-existing-content
  5. "Always Use UTF-8 & Always Label Your HTML Saying So" - https://hsivonen.fi/label-utf-8/
  6. "to comply with the HTML Standard" - https://html.spec.whatwg.org/multipage/semantics.html#charset
  7. "Site Update: Slashcode 14.08 - Now With UTF-8 Support (And Other News)" - https://soylentnews.org/article.pl?sid=14/08/18/1023215
  8. "Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte" - https://soylentnews.org/article.pl?sid=18/05/16/195221
  9. "Announcing UTF-8 Support on SoylentNews" - https://soylentnews.org/article.pl?sid=14/02/16/2220240
  10. "Original Submission" - https://soylentnews.org/submit.pl?op=viewsub&subid=39319

