posted by martyb on Sunday February 23 2020, @10:30AM

Helsinki-based software developer, Henri Sivonen, has written a pair of blog posts about UTF-8; why it should be used and how to inform the user agent when it is used.

The first blog post explains problems that can arise when UTF-8 is used without explicitly stating so. Here is a short selection from Why Supporting Unlabeled UTF-8 in HTML on the Web Would Be Problematic:

UTF-8 has won. Yet, Web authors have to opt in to having browsers treat HTML as UTF-8 instead of the browsers Just Doing the Right Thing by default. Why?

I'm writing this down in comprehensive form, because otherwise I will keep rewriting unsatisfactory partial explanations repeatedly as bug comments again and again. For more on how to label, see another writeup.

Legacy Content Won't Be Opting Out

First of all, there is the "Support Existing Content" design principle. Browsers can't just default to UTF-8 and have HTML documents encoded in legacy encodings opt out of UTF-8, because there is unlabeled legacy content, and we can't realistically expect the legacy content to be actively maintained to add opt-outs now. If we are to keep supporting such legacy content, the assumption we have to start with is that unlabeled content could be in a legacy encoding.

In this regard, <meta charset=utf-8> is just like <!DOCTYPE html> and <meta name="viewport" content="width=device-width, initial-scale=1">. Everyone wants newly-authored content to use UTF-8, the No-Quirks Mode (better known as the Standards Mode), and to work well on small screens. Yet, every single newly-authored HTML document has to explicitly opt in to all three, since it isn't realistic to get all legacy pages to opt out.

The second blog post explains how one explicitly communicates to the user agent that UTF-8 is employed in the current document. Always Use UTF-8 & Always Label Your HTML Saying So:

To avoid having to deal with escapes (other than for <, &, and "), to avoid data loss in form submission, to avoid XSS when serving user-provided content, and to comply with the HTML Standard, always encode your HTML as UTF-8. Furthermore, in order to let browsers know that the document is UTF-8-encoded, always label it as such. To label your document, you need to do at least one of the following:

  • Put <meta charset=utf-8> as the first thing after the <head> start tag (i.e. as the first child of head).

    The meta tag, including its ending > character, needs to be within the first 1024 bytes of the file. Putting it right after <head> is the easiest way to get this right. Do not put comments before <head>.

  • Configure your server to send the header Content-Type: text/html; charset=utf-8 on the HTTP layer.

  • Start the document with the UTF-8 BOM, i.e. the bytes 0xEF, 0xBB, and 0xBF.

Doing more than one of these is OK.
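
As a rough illustration of the three labeling methods listed above, here is a minimal Python sketch that reports which labels a document carries. The function name `utf8_labels` and the simplified regular expression are assumptions made for this example; the check is far cruder than the prescan a real browser performs, and it ignores the older `http-equiv` form of the meta tag.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Simplified pattern (an assumption for this sketch): <meta charset=utf-8>,
# with optional quotes around the value and an optional trailing slash.
META_RE = re.compile(rb"<meta\s+charset\s*=\s*[\"']?utf-8[\"']?\s*/?>", re.IGNORECASE)


def utf8_labels(body: bytes, content_type: str = "") -> list[str]:
    """Return which of the three UTF-8 labels are present, in a fixed order."""
    found = []
    # The whole tag, including the closing '>', must sit in the first 1024 bytes.
    if META_RE.search(body[:1024]):
        found.append("meta")
    # HTTP layer: Content-Type: text/html; charset=utf-8
    if re.search(r"charset\s*=\s*\"?utf-8", content_type, re.IGNORECASE):
        found.append("http")
    # UTF-8 BOM: the bytes 0xEF, 0xBB, 0xBF at the very start.
    if body.startswith(UTF8_BOM):
        found.append("bom")
    return found
```

A page whose meta tag is pushed past the 1024-byte mark, for example by a long comment, would report no "meta" label here, which is the failure mode the advice about comments is meant to prevent.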

NB: SoylentNews announced UTF-8 support on 2014-08-18: Site Update: Slashcode 14.08 - Now With UTF-8 Support (And Other News), just 6 months after the site was launched! One of our developers volunteered to do the implementation for Slashdot as well (the code for this site is a fork of the code that underlies Slashdot). The offer was declined. A quick check before posting this story still fails to show Unicode/UTF-8 support there.

Earlier on SN:
Validating UTF-8 Strings Using As Little As 0.7 Cycles Per Byte (2018)
Announcing UTF-8 Support on SoylentNews (2014)


  • (Score: 5, Insightful) by FatPhil on Sunday February 23 2020, @11:58AM (21 children)

    by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @11:58AM (#961369) Homepage
    If you have to handle any &-escaped entities at all, which you do, then for most content it's no slower to parse Unicode expressed as &entities; in plain ASCII than it is to parse the raw UTF-8-encoded non-ASCII. This is because UTF-8 has a way more complicated state machine for defining what a token is (in ASCII it's trivial, basically it must be 7-bit and not junk control codes). So the initial supporting argument, the one one would expect they think is strongest if they're leading with it:
      To avoid having to deal with escapes (other than for <, &, and ")
    I deny. You still have to deal with escapes - they even admit that themselves. So the "Use ASCII everywhere" argument is not dominated by the above reasoning.

    Then again, I'm explicitly racist when it comes to character sets. It's our western internet, invented in ASCII for ASCII use. If they want to play on our internet, they should have bent to our system, rather than bending our system to their whims. Or invented their own internet (they can use all of layers 1-3 for free, probably layer 4, it's only as you get up closer to the application layer that any concept of "text" becomes important).

    My opinion has been strengthened by the Unicode Consortium breaking the single thing that they were responsible for with their not-even-alphabets soup. While it was in the hands of the computer scientists, it was fine (such as the UTF-8 multi-octet encoding algorithm) - as soon as it reached international committees of me-toos who just wanted to incorporate 71 different kitchen sinks into the standard, it became a target of ridicule. I'm all for letting people express themselves how they want, but having just ASCII has been sufficient for that for billions of people for decades (it kept the linguists in a job too, as they could come up with a new transliteration scheme, each worse than the previous, every 5 years). The extension of that to incorporate extant (including archaic ones, as their study is still current) languages was also fine, minority languages need their support, they have the right to coexist as equals. It's the making shit up bit that I object to. It's turned into nothing more than clip-art now. J.K. Rowling didn't need wizard's hats in her books. She got by with stilted English sentences in ASCII. When you start throwing just any old nonsense into the thing that was supposed to help support and enable mature, precise, real world communication, you dilute any good that you had previously done, you undo the advancements previously made.

    So, yeah, handle it - you ought to be handling it nowadays, as you need to - it's infested everywhere. But perhaps the way of handling it is by filtering it out.

    I also have objections to their other objections, but this post is way too long already.
    --
    Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 2) by darkfeline on Sunday February 23 2020, @12:20PM (2 children)

      by darkfeline (1030) on Sunday February 23 2020, @12:20PM (#961373) Homepage

      As someone who's written lexers/parsers, I have no idea WTF you're talking about. Character encoding comes way before lexing or parsing. By the time your lexer gets to it, all of the interesting Unicode language characters are already decoded. So yes, UTF-8 encoded text is faster to "parse" than expanding &-entities, because you don't have to parse it at all.

      And if you're talking about UTF-8 decoding being slower than parsing & entities, that is also completely wrong. The only state you need to decode UTF-8 is checking the first few bits of the byte to see how many extra bytes to read and glom their bits together to get the code point. A clever hacker could probably do it with a one-liner of bit ops. Meanwhile, parsing a freaking &-entity requires, you know, an actual parser, which is way more complicated than a few bit ops. strtol is pretty damn expensive compared to bit ops, but you should have already known that?
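
For readers who want to see the bit operations this comment has in mind, here is a minimal Python sketch that decodes a single code point. The function name `decode_code_point` is made up for this example, and the sketch deliberately skips the checks a conforming decoder also needs (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_code_point(data: bytes, i: int = 0) -> tuple[int, int]:
    """Decode one code point starting at data[i]; return (code_point, byte_length)."""
    b0 = data[i]
    if b0 < 0x80:                 # 0xxxxxxx: plain ASCII, one byte
        return b0, 1
    if b0 >> 5 == 0b110:          # 110xxxxx: lead byte of a 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # 1110xxxx: lead byte of a 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # 11110xxx: lead byte of a 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid lead byte")
    if i + n > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1 : i + n]:
        if b >> 6 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)   # glue six payload bits onto the result
    return cp, n
```

The only state carried between bytes is the expected length and the partially assembled code point, which is the point the comment is making; whether that is cheaper than entity handling in a given parser is a separate measurement question.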

      --
      Join the SDF Public Access UNIX System today!
      • (Score: 4, Informative) by FatPhil on Sunday February 23 2020, @12:59PM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @12:59PM (#961382) Homepage
        > Character encoding comes way before lexing or parsing.

        You clearly have no idea what you're talking about. "Encoding" is on the way out. "Decoding" is what lexers do with stuff on the way in.

        I'm defining "lexing" to be what you do on the input stream, namely that of octets, to create tokens. This is the C language version of what a 'character' is - a byte. If you're taking the more human definition of input characters to mean already-decoded into internal representation, so be it, but you're interposing an additional layer, which will affect efficiency. Note that you don't even have "characters" once you've done your utf-8 decoding, you have 'code points'. Your 'code points' are no more 'characters' than my 'bytes' are according to the UC, so you can't pretend to be taking a superior stance.
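
The distinction between bytes, code points, and user-perceived characters drawn in this comment can be seen in a few lines of Python. This is only an illustrative sketch; the variable names are arbitrary.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one perceived character.
s = "e\u0301"

n_bytes = len(s.encode("utf-8"))   # octets on the wire
n_code_points = len(s)             # Python's len() counts code points

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", s)
```

Here the same visible letter occupies 3 bytes and 2 code points before normalization and 1 code point after it, so neither count corresponds to "characters" in the everyday sense.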
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 2) by FatPhil on Sunday February 23 2020, @04:40PM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @04:40PM (#961442) Homepage
        > parsing a freaking &-entity requires, you know, an actual parser

        That's not true either - that's a classic lexing job with a pretty simple grammar as long as you're prepared to be strict, and say a big "fuck you" to HTML5 which lets the entities not self-terminate, which is retarded. However, you shouldn't be interpreting such things at this layer anyway - the *token* you get is the whole string, you don't look inside it at that point. There's a chance you might not ever even need to look inside that string (for example if it's a property of a style that's never invoked).
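
A sketch of the strict approach described here, in Python: entities are recognized as whole tokens by a regular expression and are not interpreted at this stage. The token names and the `lex` function are assumptions for this example, and the grammar is the strict one the comment argues for (a terminating semicolon is required), not what HTML5 actually permits.

```python
import re

# Strict grammar: named, decimal, or hexadecimal reference, always ';'-terminated.
TOKEN_RE = re.compile(
    r"(?P<entity>&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
    r"|(?P<text>[^&]+)"
    r"|(?P<stray>&)"          # an ampersand that does not start a valid entity
)


def lex(text: str) -> list[tuple[str, str]]:
    """Split text into (kind, lexeme) tokens without decoding the entities."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]
```

The entity token keeps its raw spelling, so looking up what `&eacute;` means can be deferred until, and only if, the string is actually needed.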
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Funny) by FatPhil on Sunday February 23 2020, @12:43PM (4 children)

      by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @12:43PM (#961376) Homepage
      Heck, I know I'm going down in flames, so I might as well get something else off my chest...

      Anyone who uses:
          -webkit-font-feature-settings: 'liga' 1, 'dlig' 1, 'kern' 1;
          -moz-font-feature-settings: 'liga' 1, 'dlig' 1, 'kern' 1;
          -ms-font-feature-settings: 'liga' 1, 'dlig' 1, 'kern' 1;
          -o-font-feature-settings: 'liga' 1, 'dlig' 1, 'kern' 1;
          font-feature-settings: 'liga' 1, 'dlig' 1, 'kern' 1;
      is a complete tosser. That's from Sivonen's home page. Nothing uglifies webpages more than those horrible 'st' ligatures.

      But that doesn't make his arguments wrong, of course. It's just that he's outed himself as a tosser, nothing more.
      --
      Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 2) by takyon on Sunday February 23 2020, @12:58PM (1 child)

        by takyon (881) <takyonNO@SPAMsoylentnews.org> on Sunday February 23 2020, @12:58PM (#961380) Journal

        I have not seen that. Is that in the CSS standard?

        https://developer.mozilla.org/en-US/docs/Web/CSS/font-feature-settings [mozilla.org]

        CSS Fonts Module Level 3
        The definition of 'font-feature-settings' in that specification. Candidate Recommendation

        --
        [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
        • (Score: 0) by Anonymous Coward on Sunday February 23 2020, @07:18PM

          by Anonymous Coward on Sunday February 23 2020, @07:18PM (#961509)

          I have not seen that. Is that in the CSS standard?

          Most likely not, because all of these:

          • -webkit-font-feature-settings:
          • -moz-font-feature-settings:
          • -ms-font-feature-settings:
          • -o-font-feature-settings:

          Are rendering-engine-specific CSS style flags to allow for new style flags to be 'tested' before incorporation in the standards. Note these web standards are really only 'standards' in the most basic sense. The reality is that they are "let's make a common standard for what all the various engines are already doing, differently". I.e., a real world example of this XKCD: XKCD on Standards [xkcd.com].

          So this last one:

          • font-feature-settings:

          Is likely what will end up getting incorporated into some later revision of CSS. But for now it is ignored and one of the others turns the misfeature on in your particular browser. But reality is that the st ligature is so damn ugly the whole bit should be tossed into the dust bin and forgotten.

      • (Score: 2, Interesting) by Anonymous Coward on Sunday February 23 2020, @12:58PM

        by Anonymous Coward on Sunday February 23 2020, @12:58PM (#961381)

        +1 a thousand times.

        I had to go into dev tools and turn off those awful stylings to make the page readable.

        Those css styles simply should not exist.

      • (Score: 2) by maxwell demon on Sunday February 23 2020, @01:55PM

        by maxwell demon (1608) on Sunday February 23 2020, @01:55PM (#961389) Journal

        Yeah, replace "dlig" 1 with "dlig" 0, and everything gets readable again.

        --
        The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 2) by maxwell demon on Sunday February 23 2020, @12:56PM (2 children)

      by maxwell demon (1608) on Sunday February 23 2020, @12:56PM (#961379) Journal

      So the initial supporting argument, the one one would expect they think is strongest if they're leading with it:

      That's not how I learned it in school. Rather, I learned to start with the weakest and end with the strongest. That way the strongest is the one that stays in the mind of the reader after finishing the text.

      --
      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by The Mighty Buzzard on Sunday February 23 2020, @04:19PM

        by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Sunday February 23 2020, @04:19PM (#961429) Homepage Journal

        I prefer to start with the strongest and end there. Saves time and weaker arguments aren't necessary unless your strongest one fails.

        --
        My rights don't end where your fear begins.
      • (Score: 2) by FatPhil on Sunday February 23 2020, @04:31PM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @04:31PM (#961434) Homepage
        If you start with the strongest, and that's strong enough, you can stop there - bosh!
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 3, Informative) by Anonymous Coward on Sunday February 23 2020, @02:49PM (3 children)

      by Anonymous Coward on Sunday February 23 2020, @02:49PM (#961396)

      Then again, I'm explicitly racist when it comes to character sets. It's our western internet, invented in ASCII for ASCII use.

      This is not about the Internet. It's about the WWW, which is an application layer protocol on top of the Internet, and you Americans didn't invent that, and as such it was not invented solely for ASCII use. Tim Berners-Lee is British, and he designed the WWW at first to serve the needs of CERN, which is at the Franco-Swiss border and employs scientists from all over the world.

      But yes, the Unicode committee has really gone and done it with that mess of emojis that they keep adding to the Standard. There really was no need for more than a handful of those. Fine, add characters for every character set in use around the world, but cut it out with all of those bizarre symbols that are rapidly becoming the 21st Century CE hieroglyphics.

      • (Score: 2) by FatPhil on Sunday February 23 2020, @04:33PM (2 children)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @04:33PM (#961435) Homepage
        > you Americans

        Oi!
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 0) by Anonymous Coward on Sunday February 23 2020, @05:13PM (1 child)

          by Anonymous Coward on Sunday February 23 2020, @05:13PM (#961457)
          "ASCII" => American Standard Code for Information Interchange.
          • (Score: 3, Interesting) by DannyB on Monday February 24 2020, @06:50PM

            by DannyB (5839) Subscriber Badge on Monday February 24 2020, @06:50PM (#961926) Journal

            ASCII stupid question, get a stupid ANSI.

            --
            The people who rely on government handouts and refuse to work should be kicked out of congress.
    • (Score: 4, Touché) by isj on Sunday February 23 2020, @06:58PM (4 children)

      by isj (5249) on Sunday February 23 2020, @06:58PM (#961498) Homepage

      having just ASCII has been sufficient for that for billions of people for decades

      No, it is only sufficient for the English orthography used by 379 million (source: https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers). [wikipedia.org] Ok, also the Vatican state that has Latin as the official language.

      The rest of us have recognized that the latin alphabet is sufficient only for latin, and have added letters that represent sounds that we use.

      Even for english using only latin letters in orthography has problems: 26 letters to represent 24 consonants and 12-13 vowels. The introduction of the Gutenberg press damaged the English orthography. Because the typesets didn't have them the eth (ð) and thorn (þ) were lost. The Scots lost the yogh (Ȝ) (source: https://www.youtube.com/watch?v=4CtWyh49Mms). [youtube.com]

      I don't have a problem with using ascii for eg. programming language keywords, html keywords/tag-names, etc. It is just like musical notation where the song may be in German but the notation uses italian "keywords" like "forte", "piano", "con pedale". It's fine.

      As for the Unicode Consortium's addition of emojis and other stuff to Unicode: Their stated goal is to include all characters that are used for text and information. So of course Unicode includes the characters of the latin alphabet, extended latin alphabet, greek, ancient greek, mathematical symbols, hebrew, thai, logographs, typographical signs. They didn't include the fictional klingon characters initially because its users used transliteration. Only when the users started using non-transliterated characters did unicode get those added. Yes, I still find that silly but it is in line with the stated goal of unicode - people used them for text messages. More info: https://www.youtube.com/watch?v=5OPkGQoPeHk [youtube.com] (includes the really bad idea of an emoji-only chat application)

      • (Score: 2) by FatPhil on Sunday February 23 2020, @08:45PM (3 children)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @08:45PM (#961544) Homepage
        I'm glad that you don't have any bad memories of our Empire, I'll take that as a compliment. Lots and lots and lots of Indians in that empire, you know.

        You also seem to think that nobody dies or is born, which is weird.
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
        • (Score: 2) by FatPhil on Sunday February 23 2020, @08:47PM (2 children)

          by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday February 23 2020, @08:47PM (#961547) Homepage
          Having said that, the fact that you think that you can counter a "has been" with an "is" shows you're really not very good at anything remotely approaching logic.
          --
          Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
          • (Score: 3, Touché) by isj on Sunday February 23 2020, @09:36PM (1 child)

            by isj (5249) on Sunday February 23 2020, @09:36PM (#961568) Homepage

            The ASCII standard was published in 1963. Have there been billions (assuming short scale: 2.000.000.000) of people with English as their first language who have interacted with computers since then?

            As for India: There are 14 writing scripts used today (source: https://bharatbhashakosh.blogspot.com/2017/06/13-writting-scripts-used-for-indian.html [blogspot.com] ), the most well-known being Devanagari. If ASCII were sufficient then surely they wouldn't use the 13 other scripts?

            If, when you wrote "sufficient", you meant "can be transliterated reasonably to ASCII", then perhaps you should have a look at Euroenglish: https://www.englishforums.com/English/FiveYearPhasePlanEuroenglish/gvkwp/post.htm [englishforums.com] After you have read that you may have a better understanding of what non-english speakers think when an ignorant person asks if they can't just use ASCII. Note: Italians may argue that your obscure letters J/K/W/X/Y are silly - no one civilised uses those :-)

            • (Score: 2) by FatPhil on Monday February 24 2020, @01:04PM

              by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Monday February 24 2020, @01:04PM (#961807) Homepage
              > 1963

              Pfft! ASCII doesn't need to have been in existence for it to have been sufficient for their needs. English was the national language of India during the Raj (which it inherited from being the official language under prior EIC control), it's still officially a subsidiary official language of the independent country because it's the de facto inter-regional one, and many regions fought off the attempt to impose Hindi as a national language quite vociferously.

              > If ASCII were sufficient then surely they wouldn't use the 13 other scripts?

              Is there an attempt to have any logic in your arguments at all - you seem to be leaking more and more illogic with every post. If the burger and chips were sufficient for lunch, why did I have bacon and cheese today? Because I wanted to eat them, I had the right to eat them, and I could afford to eat them. But according to your logic, surely I wouldn't have the bacon cheeseburger. But I did. So your logic's worthless.

              > Euroenglish

              Pffft! Noah Webster and Mark Twain did this sensibly to death over a century ago, and the piss-takes go back nearly as far, e.g. Shield's letter to /The Economist/: http://guidetogrammar.org/grammar/twain.htm (wrongly attributed to Twain because Twain did indeed dabble in such matters, and was as much a troll at times).

              > you may have a better understanding of what non-english speakers think when an ignorant asks if they can't just use ASCII

              Pfffft! I've been living in countries whose character sets don't fit into ASCII *my whole life*. Your whole understanding of my motivation is based on a fantasy in your own head. When I'm chatting in a 7-bit medium, I often continue the decades-long convention of '6' being a letter here. Back in the 90s, {, }, and occasionally | were letters too.
              --
              Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
    • (Score: 1, Insightful) by Anonymous Coward on Monday February 24 2020, @09:58AM

      by Anonymous Coward on Monday February 24 2020, @09:58AM (#961766)

      Advocating an English-centric view in times of international communication. Amusing and horrible.

  • (Score: 0) by Anonymous Coward on Sunday February 23 2020, @02:20PM

    by Anonymous Coward on Sunday February 23 2020, @02:20PM (#961395)

    It is not as difficult as the article suggests. Everything that needs parsing is already plain ASCII; tags, escaped chars, attribute names, etc. are all ASCII. Only the content can have a different encoding, and this only matters when you are rendering the text. Assume UTF-8 and render the page. If you hit something that does not look like UTF-8, switch charsets and restart. There is no need to reload the page, only to redraw the content. If done on a per-span or per-block basis, you could easily support multiple charsets on a single page, with a small performance loss for pages that are not using UTF-8. If it is significant, it merely provides them an incentive to do the right thing and switch to UTF-8.
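
The "assume UTF-8, switch on failure" idea in this comment can be sketched in Python as follows. The function name `decode_with_fallback` and the choice of windows-1252 as the legacy encoding are assumptions for this example; this illustrates the commenter's proposal, not how any browser behaves.

```python
def decode_with_fallback(data: bytes, fallback: str = "windows-1252") -> tuple[str, str]:
    """Try strict UTF-8 first; on failure, redo the decode with a legacy encoding."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Python's windows-1252 codec leaves a few bytes undefined,
        # so substitute U+FFFD for those rather than raising again.
        return data.decode(fallback, errors="replace"), fallback
```

One caveat worth keeping in mind: some legacy byte sequences are also valid UTF-8 (the windows-1252 bytes for "Ã©" decode cleanly as "é"), so a successful UTF-8 decode does not prove that the author intended UTF-8.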

  • (Score: 2) by Mojibake Tengu on Sunday February 23 2020, @08:57PM (1 child)

    by Mojibake Tengu (8598) on Sunday February 23 2020, @08:57PM (#961548) Journal

    <meta name="viewport" content="width=device-width, initial-scale=1">

    I consider forcing a scale on the user a very bad practice, ask owners of 4K or 8K displays about it. Or, users of Retina devices.

    There's a good reason why I need to have a scale 2 default on all commodity browsers on my BSD desktop.
    Those web authors forcing scale 1 are shooting themselves in the foot, in the long term their webs will be unreadable for everyone.
    It's the same stupidity as seen on legacy GTK and Qt apps.
    At least Qt can workaround that, it's QT_SCALE_FACTOR=8 for me for common apps, on a 4K.

    --
    Respect Authorities. Know your social status. Woke responsibly.
    • (Score: 2) by isj on Sunday February 23 2020, @09:43PM

      by isj (5249) on Sunday February 23 2020, @09:43PM (#961575) Homepage

      The meta viewport is a kludge. It is used as a signal to mobile devices that they don't have to pretend they are a 640x480 pixel device, and instead can show the webpage in a readable manner. Unfortunately without the initial-scale=1 it doesn't work. I grudgingly added it to my website so it looks pretty on mobile devices. I would have preferred a simpler meta tag "yes-you-can-scale-and-reshape-to-fit" but instead we got the viewport tag.

      I don't know anything I can do to make it better for you.
