SoylentNews is people

posted by Fnord666 on Sunday March 21 2021, @12:05AM   Printer-friendly
from the who-remembers-mojibake? dept.

Unicode: On Building The One Character Set To Rule Them All:

Most readers will have at least some passing familiarity with the terms 'Unicode' and 'UTF-8', but what is really behind them? At their core they refer to character encoding schemes, also known as character sets. This is a concept which dates back to far beyond the era of electronic computers, to the dawn of the optical telegraph and its predecessors. As far back as the 18th century there was a need to transmit information rapidly across large distances, which was accomplished using so-called telegraph codes. These encoded information using optical, electrical and other means.

During the hundreds of years since the invention of the first telegraph code, there was no real effort to establish international standardization of such encoding schemes, with even the first decades of the era of teleprinters and home computers bringing little change there. Even as EBCDIC (IBM's 8-bit character encoding) and finally ASCII made some headway, the need to encode a growing collection of different characters without having to spend ridiculous amounts of storage on this was held back by a lack of elegant solutions.

Development of Unicode began during the late 1980s, when the increasing exchange of digital information across the world made the need for a singular encoding system more urgent than before. These days Unicode allows us to not only use a single encoding scheme for everything from basic English text to Traditional Chinese, Vietnamese, and even Mayan, but also small pictographs called 'emoji', from Japanese 'e' (絵) and 'moji' (文字), literally 'picture word'.

[...] The amazing thing is that in only 16 bits, Unicode managed to not only cover all of the Western writing systems, but also many Chinese characters and a variety of specialized symbols, such as those used in mathematics. With 16 bits allowing for 2^16 = 65,536 code points, the 7,129 characters of Unicode 1.0 fit easily, but by the time Unicode 3.1 rolled around in 2001, Unicode contained no fewer than 94,140 characters across 41 scripts.
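
The arithmetic is easy to check. A minimal Python sketch, using only the figures quoted in the article (the constant and function names are made up for illustration):

```python
# Figures quoted in the article; names are illustrative only.
CODE_POINTS_16_BIT = 2 ** 16   # number of values a 16-bit code unit can hold
UNICODE_1_0_CHARS = 7_129
UNICODE_3_1_CHARS = 94_140


def fits_in_16_bits(character_count: int) -> bool:
    """True if every character could get its own 16-bit code point."""
    return character_count <= CODE_POINTS_16_BIT


print(CODE_POINTS_16_BIT)                  # 65536
print(fits_in_16_bits(UNICODE_1_0_CHARS))  # True
print(fits_in_16_bits(UNICODE_3_1_CHARS))  # False
```

By Unicode 3.1 the repertoire had outgrown a single 16-bit range by roughly 28,600 characters.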

Currently, in version 13, Unicode contains a grand total of 143,859 characters, which does not include control characters. While originally Unicode was envisioned to only encode writing systems which were in current use, by the time Unicode 2.0 was released in 1996, it was realized that this goal would have to be changed, to allow even rare and historic characters to be encoded. In order to accomplish this without necessarily requiring every character to be encoded in 32 bits, Unicode changed to not only encode characters directly, but also using their components, or graphemes.

The concept is somewhat similar to vector drawings, where one doesn't specify every single pixel, but describes instead the elements which make up the drawing. As a result, the Unicode Transformation Format 8 (UTF-8) encoding supports 2^31 code points, with most characters in the current Unicode character set requiring generally one or two bytes each.
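
As a rough illustration of the variable-length scheme, the sketch below asks Python's built-in UTF-8 codec how many bytes a few sample code points need. The sample characters and the `utf8_length` helper are illustrative choices, not taken from the article. One caveat: the original UTF-8 design could address 2^31 code points using sequences of up to six bytes, but the current definition (RFC 3629) stops at U+10FFFF, so no character takes more than four bytes today.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode a single code point in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample code points, written as escapes so the source file encoding does not matter.
samples = {
    "A": "basic Latin (ASCII)",
    "\u00e9": "Latin small letter e with acute",
    "\u7d75": "CJK ideograph (the 'e' of emoji)",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```

ASCII stays at one byte, accented Latin letters take two, the CJK sample takes three, and the emoji takes four.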

[...] For those of us who enjoyed switching between ISO 8859 encodings in our email clients and web browsers in order to get something approaching the original text representation, consistent Unicode support came as a blessing. I can imagine a similar feeling among those who remember when 7-bit ASCII (or EBCDIC) was all one got, or enjoyed receiving digital documents from a European or US office, only to suffer through character set confusion.

Even if Unicode isn't without its issues, it's hard not to look back and feel that at the very least it's a decent improvement on what came before. Here's to another thirty years of Unicode.

Wikipedia entries for UNICODE and UTF-8.



This discussion has been archived. No new comments can be posted.
  • (Score: 1) by fustakrakich on Sunday March 21 2021, @12:46AM (7 children)

    by fustakrakich (6150) on Sunday March 21 2021, @12:46AM (#1126906) Journal

    Ban it to hell, and use your old code pages

    --
    La politica e i criminali sono la stessa cosa..
    • (Score: 2) by Runaway1956 on Sunday March 21 2021, @12:49AM (2 children)

      by Runaway1956 (2926) Subscriber Badge on Sunday March 21 2021, @12:49AM (#1126908) Journal

      Unicode is the devil's work!

      Oh, really! So, you're going to take all the credit and all the blame for unicode? Well, if you insist.

      --
      “I have become friends with many school shooters” - Tampon Tim Walz
      • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @07:33AM (1 child)

        by Anonymous Coward on Sunday March 21 2021, @07:33AM (#1127010)

        Αγγλικά χάλια!

        • (Score: 1) by fustakrakich on Sunday March 21 2021, @06:01PM

          by fustakrakich (6150) on Sunday March 21 2021, @06:01PM (#1127182) Journal

          Code page 737 for you, unless you prefer 869

          --
          La politica e i criminali sono la stessa cosa..
    • (Score: 5, Touché) by leon_the_cat on Sunday March 21 2021, @04:10AM

      by leon_the_cat (10052) on Sunday March 21 2021, @04:10AM (#1126962) Journal

      I have no idea what you wrote all i got was question marks

    • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @06:19AM

      by Anonymous Coward on Sunday March 21 2021, @06:19AM (#1126990)

      Codepage 437 forever!

      Codepage 850 is for you Euro-weenies.

    • (Score: 2) by KritonK on Sunday March 21 2021, @09:19AM

      by KritonK (465) on Sunday March 21 2021, @09:19AM (#1127022)

      Code pages!

      You kids had it easy.

      Back in my day we would use punch cards, where the only available letters were capital English. Living in Greece, we had to find creative ways of writing Greek (only capital letters, of course). To start with, some of the Greek characters look the same as English characters (e.g., Α and A), so we'd use those, instead. As for the rest, we'd replace some of the less frequently used special characters, such as [ and ], with Greek. We had fun finding even more creative ways to use those replaced characters when programming in languages that required them.

      Then we started using video terminals, which also had lower case letters. To use Greek (again only capital letters), we'd replace English lower case letters with Greek capital letters. Sheer luxury!

    • (Score: 2) by driverless on Sunday March 21 2021, @09:59AM

      by driverless (4770) on Sunday March 21 2021, @09:59AM (#1127039)

      by the time Unicode 3.1 rolled around in 2001, Unicode contained no less than 94,140 characters across 41 scripts. Currently, in version 13, Unicode contains a grand total of 143,859 characters,

      That's not "143,859 characters", it's 143,859 - 94,140 = 49,719 emojis and the rest are actual characters.

  • (Score: 4, Insightful) by Anonymous Coward on Sunday March 21 2021, @12:50AM (12 children)

    by Anonymous Coward on Sunday March 21 2021, @12:50AM (#1126909)

    So how much of the character space is now taken up by turds, fingers, happy faces of all colours and sizes and such? Is unicode starting to bloat?

    • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @01:02AM (3 children)

      by Anonymous Coward on Sunday March 21 2021, @01:02AM (#1126913)

      The article does end on a good note - another 30 years of Unicode and we can be free of it.

      I’m sticking with ASCII. Fuck Unicode, fuck emojis.

      • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @01:20AM

        by Anonymous Coward on Sunday March 21 2021, @01:20AM (#1126920)

        The article does end on a good note - another 30 years of Unicode and we can be free of it.

        Is that how long you have to live?

      • (Score: 3, Informative) by hendrikboom on Sunday March 21 2021, @01:50AM (1 child)

        by hendrikboom (1125) on Sunday March 21 2021, @01:50AM (#1126939) Homepage Journal

        ASCII, a 7-bit character code extended to eight bits by attaching a zero, is a proper subset of UTF-8.
        It means that a lot of programs that use ASCII can be used to process UTF-8 without much trouble.
        You only rarely need to know that a Japanese hiragana is a character instead of a string.

        -- hendrik
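
A small Python sketch of the point made in this comment, under the assumption that the program in question only cares about ASCII delimiters. The sample strings are invented for illustration.

```python
# Pure ASCII text produces identical bytes under both encodings.
text = "Plain ASCII text, bytes 0-127 only."
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)  # True

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so it can never be mistaken for an ASCII character such as ';'.
hiragana_bytes = "\u3072".encode("utf-8")  # HIRAGANA LETTER HI
high_bits_set = all(b >= 0x80 for b in hiragana_bytes)
print(high_bits_set)  # True

# A byte-oriented split on an ASCII delimiter therefore works unchanged
# on UTF-8 data, without the code knowing anything about hiragana.
record = "key=\u3072\u3089\u304c\u306a;next=1".encode("utf-8")
fields = [field.decode("utf-8") for field in record.split(b";")]
print(fields)
```

The splitting code treats the hiragana as opaque bytes, which is exactly the "string instead of a character" situation the comment describes.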

        • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @11:52PM

          by Anonymous Coward on Sunday March 21 2021, @11:52PM (#1127277)

          But is a transgender turd emoticon a string, or a sign of a civilization collapsing?

    • (Score: 2, Interesting) by fustakrakich on Sunday March 21 2021, @01:25AM

      by fustakrakich (6150) on Sunday March 21 2021, @01:25AM (#1126922) Journal

      Won't be long before the damn things are animated... then come the decoder rings.. come to think of it, that would be pretty cool, make smart watches look ancient

      --
      La politica e i criminali sono la stessa cosa..
    • (Score: 2) by rigrig on Sunday March 21 2021, @01:26AM (4 children)

      by rigrig (5129) <soylentnews@tubul.net> on Sunday March 21 2021, @01:26AM (#1126925) Homepage

      The point is that users will demand to send turds, fingers, happy faces of all colours and sizes and such. If those aren't added to Unicode, we would regress back from "just" handling UTF-8 strings to the mess where every messenger app and e-mail client uses its own (slightly different) format.

      --
      No one remembers the singer.
      • (Score: 4, Insightful) by Anonymous Coward on Sunday March 21 2021, @02:38AM (1 child)

        by Anonymous Coward on Sunday March 21 2021, @02:38AM (#1126952)

        ...turds, fingers, happy faces of all colours

        That's the thing! Healthy turds come only in one color, but users will demand red, blue, green, colorless, rainbow, see-through and 5500 other tones and shades, and Unicode will keep adding that frivolous useless kindergarten cartoon shit to Unicode.

        Here is the other thing! If emojis and turds were added to just one place/page/file/allocated bytes in Unicode it could be possible for intelligent users to easily filter that childish crap out, but they are added willy-nilly all over the place and filtering them requires someone to go through all the tens of thousands of pages to find the emoji code so they can be filtered. I'm glad Macs have one emoji font file that can quickly be deleted and your quality of life instantly and measurably increases.

        Unicode 15 years ago seemed like a great wonderful idea, now with all the useless cartoon shit they are adding it's turned into a good idea getting ruined and turned into a joke by woke-diversity-inclusion crap.

        • (Score: 4, Insightful) by aristarchus on Sunday March 21 2021, @07:12AM

          by aristarchus (2645) on Sunday March 21 2021, @07:12AM (#1127006) Journal

          You, my fellow Soylentil, are a complete douche accessible non-literate fungible coprolite. You think adding coding for other than your Imperialist Latin alphabet is a bad idea? We will throw non-English glyphs your way on a regular basis, as your Viking ancestors did years ago. And, we will inflict non-Latin glyphs on your school boys, making them read Greek, as well as Latin, while leaving them in ignorance of the Futhark. And, you will thank us for it, and learn to love the ϕ and the é, not to mention the "Μορόν Λαβἰα" and the ∅, which I have on good authority, means "Americans cannot pronounce this." Perhaps it is so.

      • (Score: 4, Informative) by FatPhil on Sunday March 21 2021, @11:54AM

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday March 21 2021, @11:54AM (#1127065) Homepage
        > every messenger app and e-mail client uses its own (slightly different) format.

        Just ask and it shall be granted unto you - Unicode already has that: https://en.wikipedia.org/wiki/Private_Use_Areas
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
      • (Score: 3, Insightful) by maxwell demon on Sunday March 21 2021, @11:11PM

        by maxwell demon (1608) on Sunday March 21 2021, @11:11PM (#1127261) Journal

        There could have been e.g. HTML tags standardized for those. Not everything needs to be in the character set. After all, there's also no Unicode character for bold (or is there?)

        --
        The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:34PM

      by Anonymous Coward on Sunday March 21 2021, @03:34PM (#1127135)

      So how much of the character space is now taken up by turds

      All of it.

    • (Score: 2) by hendrikboom on Sunday March 21 2021, @07:06PM

      by hendrikboom (1125) on Sunday March 21 2021, @07:06PM (#1127202) Homepage Journal

      Those emoji take up a very small fraction of the unicode space -- the full set of CJK characters is *huge*.

  • (Score: 5, Informative) by krishnoid on Sunday March 21 2021, @01:42AM

    by krishnoid (1156) on Sunday March 21 2021, @01:42AM (#1126932)

    This article [joelonsoftware.com] is the best one I've found on understanding everything that goes into Unicode. Contains a historical introduction plus a little background on some non-technical considerations that influenced it.

  • (Score: 2) by RamiK on Sunday March 21 2021, @01:43AM (29 children)

    by RamiK (1813) on Sunday March 21 2021, @01:43AM (#1126934)

    Nothing spells efficiency more than dedicating 32/128 signals to operating the teleprinter's carriage, paper feed.

    --
    compiling...
    • (Score: 2) by RamiK on Sunday March 21 2021, @01:45AM (12 children)

      by RamiK (1813) on Sunday March 21 2021, @01:45AM (#1126935)

      And bell.

      --
      compiling...
      • (Score: 2) by bzipitidoo on Sunday March 21 2021, @02:35AM (8 children)

        by bzipitidoo (4388) on Sunday March 21 2021, @02:35AM (#1126950) Journal

        The control characters should be repurposed! Most of them are useless, and bell was always downright obnoxious. The only control function really used any more is the CR/LF or CR or LF for the end of a line. Tab is a mess.

        Better to use control characters for some markup language capabilities.

        • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:25AM

          by Anonymous Coward on Sunday March 21 2021, @03:25AM (#1126957)

          The control characters should be repurposed!

          You misspelled truncated

          Better to use control characters for some markup language capabilities.

          Better to sit in the back of the plane next to Rosie O'Donnell and Tom Arnold

        • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @12:16PM (1 child)

          by Anonymous Coward on Sunday March 21 2021, @12:16PM (#1127069)

          There are quite a lot of communication protocols that still attach special meaning to control codes. They are still used and therefore they are still relevant.

          Ignorance is not a particularly good platform from which to spout nonsense.

          • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @12:21PM

            by Anonymous Coward on Sunday March 21 2021, @12:21PM (#1127070)

            ... and tabs always work for me.

            But if you don't know what you are doing then tabs will probably be "... a mess". Besides, advertising one's experience of messy tabs is simply advertising one's own incompetence.

        • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:38PM (4 children)

          by Anonymous Coward on Sunday March 21 2021, @03:38PM (#1127137)

          Tab is a mess.

          Ignorant python user detected. Fire photon torpedoes!

          • (Score: 2) by bzipitidoo on Sunday March 21 2021, @06:34PM (3 children)

            by bzipitidoo (4388) on Sunday March 21 2021, @06:34PM (#1127191) Journal

            Not at all. I love python's use of position to indicate structure. The problem is the means. ASCII is horrible at markup, but it's all we have. Python would be a lot better if leading spaces and tabs were completely unnecessary, and it would be so easy to do. Just need 2 control characters to mean "++indent" and "--indent".

            • (Score: 2) by Rich on Sunday March 21 2021, @11:43PM (2 children)

              by Rich (945) on Sunday March 21 2021, @11:43PM (#1127275) Journal

              Just need 2 control characters to mean "++indent" and "--indent".

              You weren't thinking of $0F and $0E by chance?

              • (Score: 2) by bzipitidoo on Monday March 22 2021, @10:43AM (1 child)

                by bzipitidoo (4388) on Monday March 22 2021, @10:43AM (#1127401) Journal

                Fitting though the names sound, shifting text to the right ("in") and left ("out") wasn't the original meaning of $0E and $0F. No, I am actually thinking $18 and $20 (ctrl-R and ctrl-T) would be the least disruptive choice of control characters. Shift in and out were used to swap in and out different character sets. Another possibility are the separators $1C through $1F. However, for ctrl-T I have in mind a universal close, an ender of any structure, not just an indentation level. T for Terminate Structure. Ctrl-T would be like a </> in HTML, an idea that actually exists in SGML, but was not carried over into HTML, while ctrl-R is closely analogous to <UL>. 3 of the 4 separators may be best employed as <TABLE> <TR> and <TD>, with ctrl-T also used to close a table.

                Mind though, that the only control character that really should remain untouched is LF, with CR a close 2nd. TAB, NULL, and ESC are the next group of control characters I'd leave untouched. Then ctrl-C, to interrupt execution of a program in a terminal, and BS and FF. Terminals also use ctrl-Q and ctrl-S, and ctrl-Z and ctrl-D, but those and ctrl-C could still retain their meanings in terminals while taking on a different meaning within a file. The use of other control characters is practically nonexistent. Anyway, need only a handful, less than a dozen control characters, to add some decent markup capabilities to ASCII of some of the sorts found in lightweight markup languages. The focus should be on structure-- on lists and tables, and not such things as fonts and italic, bold, or color settings. For that latter, we do have some ANSI escape sequences. I should like a text terminal to be capable of rendering it, and without having to perform a lot of calculation or scanning.

                • (Score: 2) by bzipitidoo on Monday March 22 2021, @05:54PM

                  by bzipitidoo (4388) on Monday March 22 2021, @05:54PM (#1127581) Journal

                  Whoops, that should be $12 and $14 for ctrl-R and ctrl-T. 18 and 20 are the decimal values, of course.
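
The corrected values can be double-checked with a short Python sketch. It relies on the usual convention that Ctrl plus a letter yields the letter's ASCII code with only the low five bits kept; the `ctrl` helper name is made up for illustration.

```python
def ctrl(letter: str) -> int:
    """ASCII control code for Ctrl+<letter>: keep only the low five bits."""
    return ord(letter.upper()) & 0x1F


for key in "RT":
    code = ctrl(key)
    print(f"ctrl-{key}: decimal {code}, hex ${code:02X}")

# The shift characters mentioned earlier in the thread:
print(f"SO (ctrl-N): ${ctrl('N'):02X}")  # $0E
print(f"SI (ctrl-O): ${ctrl('O'):02X}")  # $0F
```

This gives $12 (decimal 18) for ctrl-R and $14 (decimal 20) for ctrl-T, matching the correction.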

      • (Score: 2) by driverless on Sunday March 21 2021, @10:05AM (2 children)

        by driverless (4770) on Sunday March 21 2021, @10:05AM (#1127040)

        And bell.

        Don't forget the shifted version of this, e.g. SO / BEL / SI triggered the gong, if your teleprinter was fitted with one. Can't remember which shift sequence was used for the whistle.

        • (Score: 2) by RamiK on Sunday March 21 2021, @01:20PM

          by RamiK (1813) on Sunday March 21 2021, @01:20PM (#1127094)

          You jest but at least one teleprinter at a printing press was modified with a pneumatic whistle instead of a bell to be used in a noisy environment.

          --
          compiling...
        • (Score: 2) by Muad'Dave on Monday March 22 2021, @01:16PM

          by Muad'Dave (1413) on Monday March 22 2021, @01:16PM (#1127424)

          ... and Ctrl-D (EOT) that turned it off. We used to message ctrl-D to our gaming enemies to hobble them.

    • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @04:03AM (15 children)

      by Anonymous Coward on Sunday March 21 2021, @04:03AM (#1126961)

      I remember ASCII art.

      Is there Unicode art?

      • (Score: 2) by kazzie on Sunday March 21 2021, @05:50AM (12 children)

        by kazzie (5309) Subscriber Badge on Sunday March 21 2021, @05:50AM (#1126981)

        I think it's emojis. *shudder*

        • (Score: 1) by Anti-aristarchus on Sunday March 21 2021, @06:08AM (11 children)

          by Anti-aristarchus (14390) on Sunday March 21 2021, @06:08AM (#1126985) Journal

          The people who "shudder" at emojis, tend to do the same at Kanji, or Hebrew, or Devanagari. We need to have a coding that covers all human languages, no matter how much of a minority they are, because you never know when the glyph is the right glyph. No one has seen the first contact movie? Oh, it is Arrival [imdb.com], only five years old. Linguists, you needs them.

          • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @09:04AM (1 child)

            by Anonymous Coward on Sunday March 21 2021, @09:04AM (#1127021)

            The movie featuring aliens who refuse to speak math and respond to fibonacci sequences because that would make things too easy.

            • (Score: 2) by Immerman on Sunday March 21 2021, @02:08PM

              by Immerman (3985) on Sunday March 21 2021, @02:08PM (#1127109)

              As I recall the "magic key" that finally enabled communication was realizing that the aliens didn't experience time linearly, which would mean that any sort of sequential representation of anything might be unintelligible to them.

              Of course such sequences could be represented all at once as a static image, and it seems likely they would have at least the concept of a sequence. But perhaps they were just here to study the local animals and see if any had become sapient yet. I mean sure, some of the primates had clearly started making tools, but can an animal truly be called sapient when they're still limited to sequential thought?

          • (Score: 3, Insightful) by Dr Spin on Sunday March 21 2021, @09:44AM (1 child)

            by Dr Spin (5239) on Sunday March 21 2021, @09:44AM (#1127031)

            The people who "shudder" at emojis, tend to do the same at Kanji

            I am not sure that is true. Let's get this straight - it's not about Luddism - Unicode has its merits:

            For years I have argued that kanji should replace those appalling icons
            that are changed as soon as you have learned them. Kanji has not changed much in 4,000 years.
            Google rarely keep an icon for 3 months. I buy loads of stuff from Ebay, and the sellers often write
            their names with Kanji. I rarely write my name in Futhork, but prefer my Greek place names
            to be rendered correctly when driving in Greece. I am sure my Jewish neighbours would find it useful to
            be able to send Hebrew over their cellphones if only the Sanhedrin allowed them to use technology
            even if banned by the Amish.

            I am perfectly happy to use ;-) but I see no reason at all why it should be subject to auto-replacement by
            an evil robot who lives in the cellphone. Those emojis certainly make me shudder, and I rarely
            know what they represent anyway.

            (Actually, while I was writing this rant, someone close to me pointed out that Emojis are the only
            way stark illiterates can send text messages - and they probably account for 25% of cellphone users).

            Should I change my name to The UniToad?

            --
            Warning: Opening your mouth may invalidate your brain!
            • (Score: 2) by Dr Spin on Sunday March 21 2021, @09:46AM

              by Dr Spin (5239) on Sunday March 21 2021, @09:46AM (#1127035)

              I should point out that I grew up punching 5-hole paper tape with Baudot code using a stylus to punch the holes one at a time.

              (EDSAC 1) - it's not a good way to communicate.

              --
              Warning: Opening your mouth may invalidate your brain!
          • (Score: 2) by kazzie on Sunday March 21 2021, @10:30AM (5 children)

            by kazzie (5309) Subscriber Badge on Sunday March 21 2021, @10:30AM (#1127042)

            I have no qualms about any of the scripts you mentioned, and am specifically glad that Unicode means that there is now widespread support for two circumflex accents in my native language (ŵ, ŷ) that the old Code Pages ignored altogether.

            I don't tend to use many pictograms myself (beyond emoticons), but I really couldn't care less about arguments about "how come there's an emoji for aubergine/eggplants but not for parsnips?!"

            • (Score: 0, Disagree) by Anonymous Coward on Sunday March 21 2021, @12:34PM (4 children)

              by Anonymous Coward on Sunday March 21 2021, @12:34PM (#1127073)
              Languages change. The use of accented characters is neither needed nor anything more than an archaism.
              • (Score: 2) by Pino P on Sunday March 21 2021, @01:08PM (3 children)

                by Pino P (4721) on Sunday March 21 2021, @01:08PM (#1127086) Journal

                The use of accented characters is neither needed nor anything more than an archaism.

                What should be done instead to represent more than five vowel sounds or more than one tone?

                • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:55PM (2 children)

                  by Anonymous Coward on Sunday March 21 2021, @03:55PM (#1127148)

                  Nothing. Let the reader figure it out. Most people have dropped the é from “resumé” and the reader gets the meaning from context, such as “drop off your resume,” or the writer uses the quicker “drop off your cv”.

                  It’s not like people don’t easily deal with spoken ambiguous words, such as there/they’re/their, which all sound the same, from context. Drop the accents, the cedillas, the umlauts, schwas, etc. They were used to indicate how the written word is pronounced, but as the there/they’re/their example shows, you can either use different spellings, or as the resume/cv example shows, same spelling, meaning from context. Unionized can either be something that is not in an ionized state, or a bargaining collective - same spelling, different pronunciation based solely on context. Same with Polish - either a nationality or the act of making something shiny.

                  Accents are not needed. Even the French are starting to think about getting rid of them.

                  • (Score: 2) by aristarchus on Sunday March 21 2021, @09:36PM (1 child)

                    by aristarchus (2645) on Sunday March 21 2021, @09:36PM (#1127231) Journal

                    . Most people have dropped the é from “resumé”

                    Well, let's start again, then.

                    • (Score: 0) by Anti-aristarchus on Thursday March 25 2021, @12:34AM

                      by Anti-aristarchus (14390) on Thursday March 25 2021, @12:34AM (#1128596) Journal

                      Most people have dropped the é from “resumé”

                      Well, let's start again, then.

                      Are you saying we should resume the discussion about dropping the acute accent from “resumé”?

          • (Score: 2) by maxwell demon on Sunday March 21 2021, @11:04PM

            by maxwell demon (1608) on Sunday March 21 2021, @11:04PM (#1127259) Journal

            For me, the main problem with emojis is not that they contain pictograms; already the IBM PC character set contained such characters. My main problem with emojis is that they go beyond what a font is meant to be, by specifying colours. There are only two “colours” that have any place in a font: Foreground and background. After all, there's a good reason why we don't have a Unicode character for “DEEP SKY BLUE LETTER L”.

            There is no such problem with Kanji, Hebrew, Devanagari, or any of the other scripts. Or any of the “classic” pictograms.

            So to summarize:

            • This is absolutely OK: ॐ
            • And so is this: ᚹ
            • Also this is OK: ♥
            • But this is not: 💙
            --
            The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 3, Funny) by FatPhil on Sunday March 21 2021, @11:58AM (1 child)

        by FatPhil (863) <{pc-soylent} {at} {asdf.fi}> on Sunday March 21 2021, @11:58AM (#1127066) Homepage
        > Is there Unicode art?

        ¯\_(ツ)_/¯
        --
        Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
  • (Score: 0) by Anonymous Coward on Sunday March 21 2021, @02:30PM

    by Anonymous Coward on Sunday March 21 2021, @02:30PM (#1127112)

    I'll send 100 bitcoin to anyone who can put chinese and japanese variants of the same characters into a plain text document.

    Can't be done because Unicode is shit.

  • (Score: 2) by DannyB on Monday March 22 2021, @03:18PM (2 children)

    by DannyB (5839) Subscriber Badge on Monday March 22 2021, @03:18PM (#1127500) Journal

    To handle all of the possible emojis let's get UTF-640.

    640 bits per character. That ought to be enough for anybody!

    I propose that out of that immense space of characters, we carve out a "tiny" 2^64 bit space reserved for a group of characters that are an 8 by 8 square grid. (Call these squares pixels if you will, but they would be actual squares in an OpenType font that has these character glyphs.) There would be an 8x8 grid of squares with one glyph in the font for every possible combination of squares dark or light. (yes, font file sizes may be gigabytes, but hey, computers will be more powerful by the time UTF-640 is widely adopted.)

    Among all the hieroglyphs (emojis) will be every combination of 8x8 pixels, including what we once recognized as dot-matrix text, such as on a green screen CRT or dot matrix printer. Thus our hieroglyphs will still enable us to communicate meaningfully when you just can't find the right emoji or hieroglyph.

    Once we could communicate in 7-bit ASCII. But UTF-640 will be the way of the next generation.

    We could also carve out a space in UTF-640 to fit all the EBCDIC characters in their correct ordering so that only an upper 632-bit prefix is necessary on every EBCDIC byte!

    --
    The server will be down for replacement of vacuum tubes, belts, worn parts and lubrication of gears and bearings.
    • (Score: 0) by Anonymous Coward on Monday March 22 2021, @03:31PM (1 child)

      by Anonymous Coward on Monday March 22 2021, @03:31PM (#1127509)

      There are many alien languages in the universe.

      • (Score: 2) by DannyB on Monday March 22 2021, @04:19PM

        by DannyB (5839) Subscriber Badge on Monday March 22 2021, @04:19PM (#1127545) Journal

        UTF-640 should be enough for Sci Fi alien languages.

        I thought 640 bits per character ought to be enough for anybody?

        Are you suggesting it isn't? You may have just identified a problem.

        We need to begin development on an infinitely expandable variable length character set that can be mandated for universal use.

        --
        The server will be down for replacement of vacuum tubes, belts, worn parts and lubrication of gears and bearings.