Unicode: On Building The One Character Set To Rule Them All:
Most readers will have at least some passing familiarity with the terms 'Unicode' and 'UTF-8', but what is really behind them? At their core they refer to character encoding schemes, also known as character sets. This is a concept which dates back to far beyond the era of electronic computers, to the dawn of the optical telegraph and its predecessors. As far back as the 18th century there was a need to transmit information rapidly across large distances, which was accomplished using so-called telegraph codes. These encoded information using optical, electrical and other means.
During the hundreds of years since the invention of the first telegraph code, there was no real effort to establish international standardization of such encoding schemes, and even the first decades of the era of teleprinters and home computers brought little change. Even as EBCDIC (IBM's 8-bit character encoding) and finally ASCII made some headway, efforts to encode a growing collection of different characters without spending ridiculous amounts of storage on them were held back by a lack of elegant solutions.
Development of Unicode began during the late 1980s, when the increasing exchange of digital information across the world made the need for a single encoding system more urgent than before. These days Unicode allows us not only to use a single encoding scheme for everything from basic English text to Traditional Chinese, Vietnamese, and even Mayan, but also to encode small pictographs called 'emoji', from Japanese 'e' (絵) and 'moji' (文字), literally 'picture character'.
[...] The amazing thing is that in only 16 bits, Unicode managed to not only cover all of the Western writing systems, but also many Chinese characters and a variety of specialized symbols, such as those used in mathematics. With 16 bits allowing for 2^16 = 65,536 code points, the 7,129 characters of Unicode 1.0 fit easily, but by the time Unicode 3.1 rolled around in 2001, Unicode contained no fewer than 94,140 characters across 41 scripts.
Currently, in version 13, Unicode contains a grand total of 143,859 characters, which does not include control characters. While Unicode was originally envisioned to encode only writing systems in current use, by the time Unicode 2.0 was released in 1996, it was realized that this goal would have to change to allow even rare and historic characters to be encoded. In order to accomplish this without necessarily requiring every character to be encoded in 32 bits, Unicode changed to encode characters not only directly, but also by way of their components, or graphemes.
The concept is somewhat similar to vector drawings, where one doesn't specify every single pixel, but instead describes the elements which make up the drawing. As a result, the Unicode Transformation Format 8 (UTF-8) encoding was designed to support up to 2^31 code points (it has since been restricted to the range ending at U+10FFFF), with each character requiring between one and four bytes and most Latin-script text needing only one or two.
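As a rough illustration of the two points above (variable-length UTF-8 and characters built from components), here is a minimal Python sketch. The helper name `utf8_len` and the sample characters are chosen for illustration only and do not come from the article. Note that modern UTF-8 is capped at U+10FFFF, so no character needs more than four bytes.

```python
import unicodedata

def utf8_len(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of `ch` occupies."""
    return len(ch.encode("utf-8"))

# One sample from each UTF-8 length class (illustrative picks).
samples = ["A", "é", "絵", "😀"]
lengths = {ch: utf8_len(ch) for ch in samples}

# The same visible letter can be one precomposed code point or a
# base letter followed by a combining mark.
composed = "\u00e9"                                   # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + U+0301 combining acute
recomposed = unicodedata.normalize("NFC", decomposed)

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
print([f"U+{ord(c):04X}" for c in decomposed])
```

Normalizing before comparing strings is the usual way to make the composed and decomposed spellings compare equal.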
[...] For those of us who enjoyed switching between ISO 8859 encodings in our email clients and web browsers in order to get something approaching the original text representation, consistent Unicode support came as a blessing. I can imagine a similar feeling among those who remember when 7-bit ASCII (or EBCDIC) was all one got, or enjoyed receiving digital documents from a European or US office, only to suffer through character set confusion.
Even if Unicode isn't without its issues, it's hard not to look back and feel that at the very least it's a decent improvement on what came before. Here's to another thirty years of Unicode.
Wikipedia entries for UNICODE and UTF-8.
(Score: 1) by fustakrakich on Sunday March 21 2021, @12:46AM (7 children)
Ban it to hell, and use your old code pages
Politics and criminals are the same thing..
(Score: 2) by Runaway1956 on Sunday March 21 2021, @12:49AM (2 children)
Oh, really! So, you're going to take all the credit and all the blame for unicode? Well, if you insist.
“I have become friends with many school shooters” - Tampon Tim Walz
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @07:33AM (1 child)
Αγγλικά χάλια!
(Score: 1) by fustakrakich on Sunday March 21 2021, @06:01PM
Code page 737 for you, unless you prefer 869
Politics and criminals are the same thing..
(Score: 5, Touché) by leon_the_cat on Sunday March 21 2021, @04:10AM
I have no idea what you wrote all i got was question marks
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @06:19AM
Codepage 437 forever!
Codepage 850 is for you Euro-weenies.
(Score: 2) by KritonK on Sunday March 21 2021, @09:19AM
Code pages!
You kids had it easy.
Back in my day we would use punch cards, where the only available letters were capital English. Living in Greece, we had to find creative ways of writing Greek (only capital letters, of course). To start with, some of the Greek characters look the same as English characters (e.g., Α and A), so we'd use those, instead. As for the rest, we'd replace some of the less frequently used special characters, such as [ and ], with Greek. We had fun finding even more creative ways to use those replaced characters when programming in languages that required them.
Then we started using video terminals, which also had lower case letters. To use Greek (again only capital letters), we'd replace English lower case letters with Greek capital letters. Sheer luxury!
(Score: 2) by driverless on Sunday March 21 2021, @09:59AM
That's not "143,859 characters", it's 143,859 - 94,140 = 49,719 emojis and the rest are actual characters.
(Score: 4, Insightful) by Anonymous Coward on Sunday March 21 2021, @12:50AM (12 children)
So how much of the character space is now taken up by turds, fingers, happy faces of all colours and sizes and such? Is unicode starting to bloat?
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @01:02AM (3 children)
The article does end on a good note - another 30 years of Unicode and we can be free of it.
I’m sticking with ASCII. Fuck Unicode, fuck emojis.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @01:20AM
Is that how long you have to live?
(Score: 3, Informative) by hendrikboom on Sunday March 21 2021, @01:50AM (1 child)
ASCII, a 7-bit character code extended to eight bits by attaching a zero, is a proper subset of UTF-8.
It means that a lot of programs that use ASCII can be used to process UTF-8 without much trouble.
You only rarely need to know that a Japanese hiragana is a character instead of a string.
-- hendrik
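The compatibility described in the comment above can be checked with a short Python sketch. The sample strings are made up for illustration. The second half shows why byte-oriented code written for ASCII tends to keep working: every byte of a multibyte UTF-8 sequence has its high bit set, so it can never be mistaken for an ASCII delimiter.

```python
# Pure ASCII text produces identical bytes under both encodings.
text = "Plain ASCII text, 7 bits per character."
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes
high_bit_clear = all(b < 0x80 for b in utf8_bytes)

# A byte-level split on an ASCII comma is safe on UTF-8 data.
line = "ひらがな,hiragana\n".encode("utf-8")
fields = line.rstrip(b"\n").split(b",")
decoded = [f.decode("utf-8") for f in fields]

# Each hiragana takes three bytes, all of them >= 0x80.
kana_bytes = "ひらがな".encode("utf-8")
kana_all_high = all(b >= 0x80 for b in kana_bytes)

print(same_bytes, high_bit_clear, decoded, len(kana_bytes), kana_all_high)
```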
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @11:52PM
But is a transgender turd emoticon a string, or a sign of a civilization collapsing?
(Score: 2, Interesting) by fustakrakich on Sunday March 21 2021, @01:25AM
Won't be long before the damn things are animated... then come the decoder rings.. come to think of it, that would be pretty cool, make smart watches look ancient
Politics and criminals are the same thing..
(Score: 2) by rigrig on Sunday March 21 2021, @01:26AM (4 children)
The point is that users will demand to send turds, fingers, happy faces of all colours and sizes and such. If those aren't added to Unicode, we would regress back from "just" handling UTF-8 strings to the mess where every messenger app and e-mail client uses its own (slightly different) format.
No one remembers the singer.
(Score: 4, Insightful) by Anonymous Coward on Sunday March 21 2021, @02:38AM (1 child)
...turds, fingers, happy faces of all colours
That's the thing! Healthy turds come only in one color, but users will demand red, blue, green, colorless, rainbow, see-through and 5500 other tones and shades, and Unicode will keep adding that frivolous useless kindergarten cartoon shit to Unicode.
Here is the other thing! If emojis and turds were added to just one place/page/file/allocated bytes in Unicode it could be possible for intelligent users to easily filter that childish crap out, but they are added willy-nilly all over the place and filtering them requires someone to go through all the tens of thousands of pages to find the emoji code so they can be filtered. I'm glad Macs have one emoji font file that can quickly be deleted and your quality of life instantly and measurably increases.
Unicode 15 years ago seemed like a great wonderful idea, now with all the useless cartoon shit they are adding it's turned into a good idea getting ruined and turned into a joke by woke-diversity-inclusion crap.
(Score: 4, Insightful) by aristarchus on Sunday March 21 2021, @07:12AM
You, my fellow Soylentil, are a complete douche accessible non-literate fungible coprolite. You think adding coding for other than your Imperialist Latin alphabet is a bad idea? We will throw non-English glyphs your way on a regular basis, as your Viking ancestors did years ago. And, we will inflict non-Latin glyphs on your school boys, making them read Greek, as well as Latin, while leaving them in ignorance of the Futhark. And, you will thank us for it, and learn to love the ϕ and the é, not to mention the "Μορόν Λαβἰα" and the ∅, which I have on good authority, means "Americans cannot pronounce this." Perhaps it is so.
(Score: 4, Informative) by FatPhil on Sunday March 21 2021, @11:54AM
Just ask and it shall be granted unto you - Unicode already has that: https://en.wikipedia.org/wiki/Private_Use_Areas
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
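For reference, the Private Use Areas linked above are three fixed code point ranges that Unicode promises never to assign. A minimal Python sketch, where the helper name `is_private_use` is hypothetical, tests membership:

```python
import unicodedata

# The three Private Use ranges: one in the BMP, plus planes 15 and 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

def is_private_use(ch: str) -> bool:
    """True if `ch` falls in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

# Number of code points in the BMP Private Use Area.
bmp_pua_size = 0xF8FF - 0xE000 + 1

print(is_private_use("\ue000"), is_private_use("A"), bmp_pua_size)
print(unicodedata.category("\ue000"))  # general category of a private-use code point
```

What a private-use code point means is entirely by agreement between sender and receiver, which is why it does not replace a shared standard for interchange.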
(Score: 3, Insightful) by maxwell demon on Sunday March 21 2021, @11:11PM
There could have been e.g. HTML tags standardized for those. Not everything needs to be in the character set. After all, there's also no Unicode character for bold (or is there?)
The Tao of math: The numbers you can count are not the real numbers.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:34PM
All of it.
(Score: 2) by hendrikboom on Sunday March 21 2021, @07:06PM
Those emoji take up a very small fraction of the unicode space -- the full set of CJK characters is *huge*.
(Score: 5, Informative) by krishnoid on Sunday March 21 2021, @01:42AM
This article [joelonsoftware.com] is the best one I've found on understanding everything that goes into Unicode. Contains a historical introduction plus a little background on some non-technical considerations that influenced it.
(Score: 2) by RamiK on Sunday March 21 2021, @01:43AM (29 children)
Nothing spells efficiency more than dedicating 32/128 signals to operating the teleprinter's carriage, paper feed.
compiling...
(Score: 2) by RamiK on Sunday March 21 2021, @01:45AM (12 children)
And bell.
compiling...
(Score: 2) by bzipitidoo on Sunday March 21 2021, @02:35AM (8 children)
The control characters should be repurposed! Most of them are useless, and bell was always downright obnoxious. The only control function really used any more is the CR/LF or CR or LF for the end of a line. Tab is a mess.
Better to use control characters for some markup language capabilities.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:25AM
You misspelled truncated
Better to sit in the back of the plane next to Rosie O'Donnell and Tom Arnold
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @12:16PM (1 child)
There are quite a lot of communication protocols that still attach special meaning to control codes. They are still used and therefore they are still relevant.
Ignorance is not a particularly good platform from which to spout nonsense.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @12:21PM
... and tabs always work for me.
But if you don't know what you are doing then tabs will probably be "... a mess". Besides, advertising one's experience of messy tabs is simply advertising one's own incompetence.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:38PM (4 children)
Ignorant python user detected. Fire photon torpedoes!
(Score: 2) by bzipitidoo on Sunday March 21 2021, @06:34PM (3 children)
Not at all. I love python's use of position to indicate structure. The problem is the means. ASCII is horrible at markup, but it's all we have. Python would be a lot better if leading spaces and tabs were completely unnecessary, and it would be so easy to do. Just need 2 control characters to mean "++indent" and "--indent".
(Score: 2) by Rich on Sunday March 21 2021, @11:43PM (2 children)
You weren't thinking of $0F and $0E by chance?
(Score: 2) by bzipitidoo on Monday March 22 2021, @10:43AM (1 child)
Fitting though the names sound, shifting text to the right ("in") and left ("out") wasn't the original meaning of $0E and $0F. No, I am actually thinking $18 and $20 (ctrl-R and ctrl-T) would be the least disruptive choice of control characters. Shift in and out were used to swap in and out different character sets. Another possibility are the separators $1C through $1F. However, for ctrl-T I have in mind a universal close, an ender of any structure, not just an indentation level. T for Terminate Structure. Ctrl-T would be like a </> in HTML, an idea that actually exists in SGML, but was not carried over into HTML, while ctrl-R is closely analogous to <UL>. 3 of the 4 separators may be best employed as <TABLE> <TR> and <TD>, with ctrl-T also used to close a table.
Mind though, that the only control character that really should remain untouched is LF, with CR a close 2nd. TAB, NULL, and ESC are the next group of control characters I'd leave untouched. Then ctrl-C, to interrupt execution of a program in a terminal, and BS and FF. Terminals also use ctrl-Q and ctrl-S, and ctrl-Z and ctrl-D, but those and ctrl-C could still retain their meanings in terminals while taking on a different meaning within a file. The use of other control characters is practically nonexistent. Anyway, need only a handful, less than a dozen control characters, to add some decent markup capabilities to ASCII of some of the sorts found in lightweight markup languages. The focus should be on structure-- on lists and tables, and not such things as fonts and italic, bold, or color settings. For that latter, we do have some ANSI escape sequences. I should like a text terminal to be capable of rendering it, and without having to perform a lot of calculation or scanning.
(Score: 2) by bzipitidoo on Monday March 22 2021, @05:54PM
Whoops, that should be $12 and $14 for ctrl-R and ctrl-T. 18 and 20 are the decimal values, of course.
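The correction above follows from how control codes are derived: a terminal produces ctrl-X by keeping only the low five bits of the letter's ASCII code. A small Python sketch, with the helper name `ctrl` chosen here for illustration, reproduces the values discussed in this subthread:

```python
def ctrl(letter: str) -> int:
    """ASCII control code produced by Ctrl plus the given letter."""
    return ord(letter.upper()) & 0x1F

codes = {name: ctrl(name) for name in "RTNOCDQSZ"}

for name, code in codes.items():
    print(f"ctrl-{name} = ${code:02X} (decimal {code})")
```

So ctrl-R and ctrl-T are $12 and $14 (decimal 18 and 20), while $0E and $0F (Shift Out and Shift In) correspond to ctrl-N and ctrl-O.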
(Score: 2) by driverless on Sunday March 21 2021, @10:05AM (2 children)
Don't forget the shifted version of this, e.g. SO / BEL / SI triggered the gong, if your teleprinter was fitted with one. Can't remember which shift sequence was used for the whistle.
(Score: 2) by RamiK on Sunday March 21 2021, @01:20PM
You jest but at least one teleprinter at a printing press was modified with a pneumatic whistle instead of a bell to be used in a noisy environment.
compiling...
(Score: 2) by Muad'Dave on Monday March 22 2021, @01:16PM
... and Ctrl-D (EOT) that turned it off. We used to message ctrl-D to our gaming enemies to hobble them.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @04:03AM (15 children)
I remember ASCII art.
Is there Unicode art?
(Score: 2) by kazzie on Sunday March 21 2021, @05:50AM (12 children)
I think it's emoji. *shudder*
(Score: 1) by Anti-aristarchus on Sunday March 21 2021, @06:08AM (11 children)
The people who "shudder" at emojis tend to do the same at Kanji, or Hebrew, or Devanagari. We need to have a coding that covers all human languages, no matter how much of a minority they are, because you never know when the glyph is the right glyph. No one has seen the first contact movie? Oh, it is Arrival [imdb.com], only five years old. Linguists, you needs them.
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @09:04AM (1 child)
The movie featuring aliens who refuse to speak math and respond to fibonacci sequences because that would make things too easy.
(Score: 2) by Immerman on Sunday March 21 2021, @02:08PM
As I recall the "magic key" that finally enabled communication was realizing that the aliens didn't experience time linearly, which would mean that any sort of sequential representation of anything might be unintelligible to them.
Of course such sequences could be represented all at once as a static image, and it seems likely they would have at least the concept of a sequence. But perhaps they were just here to study the local animals and see if any had become sapient yet. I mean sure, some of the primates had clearly started making tools, but can an animal truly be called sapient when they're still limited to sequential thought?
(Score: 3, Insightful) by Dr Spin on Sunday March 21 2021, @09:44AM (1 child)
The people who "shudder" at emojis, tend to do the same at Kanji
I am not sure that is true. Let's get this straight - it's not about Luddism - Unicode has its merits:
For years I have argued that kanji should replace those appalling icons
that are changed as soon as you have learned them. Kanji has not changed much in 4,000 years.
Google rarely keep an icon for 3 months. I buy loads of stuff from Ebay, and the sellers often write
their names with Kanji. I rarely write my name in Futhork, but prefer my Greek place names
to be rendered correctly when driving in Greece. I am sure my Jewish neighbours would find it useful to
be able to send Hebrew over their cellphones if only the Sanhedrin allowed them to use technology
even if banned by the Amish.
I am perfectly happy to use ;-) but I see no reason at all why it should be subject to auto-replacement by
an evil robot who lives in the cellphone. Those emojis certainly make me shudder, and I rarely
know what they represent anyway.
(Actually, while I was writing this rant, someone close to me pointed out that Emojis are the only
way stark illiterates can send text messages - and they probably account for 25% of cellphone users).
Should I change my name to The UniToad?
Warning: Opening your mouth may invalidate your brain!
(Score: 2) by Dr Spin on Sunday March 21 2021, @09:46AM
I should point out that I grew up punching 5-hole paper tape with Baudot code using a stylus to punch the holes one at a time.
(EDSAC 1) - it's not a good way to communicate.
Warning: Opening your mouth may invalidate your brain!
(Score: 2) by kazzie on Sunday March 21 2021, @10:30AM (5 children)
I have no qualms about any of the scripts you mentioned, and am specifically glad that Unicode means that there is now widespread support for two circumflex accents in my native language (ŵ, ŷ) that the old Code Pages ignored altogether.
I don't tend to use many pictograms myself (beyond emoticons), but I really couldn't care less about arguments about "how come there's an emoji for aubergine/eggplants but not for parsnips?!"
(Score: 0, Disagree) by Anonymous Coward on Sunday March 21 2021, @12:34PM (4 children)
(Score: 2) by Pino P on Sunday March 21 2021, @01:08PM (3 children)
What should be done instead to represent more than five vowel sounds or more than one tone?
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @03:55PM (2 children)
Nothing. Let the reader figure it out. Most people have dropped the é from “resumé” and the reader gets the meaning from context, such as “drop off your resume.” Or the writer uses the quicker “drop off your cv”.
It’s not like people don’t easily deal with spoken ambiguous words, such as there/they’re/their, which all sound the same, from context. Drop the accents, the cedillas, the umlauts, schwas, etc. They were used to indicate how the written word is pronounced, but as the there/they’re/their example shows, you can either use different spellings, or as the resume/cv example shows, same spelling, meaning from context. Unionized can either be something that is not in an ionized state, or a bargaining collective - same spelling, different pronunciation based solely on context. Same with Polish - either a nationality or the act of making something shiny.
Accents are not needed. Even the French are starting to think about getting rid of them.
(Score: 2) by aristarchus on Sunday March 21 2021, @09:36PM (1 child)
Well, let's start again, then.
(Score: 0) by Anti-aristarchus on Thursday March 25 2021, @12:34AM
Are you saying we should resume the discussion about dropping the acute accent from “resumé”?
(Score: 2) by maxwell demon on Sunday March 21 2021, @11:04PM
For me, the main problem with emojis is not that they contain pictograms; already the IBM PC character set contained such characters. My main problem with emojis is that they go beyond what a font is meant to be, by specifying colours. There are only two “colours” that have any place in a font: Foreground and background. After all, there's a good reason why we don't have a Unicode character for “DEEP SKY BLUE LETTER L”.
There is no such problem with Kanji, Hebrew, Devanagari, or any of the other scripts. Or any of the “classic” pictograms.
So to summarize:
The Tao of math: The numbers you can count are not the real numbers.
(Score: 3, Funny) by FatPhil on Sunday March 21 2021, @11:58AM (1 child)
¯\_(ツ)_/¯
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 1, Informative) by Anonymous Coward on Sunday March 21 2021, @06:46PM
Ask and it shall be given http://xahlee.info/comp/unicode_ascii_art.html [xahlee.info]
(Score: 0) by Anonymous Coward on Sunday March 21 2021, @02:30PM
I'll send 100 bitcoin to anyone who can put chinese and japanese variants of the same characters into a plain text document.
Can't be done because Unicode is shit.
(Score: 2) by DannyB on Monday March 22 2021, @03:18PM (2 children)
To handle all of the possible emojis let's get UTF-640.
640 bits per character. That ought to be enough for anybody!
I propose that out of that immense space of characters, we carve out a "tiny" 2^64 bit space reserved for a group of characters that are an 8 by 8 square grid. (Call these squares pixels if you will, but they would be actual squares in an OpenType font that has these character glyphs.) There would be an 8x8 grid of squares with one glyph in the font for every possible combination of squares dark or light. (yes, font file sizes may be gigabytes, but hey, computers will be more powerful by the time UTF-640 is widely adopted.)
Among all the hieroglyphs (emojis) will be every combination of 8x8 pixels, including what we once recognized as dot-matrix text, such as on a green screen CRT or dot matrix printer. Thus our hieroglyphs will still enable us to communicate meaningfully when you just can't find the right emoji or hieroglyph.
Once we could communicate in 7-bit ASCII. But UTF-640 will be the way of the next generation.
We could also carve out a space in UTF-640 to fit all the EBCDIC characters in their correct ordering so that only an upper 632-bit prefix is necessary on every EBCDIC byte!
The server will be down for replacement of vacuum tubes, belts, worn parts and lubrication of gears and bearings.
(Score: 0) by Anonymous Coward on Monday March 22 2021, @03:31PM (1 child)
There are many alien languages in the universe.
(Score: 2) by DannyB on Monday March 22 2021, @04:19PM
UTF-640 should be enough for Sci Fi alien languages.
I thought 640 bits per character ought to be enough for anybody?
Are you suggesting it isn't? You may have just identified a problem.
We need to begin development on an infinitely expandable variable length character set that can be mandated for universal use.
The server will be down for replacement of vacuum tubes, belts, worn parts and lubrication of gears and bearings.