Unicode: on Building the One Character Set to Rule Them All

posted by Fnord666 on Sunday March 21 2021, @12:05AM

from the who-remembers-mojibake? dept.

upstart writes in with an IRC submission:

Unicode: On Building The One Character Set To Rule Them All:

Most readers will have at least some passing familiarity with the terms 'Unicode' and 'UTF-8', but what is really behind them? At their core they refer to character encoding schemes, also known as character sets. This is a concept which dates back well before the era of electronic computers, to the dawn of the optical telegraph and its predecessors. As far back as the 18th century there was a need to transmit information rapidly across large distances, which was accomplished using so-called telegraph codes. These encoded information using optical, electrical and other means.
During the hundreds of years since the invention of the first telegraph code, there was no real effort to establish international standardization of such encoding schemes, with even the first decades of the era of teleprinters and home computers bringing little change there. Even as EBCDIC (IBM's 8-bit character encoding) and finally ASCII made some headway, the need to encode a growing collection of different characters without having to spend ridiculous amounts of storage on this was held back by a lack of elegant solutions.
Development of Unicode began during the late 1980s, when the increasing exchange of digital information across the world made the need for a single encoding system more urgent than before. These days Unicode allows us to use a single encoding scheme not only for everything from basic English text to Traditional Chinese, Vietnamese, and even Mayan, but also for small pictographs called 'emoji', from Japanese 'e' (絵, 'picture') and 'moji' (文字, 'character'), literally 'picture character'.

[...] The amazing thing is that in only 16 bits, Unicode managed to cover not only all of the Western writing systems, but also many Chinese characters and a variety of specialized symbols, such as those used in mathematics. With 16 bits allowing for 2¹⁶ = 65,536 code points, the 7,129 characters of Unicode 1.0 fit easily, but by the time Unicode 3.1 rolled around in 2001, Unicode contained no fewer than 94,140 characters across 41 scripts.
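The arithmetic above can be checked with a short Python sketch. The helper names (`fits_in_16_bits`, `utf16_code_units`) and the sample characters are illustrative choices, not anything from the article; the point is only that a 16-bit unit has 65,536 possible values, so a repertoire of 94,140 characters cannot fit, and characters past that limit take two 16-bit units in UTF-16.

```python
# Number of distinct values a single 16-bit code unit can hold.
SIXTEEN_BIT_LIMIT = 2 ** 16  # 65,536

# Character counts quoted in the article.
UNICODE_1_0_CHARS = 7_129
UNICODE_3_1_CHARS = 94_140


def fits_in_16_bits(ch: str) -> bool:
    """Return True if the character's code point is below 2**16."""
    return ord(ch) < SIXTEEN_BIT_LIMIT


def utf16_code_units(ch: str) -> int:
    """Count the 16-bit units needed to store one character in UTF-16."""
    # "utf-16-be" is used so that no byte-order mark is added to the output.
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    print("Unicode 1.0 fits in 16 bits:", UNICODE_1_0_CHARS <= SIXTEEN_BIT_LIMIT)
    print("Unicode 3.1 fits in 16 bits:", UNICODE_3_1_CHARS <= SIXTEEN_BIT_LIMIT)
    for ch in ["A", "é", "絵", "😀"]:
        print(f"U+{ord(ch):04X}", fits_in_16_bits(ch), utf16_code_units(ch))
```

Characters whose code point is at or above 2¹⁶, such as the emoji in the sample list, are the ones that report two code units.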
Currently, in version 13, Unicode contains a grand total of 143,859 characters, which does not include control characters. While Unicode was originally envisioned to encode only writing systems in current use, by the time Unicode 2.0 was released in 1996 it was realized that this goal would have to change, to allow even rare and historic characters to be encoded. To accomplish this without necessarily requiring every character to be encoded in 32 bits, Unicode not only encodes characters directly, but also lets them be built from components: a base character followed by combining marks, which together form a single grapheme.
The concept is somewhat similar to vector drawings, where one doesn't specify every single pixel, but instead describes the elements which make up the drawing. The Unicode Transformation Format 8 (UTF-8) encoding, as originally designed, could address 2³¹ code points (today it is restricted to the range up to U+10FFFF), with each character taking between one and four bytes: one for ASCII, two for scripts such as Greek, Cyrillic, Hebrew and Arabic, and three for most Chinese, Japanese and Korean characters.
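As a rough illustration of the two ideas in the preceding paragraphs, the Python sketch below prints how many bytes a few sample characters occupy in UTF-8, and shows a character assembled from a base letter plus a combining mark. The sample characters and the helper name `utf8_length` are assumptions made for this example only; `unicodedata.normalize` is the standard-library call for converting between the composed and decomposed forms.

```python
import unicodedata


def utf8_length(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per UTF-8 length: ASCII, Latin-1 letter, CJK ideograph, emoji.
SAMPLES = ["A", "\u00e9", "絵", "😀"]

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme.
DECOMPOSED = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
COMPOSED = unicodedata.normalize("NFC", DECOMPOSED)

if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")
    print("decomposed code points:", len(DECOMPOSED))
    print("composed code points:", len(COMPOSED))
    print("same after normalization:", COMPOSED == "\u00e9")
```

Both spellings of the accented letter render identically, which is why text comparison code usually normalizes strings first.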
[...] For those of us who enjoyed switching between ISO 8859 encodings in our email clients and web browsers in order to get something approaching the original text representation, consistent Unicode support came as a blessing. I can imagine a similar feeling among those who remember when 7-bit ASCII (or EBCDIC) was all one got, or enjoyed receiving digital documents from a European or US office, only to suffer through character set confusion.
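For readers who never had to deal with it, the character set confusion described above is easy to reproduce. The following Python sketch (the sample word and variable names are purely illustrative) encodes a string as UTF-8 and then decodes the bytes as ISO 8859-1, which is the kind of mismatch that produces mojibake; reversing the two steps recovers the original text.

```python
# Text containing one non-ASCII character.
original = "café"

# The sender stores the text as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# The receiver wrongly assumes ISO 8859-1 (Latin-1), where every byte
# maps to exactly one character, so the two-byte 'é' becomes two characters.
garbled = raw_bytes.decode("iso-8859-1")

# Undoing the wrong decode and applying the right one repairs the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

if __name__ == "__main__":
    print("original:", original)
    print("garbled: ", garbled)
    print("repaired:", repaired)
```

The round trip only works here because ISO 8859-1 assigns a character to every byte value; with other legacy encodings some bytes may be lost or rejected.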
Even if Unicode isn't without its issues, it's hard not to look back and feel that at the very least it's a decent improvement on what came before. Here's to another thirty years of Unicode.

Wikipedia entries for Unicode and UTF-8.

Original Submission

