SoylentNews is people


SoylentNews, Unicode, UTF-8, and HTML

Posted by martyb on Friday April 24 2015, @12:08AM (#1176)
0 Comments
Code

NOTE: This is a work-in-progress; read at your own risk/confusion. It is an attempt to gather together bookmarks, tabs, and information pertaining to Unicode, UTF-8, HTML, and 'characters'.

It would seem to be a simple enough question to answer, but things are not always as they seem:

What characters should SoylentNews support?

Motivation: as many of you are aware, one of the early improvements that SoylentNews made to its base source code was to support Unicode characters. (Thanks to the heroic efforts of The Mighty Buzzard.) The underlying code only supported ASCII (American Standard Code for Information Interchange) characters. Which was just fine as far as it went. It just didn't go far enough for us...

I took on the task of testing our implementation of UTF-8 support. Little did I know what I was getting into! It has been a fascinating journey, indeed!

What is Unicode?

This is taken from What is Unicode?:

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Here is an excerpt from Wikipedia's entry for Unicode:

Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).

Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. ...

In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor.

A little more background: There are certain code points in Unicode that are of questionable value in the context of a web page; further, there are code points which are defined to be invalid! And then, just to make things even more interesting, I found a list of invalid characters in an HTML document:

Illegal characters

HTML forbids[6] the use of the characters with Universal Character Set/Unicode code points (in decimal form, preceded by x in hexadecimal form)

  • 0 to 31, except 9, 10, and 13 (C0 control characters)
  • 127 (DEL character)
  • 128 to 159 (x80 – x9F, C1 control characters)
  • 55296 to 57343 (xD800 – xDFFF, the UTF-16 surrogate halves)

The Unicode standard also forbids:

  • 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the byte order mark.
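The ranges listed above can be sketched as a small check. This is only an illustration, not code from the SoylentNews source; the names is_forbidden_code_point and find_forbidden are made up for this example, and it covers only the ranges in the list above.

```python
# Hypothetical helpers (not from the SoylentNews code base) that flag
# the code points listed above as forbidden.

ALLOWED_C0 = {9, 10, 13}  # tab, line feed, carriage return


def is_forbidden_code_point(cp):
    """Return True if cp falls in one of the ranges listed above."""
    if 0 <= cp <= 31:
        return cp not in ALLOWED_C0      # C0 control characters
    if cp == 127:
        return True                      # DEL
    if 128 <= cp <= 159:
        return True                      # C1 control characters
    if 0xD800 <= cp <= 0xDFFF:
        return True                      # UTF-16 surrogate halves
    return cp in (0xFFFE, 0xFFFF)        # non-characters


def find_forbidden(text):
    """Return (index, 'U+XXXX') pairs for each forbidden character in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_forbidden_code_point(ord(ch))
    ]
```

For example, find_forbidden("a\x00b\x7f") reports the NUL at index 1 and the DEL at index 3, while tab, line feed, and carriage return pass through.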

UTF-8: Unicode Transformation Format, 8-bit

Though there are several means by which Unicode characters can be transmitted between contexts, one of the most popular is UTF-8, which is what was chosen for use in SoylentNews.

SoylentNews:

What you see from our site mostly comes via a browser (though we also support Gopher and NNTP; you can have stories e-mailed to you; and we also have RSS/Atom feeds... wow!)

Our site currently formats web pages as HTML 4.01; here's a representative DOCTYPE:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">

At some point in the future we may want to directly support HTML5; ideally nothing should preclude or complicate that effort.

See also:

Obviously, we need not support the invalid code points. (Enumerate them here).

Unicode and UTF-8

So Unicode is a collection of mappings of code-points (numbers) to 'characters'; UTF-8 is a Unicode Transformation Format, 8-bit, used to transmit/encode Unicode code points.
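To make the distinction concrete, here is a minimal Python sketch (the helper name describe is made up for this example) that prints a character's Unicode code point next to the bytes UTF-8 uses to encode it. The sample characters are arbitrary picks that need one, two, three, and four bytes.

```python
def describe(ch):
    """Show a character's code point and its UTF-8 byte sequence."""
    cp = ord(ch)                      # the Unicode code point (a number)
    encoded = ch.encode("utf-8")      # the UTF-8 encoding (a byte string)
    hex_bytes = " ".join("%02X" % b for b in encoded)
    return "U+%04X -> %d byte(s): %s" % (cp, len(encoded), hex_bytes)


# Escapes are used so the example does not depend on the file's own encoding.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(describe(ch))

# Prints:
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): C3 A9
# U+20AC -> 3 byte(s): E2 82 AC
# U+1F600 -> 4 byte(s): F0 9F 98 80
```

The code point is the same no matter how the text is stored; only the byte sequence is specific to UTF-8.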

To be continued...

Product Review: Seagate Personal Cloud

Posted by mcgrew on Wednesday April 15 2015, @05:57PM (#1160)
4 Comments
Hardware

Around the first of the year all three working computers were just about stuffed full, so I thought of sticking a spare drive in the Linux box, when the Linux box died from a hardware problem. It's too old to spend time and money on, so its drive is going in the XP box (which is, of course, not on the network; except sneakernet). I decided to break down and buy an external hard drive. I found what I was looking for in the "Seagate Personal Cloud". And here I thought the definition of "the cloud" was someone else's server!

I ordered it the beginning of January, not noticing that it was a preorder; it wasn't released until late March. I got it right before April.

I was annoyed with its lack of documentation -- it had a tiny pamphlet full of pictures and icons and very few words. Whoever put that pamphlet together must believe the old adage "a picture is worth a thousand words". Tell me, if a picture is worth a thousand words, convey that thought in pictures. I don't think it can be done.

I did find a good manual on the internet. For what I wanted, I really didn't need a manual, but since I'm a nerd I wanted to understand everything about the thing. Before looking for a manual I plugged it all up, and Windows 7 had no problem connecting with it. It takes a few minutes to boot; it isn't really simply a drive, it must have an operating system and network software, because it looks to the W7 notebook to be another file server. Its only connections are a jack for the power cord and a network jack.

The model I got has three terabytes. I moved all the data from the two working computers (using a thumb drive to move data from XP) and the "cloud" was still empty. Streaming audio and video from it is flawless; I'm completely satisfied with it, it's a fine piece of hardware.

However, it WON'T do what it is advertised to do, which is to let you get to your data from anywhere. In order to do that, Seagate has a "software as a service" thing where you can connect to a computer from anywhere, but only the computer and its internal drives, NOT the "personal cloud". And they want ten bucks a month for it.

I downloaded the Android app, and I could see and copy files that were on my notebook to my phone, but I couldn't play music stored there on it. I uninstalled the crap. "Software as a service" is IMO evil in the first place, but to charge a monthly fee to use a piece of crap software like this is an insult. Barnum must have been right.

If you're just looking for an external hard drive, like I was, it's a good solution. If you want what they're advertising, you ain't gettin' it. The Seagate Personal Cloud's name is a lie, as is its advertising.

Irritable Duncan Syndrome

Posted by turgid on Monday March 30 2015, @09:16PM (#1120)
2 Comments
News

Irritable Duncan "Trust-me-I-know-what-I'm-doing" Syndrome reckons that, when he and the rest of the Conservative Party are re-elected in this May's General Election, he'll make £12 billion (US$17.8 billion) of welfare cuts, but he won't tell us before the election what these cuts will be. Allegedly, it's "Not relevant."

There aren't that many poor, sick, disabled and needy left un-kicked, but it's highly amusing that thousands of people in one of the world's most highly-developed countries are having to resort to food banks.

Goodness only knows how much worse it will get if the loony right UKIP get some seats. Anyone but an imbecile can see that they'd vote with the Conservatives on many issues or even form a coalition.

So hurry up and vote Tory to keep the hopeless, sneering socialists down.

God save the Queen etc.

UTF-8 Regression Testing

Posted by martyb on Sunday March 29 2015, @05:37PM (#1115)
6 Comments
Code

This is just a place to hang some UTF-8 character regression tests.

We've been spelling it wrong for over a quarter century

Posted by mcgrew on Monday March 23 2015, @11:52AM (#1099)
23 Comments
/dev/random

I'm surprised that this hasn't been addressed by the academic communities. Someone with a degree in English or linguistics or something like that should have thought of this decades ago.

This word (actually more than one word) has various spellings, and I've probably used all of them at one time or another. The word is email, or eMail, or e-mail, or some other variation. They're all wrong.

It's a contraction of "electronic mail" and as such should be spelled e'mail. The same with e'books and other e'words.

So why hasn't someone with a PhD in English pointed this out to me? I have no formal collegiate training in this field. It's a mystery to me.

Are printed books' days numbered?

Posted by mcgrew on Friday March 20 2015, @09:53PM (#1097)
6 Comments
Hardware

In his 1951 short story The Fun They Had, Isaac Asimov has a boy who finds something really weird in the attic -- a printed book. In this future, all reading was done on screens.

When e'books* like the Nook and Kindle came out, there were always women sitting outside the building on break on a nice spring day reading their Nooks and Kindles. It looked like the future to me, Asimov's story come true. I prefer printed books, but thought that it was because I'm old, and was thirty before I read anything but TV and movie credits on a screen.

And then I started writing books. My youngest daughter Patty is going to school at Cincinnati University (as a proud dad I have to add that she's Phi Beta Kappa and working full time! I'm not just proud, I'm in awe of her) and when she came home on break and I handed her a hardbound copy of Nobots she said "My dad wrote a book! And it's a REAL book!"

So somehow, even young people like Patty value printed books over e'books.

My audience is mostly nerds, since few non-nerds know of me or my writing, so I figured that the free e'book would far surpass sales of the printed books. Instead, few people are downloading the e'books. More download the PDFs, and more people buy the printed books than PDFs and e'books combined.

Most people just read the HTML online, maybe that's a testament to my m4d sk1llz at HTML (yeah, right).

Five years ago I was convinced ink was on the way out, but there's a book that was printed long before the first computer was turned on that says "the news of my death has been greatly exaggerated".

* I'll write a short story about the weird spelling shortly.

The old tightwad

Posted by Runaway1956 on Friday February 27 2015, @03:12AM (#1044)
0 Comments
/dev/random

I don't spend much money, and I seldom give any to online people. But - yeah, I'm aware that Soylent is in need of money. Then I saw it - an UNOBTAINIUM KEY CHAIN!! I'll be the first kid on my block to acquire unobtainium! I'll save my pennies and nickels, and discreetly order a few more of these over the next months - and I can then build my UNOBTAINIUM BOMB!

Ooooh, I haven't been this excited since I ordered that little battery powered submarine when I was six or seven years old!

Triplanetary

Posted by mcgrew on Friday February 20 2015, @12:03AM (#1027)
2 Comments
News

I've uploaded a new book to mcgrewbooks.com. Edward E. Smith was a well-known science fiction writer, called "the father of space opera", and Doctor Smith was a food engineer in his other life. The novel I've uploaded is Triplanetary, first published in serial form in Amazing Stories in 1934.

Some of the dialogue is a bit juvenile, but it would make a great movie.

An Accidental Book

Posted by mcgrew on Monday February 16 2015, @06:47PM (#1019)
2 Comments
/dev/random

I've read books accidentally, meaning to read a single chapter and winding up reading the whole thing in one sitting, but I've never started writing one accidentally.

Until now.

Tired of editing Random Scribblings and Voyage to Earth and Other Stories (Formerly titled "Mars Bars"), I thought I'd look for another science fiction novel in the public domain a little less ancient than The Time Machine to add to my web site.

I didn't find one, so decided to just make a book of public domain short stories by the 20th century greats. I found a LOT, and started assembling a book. Somehow, I wound up adding commentary and thought "Hey! New book!"

Then I discovered that one of the short stories wasn't so short -- in fact, it was a full blown novel. So for the last several days I've been formatting it to put on my web site. E.E. "Doc" Smith's Triplanetary will be posted in a few days.

I'll let you know when it's there. I guess I'm working on three books again. The collection I'm working on is tentatively titled "Yesterday's Tomorrow".

Is Microsoft Sirius?

Posted by mcgrew on Wednesday February 11 2015, @01:50PM (#1006)
1 Comment
News

I had to laugh when I ran across this article.

"Cortana's UI now expresses 18 different emotions. Siri remains detached and aloof."

Yes, Microsoft is apparently the Sirius Cybernetics Corporation with its "Genuine People Personalities". So when are they going to make that "Marvin" interface?