SoylentNews is people

Python

Posted by turgid on Saturday June 20 2015, @09:10AM (#1293)
6 Comments
Code

The time has come to learn some Python. I have a rough idea what it is, having read about it in the past and spent perhaps half an hour playing with it many years ago.

I'm so busy these days (working long hours, family life) that I find it hard to keep up with all the developments, so I'd like to ask a couple of questions, since I believe the Python language changes significantly between major releases.

At my current place of work, we have development systems running ancient versions of Red Hat with Python 2.6.x. At home I have Slackware which has Python 2.7.5 by default. There are much newer versions of Python out in the wild these days, and I'm not scared to compile from source.

So, which version of Python should I start with? In a nutshell, what are the main differences? Which parts of the language are backwards-compatible?

testing some utf-8 encoded unicode chars after rehash loaded

Posted by martyb on Wednesday June 03 2015, @02:22AM (#1268)
2 Comments
Code

This is a test story which contains a variety of 1-, 2-, and 3-octet UTF-8 chars. The purpose is to see how well the e-mailing of stories handles these characters. These chars were entered directly (actually, cut-and-paste) as opposed to being entered as decimal/hex/named character entities.

The following is taken from: "3. UTF-8 definition" in: https://tools.ietf.org/html/rfc3629 [ietf.org]

    Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
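
As an illustration of the table above, here is a minimal Python 3 sketch that packs a code point into UTF-8 octets by hand. The function name utf8_encode is made up for this example; in practice Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8, following the RFC 3629 table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate halves cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check the hand-rolled encoder against Python's own codec.
for cp in (0x40, 0x0140, 0x0800, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```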

peugen 0x40 0x7f 0x0140 0x017f 0x0700 0x073f 0x0800 0x083f | peu2utf8 > bleh.txt
cat bleh.txt

@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~�

ŀŁłŃńŅņŇňʼnŊŋŌōŎŏ
ŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮů
ŰűŲųŴŵŶŷŸŹźŻżŽžſ

܀܁܂܃܄܅܆܇܈܉܊܋܌܍܎܏
ܐܑܒܓܔܕܖܗܘܙܚܛܜܝܞܟ
ܠܡܢܣܤܥܦܧܨܩܪܫܬܭܮܯ

ࠀࠁࠂࠃࠄࠅࠆࠇࠈࠉࠊࠋࠌࠍࠎࠏ
ࠐࠑࠒࠓࠔࠕࠚ
ࠤࠥࠦࠧࠨ࠮࠯
࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾࠿

---
That was one block of 1-octet UTF-8 chars, two blocks of 2-octet UTF-8 chars, and one block of 3-octet UTF-8 chars, all submitted as 'plain old text'.
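
For anyone without the peugen and peu2utf8 tools, test blocks like these can be approximated in Python 3. This is only a sketch of the apparent effect (16 characters per line, one block per range); the helper name char_block is invented here and the real tools may behave differently.

```python
def char_block(first, last, per_line=16):
    """Return the characters first..last (inclusive) as lines of per_line chars."""
    chars = [chr(cp) for cp in range(first, last + 1)]
    return ["".join(chars[i:i + per_line])
            for i in range(0, len(chars), per_line)]


# The same four ranges as in the command line above.
ranges = [(0x40, 0x7F), (0x0140, 0x017F), (0x0700, 0x073F), (0x0800, 0x083F)]

# One blank line between blocks, then encode the whole text as UTF-8.
text = "\n\n".join("\n".join(char_block(a, b)) for a, b in ranges) + "\n"
data = text.encode("utf-8")

# Octets per character in each block: 1, 2, 2 and 3.
widths = [len(chr(a).encode("utf-8")) for a, _ in ranges]
```

Writing data to a file opened in binary mode gives something comparable to bleh.txt.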

Rehash loaded!

Posted by martyb on Monday June 01 2015, @03:32AM (#1261)
0 Comments
Code

Here it is the first of June in 2015, and our dev team has been working long and hard to get the foundation code of this site upgraded to handle newer versions of Perl and Apache. I lent a hand with QA duties and can attest that this was no small feat. Many *many* thanks to NCommander and Paulej72!

And, this acts as a test that the journal code is still working. Please let me know if you cannot see it! ;)

Yooman Rights

Posted by turgid on Monday May 11 2015, @08:23PM (#1213)
0 Comments
Topics

Yooman rights? Yooman rights! I don't need no yooman rights! I ain't foreign and I ain't done nuffink wrong.

Michael "Teachers are the Enemies of Promise" Gove is going to give us a nice British Bill of Rights instead. They did promise to stop their supporters voting for Nigel and the bigots. Nigel didn't resign after all.

And Gove is going to be working with Theresa May, who will be pushing through the Snoopers' Charter.

And the kickings are about to begin.

General Election's a Comin'

Posted by turgid on Tuesday May 05 2015, @07:58PM (#1198)
2 Comments
/dev/random

Here in Blighty, we're having a General Election on Thursday 7th May.

This time around, the Official Monster Raving Loony Party has conceded that it will probably lose votes to UKIP.

Oh dear.

SoylentNews, Unicode, UTF-8, and HTML

Posted by martyb on Friday April 24 2015, @12:08AM (#1176)
0 Comments
Code

NOTE: This is a work-in-progress; read at your own risk/confusion. It is an attempt to gather together bookmarks, tabs, and information pertaining to Unicode, UTF-8, HTML, and 'characters'.

It would seem to be a simple enough question to answer, but things are not always as they seem:

What characters should SoylentNews support?

Motivation: as many of you are aware, one of the early improvements that SoylentNews made to its base source code was to support Unicode characters. (Thanks to the heroic efforts of The Mighty Buzzard.) The underlying code only supported ASCII (American Standard Code for Information Interchange) characters. That was fine as far as it went. It just didn't go far enough for us...

I took on the task of testing our implementation of UTF-8 support. Little did I know what I was getting into! It has been a fascinating journey, indeed!

What is Unicode?

This is taken from What is Unicode?:

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Here is an excerpt from Wikipedia's entry for Unicode:

Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).

Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. ...

In text processing, Unicode takes the role of providing a unique code point—a number, not a glyph—for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font, or style) to other software, such as a web browser or word processor.

A little more background: There are certain code points in Unicode that are of questionable value in the context of a web page; further, there are code points which are defined to be invalid! And then, just to make things even more interesting, I found a list of invalid characters in an HTML document:

Illegal characters

HTML forbids[6] the use of the characters with Universal Character Set/Unicode code points (in decimal form, preceded by x in hexadecimal form)

  • 0 to 31, except 9, 10, and 13 (C0 control characters)
  • 127 (DEL character)
  • 128 to 159 (x80 – x9F, C1 control characters)
  • 55296 to 57343 (xD800 – xDFFF, the UTF-16 surrogate halves)

The Unicode standard also forbids:

  • 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the byte order mark.
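
The ranges listed above translate fairly directly into a check. The following Python 3 sketch covers only the code points named in that list; the function name is hypothetical and this is not the filter SoylentNews actually uses.

```python
def is_forbidden_code_point(cp):
    """True if cp is in one of the ranges listed above (HTML plus Unicode)."""
    if cp in (9, 10, 13):              # tab, line feed, carriage return are allowed
        return False
    return (
        0 <= cp <= 31                  # C0 control characters
        or cp == 127                   # DEL
        or 128 <= cp <= 159            # C1 control characters
        or 0xD800 <= cp <= 0xDFFF      # UTF-16 surrogate halves
        or cp in (0xFFFE, 0xFFFF)      # non-characters
    )


def strip_forbidden(text):
    """Drop every character whose code point is on the forbidden list."""
    return "".join(ch for ch in text if not is_forbidden_code_point(ord(ch)))
```

Whether offending characters should be dropped, replaced, or cause the submission to be rejected is a policy decision; strip_forbidden just shows the simplest option.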

UTF-8: Unicode Transformation Format, 8-bit

Though there are several means by which Unicode characters can be transmitted between contexts, one of the most popular is UTF-8, which is what was chosen for use in SoylentNews.

SoylentNews:

What you see from our site mostly comes via a browser (though we also support Gopher and NNTP; you can have stories e-mailed to you; and we also have RSS/Atom feeds... wow!)

Our site currently formats web pages as HTML 4.01; here's a representative DOCTYPE:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
            "http://www.w3.org/TR/html4/strict.dtd">

At some point in the future we may want to directly support HTML5; ideally nothing should preclude or complicate that effort.

See also:

Obviously, we need not support the invalid code points. (Enumerate them here).

Unicode and UTF-8

So Unicode is a collection of mappings of code-points (numbers) to 'characters'; UTF-8 is a Unicode Transformation Format, 8-bit, used to transmit/encode Unicode code points.
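
To make the distinction concrete, here is a small Python 3 sketch: the code point is a single number, while the octets that go over the wire depend on which transformation format is chosen. The character U+0140 is just an arbitrary example.

```python
ch = "\u0140"                       # LATIN SMALL LETTER L WITH MIDDLE DOT

code_point = ord(ch)                # the Unicode number, independent of any encoding
as_utf8 = ch.encode("utf-8")        # two octets in UTF-8
as_utf16 = ch.encode("utf-16-be")   # different octets in big-endian UTF-16

print(hex(code_point), as_utf8.hex(), as_utf16.hex())

# Decoding with the matching format recovers the same code point either way.
assert as_utf8.decode("utf-8") == as_utf16.decode("utf-16-be") == ch
```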

To be continued...

Irritable Duncan Syndrome

Posted by turgid on Monday March 30 2015, @09:16PM (#1120)
2 Comments
News

Irritable Duncan "Trust-me-I-know-what-I'm-doing" Syndrome reckons that, when he and the rest of the Conservative Party are re-elected in this May's General Election, he'll make £12 billion (US$17.8 billion) of welfare cuts, but he won't tell us before the election what these cuts will be. Allegedly, it's "Not relevant."

There aren't that many poor, sick, disabled and needy left un-kicked, but it's highly amusing that thousands of people in one of the world's most highly-developed countries are having to resort to food banks.

Goodness only knows how much worse it will get if the loony right UKIP get some seats. Anyone but an imbecile can see that they'd vote with the Conservatives on many issues or even form a coalition.

So hurry up and vote Tory to keep the hopeless, sneering socialists down.

God save the Queen etc.

UTF-8 Regression Testing

Posted by martyb on Sunday March 29 2015, @05:37PM (#1115)
6 Comments
Code

This is just a place to hang some UTF-8 character regression tests.

Spam Moderation Label

Posted by bryan on Sunday January 04 2015, @09:35AM (#931)
0 Comments
Slash

There have been some interesting discussions on adding more labels to the moderation system. Although opinions on “Disagree” and “Factually Incorrect” may still be varied, nearly everyone supported the addition of a “Spam” label.

As such, I've implemented a "Spam" moderation label on Pipedot. We'll see about the others as more people weigh in.

utf-8 regression testing

Posted by martyb on Sunday August 10 2014, @02:49AM (#567)
7 Comments
Slash

cf: http://dev.soylentnews.org/comments.pl?sid=1115&cid=27307

See: http://www.w3.org/2004/04/uri-rel-test.html

All of the following were entered using <a href="...">...</a>

Test 101: http://www.w%33.org/

Should be: http://www.w3.org/

Test 111: http://r%c3%a4ksm%c3%b6rg%c3%a5s.josefsson.org/

Should be: http://räksmörgås.josefsson.org/

Test 112: http://%e7%b4%8d%e8%b1%86.w3.mag.keio.ac.jp/

Should be: http://納豆.w3.mag.keio.ac.jp/

Test 121: http://www.%e3%81%bb%e3%82%93%e3%81%a8%e3%81%86%e3%81%ab%e3%81%aa%e3%81%8c%e3%81%84%e3%82%8f%e3%81%91%e3%81%ae%e3%82%8f%e3%81%8b%e3%82%89%e3%81%aa%e3%81%84%e3%81%a9%e3%82%81%e3%81%84%e3%82%93%e3%82%81%e3%81%84%e3%81%ae%e3%82%89%e3%81%b9%e3%82%8b%e3%81%be%e3%81%a0%e3%81%aa%e3%81%8c%e3%81%8f%e3%81%97%e3%81%aa%e3%81%84%e3%81%a8%e3%81%9f%e3%82%8a%e3%81%aa%e3%81%84.w3.mag.keio.ac.jp/

Should be: http://www.ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.w3.mag.keio.ac.jp/

Test 122: http://%e3%81%bb%e3%82%93%e3%81%a8%e3%81%86%e3%81%ab%e3%81%aa%e3%81%8c%e3%81%84%e3%82%8f%e3%81%91%e3%81%ae%e3%82%8f%e3%81%8b%e3%82%89%e3%81%aa%e3%81%84%e3%81%a9%e3%82%81%e3%81%84%e3%82%93%e3%82%81%e3%81%84%e3%81%ae%e3%82%89%e3%81%b9%e3%82%8b%e3%81%be%e3%81%a0%e3%81%aa%e3%81%8c%e3%81%8f%e3%81%97%e3%81%aa%e3%81%84%e3%81%a8%e3%81%9f%e3%82%8a%e3%81%aa%e3%81%84.%e3%81%bb%e3%82%93%e3%81%a8%e3%81%86%e3%81%ab%e3%81%aa%e3%81%8c%e3%81%84%e3%82%8f%e3%81%91%e3%81%ae%e3%82%8f%e3%81%8b%e3%82%89%e3%81%aa%e3%81%84%e3%81%a9%e3%82%81%e3%81%84%e3%82%93%e3%82%81%e3%81%84%e3%81%ae%e3%82%89%e3%81%b9%e3%82%8b%e3%81%be%e3%81%a0%e3%81%aa%e3%81%8c%e3%81%8f%e3%81%97%e3%81%aa%e3%81%84%e3%81%a8%e3%81%9f%e3%82%8a%e3%81%aa%e3%81%84.%e3%81%bb%e3%82%93%e3%81%a8%e3%81%86%e3%81%ab%e3%81%aa%e3%81%8c%e3%81%84%e3%82%8f%e3%81%91%e3%81%ae%e3%82%8f%e3%81%8b%e3%82%89%e3%81%aa%e3%81%84%e3%81%a9%e3%82%81%e3%81%84%e3%82%93%e3%82%81%e3%81%84%e3%81%ae%e3%82%89%e3%81%b9%e3%82%8b%e3%81%be%e3%81%a0%e3%81%aa%e3%81%8c%e3%81%8f%e3%81%97%e3%81%aa%e3%81%84%e3%81%a8%e3%81%9f%e3%82%8a%e3%81%aa%e3%81%84.w3.mag.keio.ac.jp/

Should be: http://ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.w3.mag.keio.ac.jp/
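
The expected values in tests like these can be cross-checked outside the browser. A minimal Python 3 sketch, assuming the percent-escapes are UTF-8 octets (which is what the test page uses); only the three shorter tests are shown:

```python
from urllib.parse import unquote

# Percent-encoded inputs copied from tests 101, 111 and 112 above.
tests = {
    "101": "http://www.w%33.org/",
    "111": "http://r%c3%a4ksm%c3%b6rg%c3%a5s.josefsson.org/",
    "112": "http://%e7%b4%8d%e8%b1%86.w3.mag.keio.ac.jp/",
}

# unquote() turns each %XX escape into an octet and decodes the result as UTF-8.
decoded = {name: unquote(url) for name, url in tests.items()}

for name in sorted(decoded):
    print("Test %s: %s" % (name, decoded[name]))
```

This only undoes the percent-encoding. It does not convert the host name to its IDNA (punycode) form, which a browser may also do when following the link.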

Lameness filter encountered.
Your comment violated the "postercomment" compression filter. Try less whitespace and/or less repetition.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi porta tempus nunc, vel gravida eros. Fusce ac sapien sed elit adipiscing pharetra at vel neque. Cras consequat a nisi vitae interdum. Nulla pulvinar, nisi a varius venenatis, lorem mauris posuere nulla, sit amet venenatis enim mauris quis tellus. Fusce nec ullamcorper lorem. Proin vulputate leo sapien, sollicitudin tincidunt urna eleifend vel. Etiam eleifend nulla id leo egestas interdum. Sed dignissim mauris eget tincidunt fermentum. Sed nec felis et nisl ullamcorper gravida varius in augue. Morbi ac erat quis dolor ultricies pulvinar. Vivamus sagittis viverra leo et sollicitudin. Maecenas at vulputate tortor. Donec ipsum erat, bibendum vel viverra eu, ornare vel sem.