
SoylentNews is people

posted by n1 on Tuesday September 23 2014, @03:32AM
from the blame-game dept.

Poor encoding by Microsoft blamed for problems in a UK initiative to improve data transparency.

"When you export from popular spreadsheet applications you don't get control over encoding and it usually chooses a bad one," she said. "It usually won't be UTF-8. It will usually be something like Windows 1252."

Windows 1252 was an old, proprietary Microsoft encoding. The result, said Tennison, was the data contained characters incomprehensible to other people and programs. Their systems - unless they were using Microsoft Excel on a Microsoft Windows computer - interpreted the incomprehensible characters as "garbage".
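
A minimal sketch of the failure described above, assuming a made-up name containing non-ASCII letters. It shows what happens when the same text is written in one of the two encodings and read back as the other; it is only an illustration, not a reconstruction of the actual data in the UK initiative.

```python
# Sketch: the same (made-up) name exported as Windows-1252 and as UTF-8.
name = "Zoë Müller"

cp1252_bytes = name.encode("cp1252")
utf8_bytes = name.encode("utf-8")

# Reading UTF-8 bytes as Windows-1252 "succeeds" but produces garbage
# characters, so the result no longer compares equal to the real name.
mojibake = utf8_bytes.decode("cp1252")

# Reading Windows-1252 bytes as strict UTF-8 fails outright.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print(repr(mojibake))
print(mojibake == name)
print(strict_utf8_ok)
```

Either way, a join on the name column between the two files would miss this record unless both sides are decoded with the encoding they were actually written in.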

"It can cause problems matching stuff up," she said. "If you have the name correct in some data and not in other data then you can't match those two names together. And therefore you can't put the data together accurately."

Does anyone have any interesting character encoding stories?

  • (Score: 4, Insightful) by EvilJim on Tuesday September 23 2014, @03:37AM

    by EvilJim (2501) on Tuesday September 23 2014, @03:37AM (#97021) Journal

    but I use OpenOffice to repair corrupt MS Office documents. MS Office just fails to open the document at all, whereas OpenOffice makes its best guess and usually works just fine. I can't stand the crap pile that is MS Office; we're running 2007, which still has bugs that were reported in 2003 and have never been fixed. Why are we paying volume license fees for them to just give up and bring out a new version that we have to purchase again? If I could sue I would.

    • (Score: 4, Insightful) by Grishnakh on Tuesday September 23 2014, @05:00AM

      by Grishnakh (2831) on Tuesday September 23 2014, @05:00AM (#97032)

      You can sue, but you'll lose. When you purchased that MS crapware, you agreed to a EULA which holds Microsoft not responsible for any defects in the software. The software doesn't even have to work! To purchase such software and agree to this EULA, you'd have to be a complete idiot, but that's how people are.

      The only sane answer is to simply not use MS crapware.

      • (Score: 2) by EvilJim on Tuesday September 23 2014, @05:05AM

        by EvilJim (2501) on Tuesday September 23 2014, @05:05AM (#97034) Journal

        Yeah, there's no way a single organisation could battle down MS's huge heap of lawyers, not even with a battle axe.

        • (Score: 1, Funny) by Anonymous Coward on Tuesday September 23 2014, @09:31PM

          by Anonymous Coward on Tuesday September 23 2014, @09:31PM (#97346)

          not even with a battle axe.

          [Grabs hand crafted battle axe purchased at ren fair] - Challenge accepted!

        • (Score: 2) by Grishnakh on Sunday September 28 2014, @04:04AM

          by Grishnakh (2831) on Sunday September 28 2014, @04:04AM (#99065)

          You know, I'd kinda like to see a Medieval warrior hacking away at a bunch of lawyers with a battle axe.

          • (Score: 2) by EvilJim on Sunday September 28 2014, @06:05AM

            by EvilJim (2501) on Sunday September 28 2014, @06:05AM (#99086) Journal

            It would be very impressive, but somehow I don't see it happening.

      • (Score: 3, Insightful) by Common Joe on Tuesday September 23 2014, @05:41AM

        by Common Joe (33) <common.joe.0101NO@SPAMgmail.com> on Tuesday September 23 2014, @05:41AM (#97038) Journal

        To purchase such software and agree to this EULA, you'd have to be a complete idiot, but that's how people are. The only sane answer is to simply not use MS crapware.

        To say that MS is crapware and the EULA doesn't hold MS responsible is spot on, but you're making me cringe in the last 4 out of 5 sentences you wrote.

        I use LibreOffice a lot more than Word or Excel, but LibreOffice has its set of problems too. The EULA for LibreOffice doesn't hold them responsible either so that is a bad argument to convince people to stop using Microsoft.

        Next, you are calling me and my wife idiots. To work with business, we need the Microsoft products. My wife does translation and the only product that her customers use requires Microsoft Word. If she wants to put bread on the table, she has no choice but to buy Microsoft software. Quite frankly, there are many other people in the same boat as she is.

        I'm with what you say in spirit, but the way you (and many others) phrase this kind of stuff in absolutes will drive people away from ever trying LibreOffice because of how insulting you sound.

        I'm not taking the idiot comment personally. I just wanted to mention this as food for thought.

        • (Score: 2) by EvilJim on Tuesday September 23 2014, @06:01AM

          by EvilJim (2501) on Tuesday September 23 2014, @06:01AM (#97042) Journal

          Agree there, I didn't take it personally either. In business you have to use what everyone else uses... to a point. Does anyone know how well compatibility between MS and other office suites works these days? Are there still visual differences when rendering the same Word-formatted doc in MS and other suites?

          • (Score: 2) by Nerdfest on Tuesday September 23 2014, @10:10AM

            by Nerdfest (80) on Tuesday September 23 2014, @10:10AM (#97088)

            I've been using only LibreOffice in an MS shop for about the last three years and have yet to run into a problem. I write fairly complex documents, but haven't really tried their change tracking features.

          • (Score: 2) by Common Joe on Tuesday September 23 2014, @02:37PM

            by Common Joe (33) <common.joe.0101NO@SPAMgmail.com> on Tuesday September 23 2014, @02:37PM (#97171) Journal

            Ha... there are compatibility differences between different versions of Microsoft Word. My experience has been that if you stick to the basics, MS Word / Excel are pretty much compatible with LibreOffice Writer / Calc. It's when you start to get more fancy that the problems arise. There's a neat feature in Excel that changes a cell from a number into a bar graph. To my knowledge, that doesn't exist in LibreOffice. I think some of the features in charts in Excel don't exist in Calc either. Nerdfest's concerns about tracking features in Word / Writer are correct. My personal experience is that the two office suites are not fully compatible. If you stick to numbers and formulas in cells or basic formatting in a word processing doc, you'll probably be ok. Don't even think about macros.

            I'm not knocking LibreOffice. Sometimes, people put too much effort to make things "look good" when they are just really making it unnecessarily complicated. I specifically use LibreOffice for all personal files because I want the open source compatibility. There's a word processing document that I've been using for almost two decades now. (Let's just call it a very long term project and leave it at that.) It started off in Word Perfect 5.2 (for Windows 3.11), then I updated it to OpenOffice then LibreOffice. I feel pretty safe that I'll be able to work with it for a long time. I got burned when I used PFS Write for DOS over two decades ago. I have files that I'd love to "reanimate", but nothing today will open those files and the only thing I have that is readable are the paper print outs. I learned a valuable lesson on those files.

        • (Score: 2) by Grishnakh on Sunday September 28 2014, @04:01AM

          by Grishnakh (2831) on Sunday September 28 2014, @04:01AM (#99064)

          I use LibreOffice a lot more than Word or Excel, but LibreOffice has its set of problems too. The EULA for LibreOffice doesn't hold them responsible either so that is a bad argument to convince people to stop using Microsoft.

          Huh? Why is that? Of course the EULA for LibreOffice doesn't hold them responsible: you didn't pay anything for it, did you? Maybe they should change it to say if you have a problem, they'll pay you 1 million times what you paid the LO foundation for it. 1e06 x 0 = 0.

          The problem with MS crapware is that companies pay an outright fortune for it, and not only is there zero recourse if there's a problem, they're locked in by the secret file formats (yes, you can use LO/OO, but it's not 100% and the only reason you can use them at all is because the MS formats were reverse-engineered, not because they were open specs). LO natively uses the Open Document Format, which is a Free and open specification (unlike MS's OOXML which is only partially open, and doesn't fully document the standard), so you're not locked into LO, you can switch to anything else that uses ODF, and if you really need to, you can even examine the source code of LO, or modify it if you wish. (For Joe Blow, that probably isn't that helpful, but if you're a 100k-employee corporation, that could be handy for adding customizations. At that scale, the amount you save by not purchasing MS licenses will easily pay for a team of developers to make changes to F/OSS software that you need.)

          As for "idiots", maybe I didn't write that that well, obviously there's corner cases like yours. But I leave you with a quote from a Demotivational poster: "None of us is as stupid as all of us." Collectively, we're a bunch of idiots because we do use MS software.

  • (Score: 2) by frojack on Tuesday September 23 2014, @03:43AM

    by frojack (1554) on Tuesday September 23 2014, @03:43AM (#97022) Journal

    I import stuff from Excel all the time, usually into LibreOffice, sometimes into OpenOffice, and the worst I've had go wrong was that some bizarro (and, it turns out, totally unnecessary) VB scripting didn't translate.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 2) by EvilJim on Tuesday September 23 2014, @03:53AM

      by EvilJim (2501) on Tuesday September 23 2014, @03:53AM (#97024) Journal

      I've had dropdown menus fail to appear in OpenOffice, but the most recent time I built a spreadsheet for someone to use on an Android tablet it all transferred fine. As for the VB script that I use for work, I wouldn't want to try transferring that between office suites; I imagine it would be a fustercluck.

    • (Score: 2) by BasilBrush on Tuesday September 23 2014, @09:15PM

      by BasilBrush (3994) on Tuesday September 23 2014, @09:15PM (#97341)

      That says no more than that LibreOffice has done a fair amount of work to convert it well. It could also mean that you don't use non-ASCII characters in your spreadsheets much. But this story isn't about LibreOffice, but about general data interchange woes, especially when using Windows software that uses old code pages rather than modern Unicode.

      The fact is that when you are creating software that deals with non-ASCII data, it is non-trivial. Even two words, both represented in UTF-8, may be encoded differently due to different options of combining marks. Data from different sources may not match keys or searches even when it looks like they should.

      --
      Hurrah! Quoting works now!
  • (Score: 3, Interesting) by jackb_guppy on Tuesday September 23 2014, @06:01AM

    by jackb_guppy (3560) on Tuesday September 23 2014, @06:01AM (#97043)

    I use EBCDIC a lot; it's what the main machine we do processing on uses. I had one system where each major interface (terminal, web, and ftp) used a different CCSID to translate between EBCDIC and ASCII. One used the oldest version, which was the keyboard map between an ASCII terminal and a 5250 terminal (normally ! and | are swapped, plus a few others). That made data displayed in a web screen (used for reports) not match what was entered by hand.

    The only solution was to change the CCSID on all interfaces and, when a bad document was found, to open it and manually fix the data.
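
A minimal sketch of this kind of mismatch, assuming the two interfaces used CCSID 37 and CCSID 500. The comment does not say which CCSIDs were actually involved; these two are chosen only because Python ships codecs for them (`cp037` and `cp500`) and they disagree on a handful of punctuation characters.

```python
# Sketch: text stored via one EBCDIC code page, read back via another.
# cp037 / cp500 are assumptions; the original system's CCSIDs are not stated.
text = "IF A | B THEN STOP!"

stored = text.encode("cp037")    # written through the "terminal" interface
right = stored.decode("cp037")   # read back with the same code page
wrong = stored.decode("cp500")   # read back with a different code page

# Characters that come back changed when the code pages disagree.
changed = {a for a, b in zip(text, wrong) if a != b}

print(right)
print(wrong)
print(sorted(changed))
```

Letters and digits survive because the two code pages agree on them, which is why this class of bug tends to show up only in punctuation.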

    • (Score: 3, Interesting) by PizzaRollPlinkett on Tuesday September 23 2014, @11:33AM

      by PizzaRollPlinkett (4512) on Tuesday September 23 2014, @11:33AM (#97111)

      EBCDIC has two pipe | characters, solid and broken. I had a terminal emulator program once with a file upload feature. My version worked great. In a later point-point release, someone else had changed the solid pipe to a broken pipe when you uploaded a file, so any program source with a | in it would not compile. Then a later point-point release fixed the bug. This kind of stuff was a time sink that kept people from doing anything productive.

      Of course, dinking around with MS Word file formats is a time sink, too, compared to something like LaTeX. Sure, LaTeX is a time sink at first, but after a while you get all the packages and stuff installed that you need, and examples of how to do everything, and it's really productive. MS Word isn't fit for anything other than a one-page letter. It's just not capable of typesetting anything longer than that with any consistency and usability.

      Speaking of LaTeX, the last thing I printed out looked really small. Had the font shrunk? What's going on? I finally figured out what had happened: in the past, every time I installed TeX Live in Linux, the paper had defaulted to US letter. The last time I did it, though, the default paper size changed to A4. Why this happened, I have no idea. Seems really bizarre to change the default paper size. I had to reinstall Linux after a disk crash earlier this year. I installed the same Linux distro and the same TeX Live, so nothing should have changed.

      Sometimes I wonder why people use computers at all.

      --
      (E-mail me if you want a pizza roll!)
      • (Score: 2) by jackb_guppy on Tuesday September 23 2014, @05:06PM

        by jackb_guppy (3560) on Tuesday September 23 2014, @05:06PM (#97244)

        Not to forget 3 blanks:
        . x40 blank/space
        . x41 NOT blank/space
        . xE0 another blank/space
        Lots of fun converting old display writer documents. Handy, to handle space and &nbsp;.

  • (Score: 5, Informative) by gringer on Tuesday September 23 2014, @06:08AM

    by gringer (962) on Tuesday September 23 2014, @06:08AM (#97045)

    You don't need to bother with file formats to have trouble matching patients up, that happens already. The following are things that I have encountered in matching people between two different medical databases. A lot can happen with just first name + last name + DOB as the unique match key, and in many cases resolving the problems requires someone with actual knowledge of the patients to fix:

    • Name spelling differences (e.g. "Sarah" vs "Sara")
    • Transcriptional errors (e.g. "Jenjifer" vs "Jennifer")
    • Pre-marriage name matched with post-marriage name
    • Different transliteration of the same name (e.g. "John" and "Hoani")
    • Transposing months and days for DOB (e.g. "12/8" vs "8/12")
    • Transposing day numbers for DOB (e.g. "12 May" vs "21 May")
    • Getting the century wrong for DOB (e.g. "13/12/2005" vs "13/12/1905")
    • Putting the visit date as DOB
    • Same name, same DOB, different person
    • Individuals not present in one database that "has everyone in it"
    • Assigning two different people the same "unique" ID

    In comparison to that, a simple file encoding issue is a very easy (and obvious) fix.
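
A rough sketch of how a few of the mismatches in the list above could at least be flagged for human review. The helper names and the 0.85 similarity threshold are hypothetical, and nothing here resolves a match automatically; as the comment says, that usually needs someone who knows the patients.

```python
from datetime import date
from difflib import SequenceMatcher


def names_similar(a, b, threshold=0.85):
    """True if two names are close enough to be a likely spelling variant."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def day_month_swapped(d1, d2):
    """True if the two dates differ only by swapping day and month."""
    return (d1 != d2 and d1.year == d2.year
            and d1.day == d2.month and d1.month == d2.day)


def review_flags(rec_a, rec_b):
    """Return reasons why two (name, dob) records might be the same person."""
    (name_a, dob_a), (name_b, dob_b) = rec_a, rec_b
    flags = []
    if name_a != name_b and names_similar(name_a, name_b):
        flags.append("similar name")
    if day_month_swapped(dob_a, dob_b):
        flags.append("day/month transposed")
    return flags


print(review_flags(("Jenjifer", date(2005, 8, 12)),
                   ("Jennifer", date(2005, 12, 8))))
```

String similarity catches "Sarah"/"Sara" and "Jenjifer"/"Jennifer" but not transliterations such as "John"/"Hoani", and it can do nothing about two different people with the same name and DOB.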

    --
    Ask me about Sequencing DNA in front of Linus Torvalds [youtube.com]
    • (Score: 3, Insightful) by tonyPick on Tuesday September 23 2014, @06:33AM

      by tonyPick (1237) on Tuesday September 23 2014, @06:33AM (#97050) Homepage Journal

      I've mentioned this before, but another point is Unicode is particularly painful in this respect. You have characters which look the same but aren't, as well as characters which should be the same but have different encodings, and character combinations which are the same as single glyphs (so U+006E = "n" followed by U+0303 = "◌̃" looks identical to U+00F1 = "ñ")

      There's a ton of rules on how to handle this, but it's not simple, and in untreated data can be a major headache...
      http://en.wikipedia.org/wiki/Unicode_equivalence [wikipedia.org]
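
The "n" plus combining tilde case can be sketched with Python's standard `unicodedata` module. The `match_key` helper and the example surname are hypothetical; normalizing to NFC and case-folding is one common way to build a comparison key, not the only one.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)
precomposed = "\u00f1"   # U+00F1, the single code point for n-with-tilde

# Same glyph on screen, but different code point sequences.
print(decomposed == precomposed)

# After normalization to the same form, the two compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)


def match_key(s):
    """Hypothetical comparison key: NFC-normalized and case-folded."""
    return unicodedata.normalize("NFC", s).casefold()


print(match_key("Mun\u0303oz") == match_key("MU\u00d1OZ"))
```

Normalization handles equivalent sequences only; characters that merely look alike (for example Latin "A" and Cyrillic "А") stay distinct.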

  • (Score: 1) by simonInOz on Tuesday September 23 2014, @07:13AM

    by simonInOz (2173) on Tuesday September 23 2014, @07:13AM (#97059)

    Talking of encoding, it'd be nice if the email SoylentNews sends to my Gmail every day was in the correct coding, so I didn't get weird characters for quotes and stuff. What the heck are they doing?

    --
    -- cats like plain crisps --
    • (Score: 3, Informative) by NCommander on Tuesday September 23 2014, @09:30AM

      by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Tuesday September 23 2014, @09:30AM (#97080) Homepage Journal

      Likely something we overlooked on overhauling the codebase to use UTF-8 internally.

      Talking about the article, Slashcode itself would count for this; the original implementation of "UTF-8" support basically took UTF-8 input, converted it to the local charset, and stored the special characters as HTML refs in the database. Then when they realized it was a bad idea, they "fixed" it by forcibly masking out the top bit of each character, which is what caused the bizarre formatting issues we all know about on the other site.
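
To illustrate why masking the top bit wrecks non-ASCII text, here is a small sketch applied to the UTF-8 bytes of a curly quote. This is not the actual Slashcode logic, only the same bit operation on an assumed input.

```python
# Sketch: clear the high bit of every byte of a UTF-8 encoded curly quote.
quote = "\u201c"                       # LEFT DOUBLE QUOTATION MARK
utf8 = quote.encode("utf-8")           # three bytes, all with the top bit set

masked = bytes(b & 0x7F for b in utf8)

print(utf8)
print(masked)
```

Every byte of a multi-byte UTF-8 sequence has its top bit set, so masking turns one quote character into three unrelated 7-bit bytes, here a letter and two control characters.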

      --
      Still always moving
      • (Score: 1) by simonInOz on Friday September 26 2014, @02:53AM

        by simonInOz (2173) on Friday September 26 2014, @02:53AM (#98474)

        So is there any chance of fixing it? It is irritating every day, every quote gets weird characters around it.

        --
        -- cats like plain crisps --
        • (Score: 2) by NCommander on Saturday September 27 2014, @12:55AM

          by NCommander (2) Subscriber Badge <michael@casadevall.pro> on Saturday September 27 2014, @12:55AM (#98776) Homepage Journal

          It should be fixed. I haven't seen any excess stupidity in recent posts related to it. Link me to any that you've seen in the last month and I'll get our crack team of ninjas debugging it.

          --
          Still always moving
  • (Score: 2, Interesting) by Anonymous Coward on Tuesday September 23 2014, @07:57AM

    by Anonymous Coward on Tuesday September 23 2014, @07:57AM (#97066)

    Both MS Outlook for Mac, and Entourage would (and probably still do) generate garbage character encodings in text/plain parts.
    I've often received email containing a text/plain and text/html part, since I use mutt, I generally read the text/plain part.

    Often there are bad characters in places, usually where the sender program mis-encoded 'smart quotes'. If one examines the
    text/html part, one can determine the intended characters, if one reads the text/plain part some of these usually become
    something like '1/2' and '1/4' codepoints. That is because the text/plain is marked as iso-8859-1, whereas it is not using such
    encoding. Moreover, interpreting it as Windows 1252 doesn't help either. These bad characters are still junk.

  • (Score: 2) by Aiwendil on Tuesday September 23 2014, @07:59AM

    by Aiwendil (531) on Tuesday September 23 2014, @07:59AM (#97067) Journal

    (On-topic: Why didn't they set up a search-system that wasn't encoding-aware, or required all submitted documents to be in a given encoding?)

    This was once a matter of course when moving across systems (even within the same OS, you just needed to cross the language/alphabet barrier) and pretty much the same stuff is required.

    However, the one thing I still encounter quite often is the handling of Swedish letters (å, ä, ö (aring, auml, ouml)) in zip-files* when going from Linux to Windows; normally I just end up using convmv(1) prior to zip:ing it all..

    * = Does anyone know of a zip-program for linux commandline that handles recursion and on-the-fly transcoding between charsets of directory and filenames? (the only thing I can find is where it changes how things are encoded in the system's charset, not specifying target charset with transcoding. main requirement is that the resulting zip should work fine in the builtin unpackers of MacOS X and Windows (XP and later))

  • (Score: 1, Offtopic) by GreatAuntAnesthesia on Tuesday September 23 2014, @08:55AM

    by GreatAuntAnesthesia (3275) on Tuesday September 23 2014, @08:55AM (#97074) Journal

    I know this is a geek site but seriously?

    > any interesting character encoding stories?

    If you were to ask me about the least interesting thing I could possibly come up with a story for, it would probably be character encoding. I think I could come up with more interesting stories about soil mechanics, or potato farming, or the "Sex in the City" films than character encoding.

    • (Score: 4, Interesting) by mojo chan on Tuesday September 23 2014, @11:14AM

      by mojo chan (266) on Tuesday September 23 2014, @11:14AM (#97100)

      I've learned a few interesting things about character encoding over the years. C# seems to have big problems with it, for example, and makes doing simple stuff like parsing a date in European format difficult. The latter wouldn't be so bad if they used ISO format, but instead they picked the extremely random US format as the default and only easily available option.

      I decided to pack some 12 bit data as two samples per three bytes once. Makes storage and transmission more efficient, but the desktop coders couldn't handle it. I ended up doing the conversion back to unpacked 16 bit words on the embedded side for them. I think they managed to split the bytes into two 16 bit numbers, but struggled to sign extend them.
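
A sketch of the packing and the sign extension step, assuming signed 12-bit samples and a big-endian nibble layout (first sample in the high bits). The comment does not give the actual byte layout, so the layout and the function names here are assumptions.

```python
def pack12(a, b):
    """Pack two signed 12-bit samples (-2048..2047) into three bytes."""
    ua, ub = a & 0xFFF, b & 0xFFF          # two's complement, 12 bits each
    return bytes([ua >> 4, ((ua & 0xF) << 4) | (ub >> 8), ub & 0xFF])


def sign_extend12(u):
    """Turn an unsigned 12-bit value back into a signed integer."""
    return u - 0x1000 if u & 0x800 else u


def unpack12(data):
    """Unpack three bytes into two signed 12-bit samples."""
    b0, b1, b2 = data
    ua = (b0 << 4) | (b1 >> 4)
    ub = ((b1 & 0xF) << 8) | b2
    return sign_extend12(ua), sign_extend12(ub)


packed = pack12(-1, 2047)
print(packed.hex())
print(unpack12(packed))
```

The sign extension is the part that is easy to miss: bit 11 is the sign bit, so any unpacked value with that bit set has to have 4096 subtracted before it is stored in a wider integer.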

      Unicode is broken, badly. Japanese, Chinese and Korean share character codes for characters that are actually different in each language. They look similar and derive from the same origin, but they are still different. You can't encode certain things in Unicode, such as metadata tags containing both Chinese and Japanese names (e.g. two artists do a duet) or basically anything that mixes languages and doesn't support its own per-character language metadata. It's a complete disaster.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
  • (Score: 0) by Anonymous Coward on Tuesday September 23 2014, @08:56AM

    by Anonymous Coward on Tuesday September 23 2014, @08:56AM (#97075)

    Or at least extract the funds to do so from their pockets. This is pure abuse of their monopoly over computers. And to believe it was accidental or happened because of lack of knowledge or skill would be incredible naivete. This is proprietary lock-in 101 stuff.

  • (Score: 3, Informative) by HyperQuantum on Tuesday September 23 2014, @10:14AM

    by HyperQuantum (2673) on Tuesday September 23 2014, @10:14AM (#97089)

    What's particularly bad about Excel is that it reads and writes CSV files in a locale-dependent way. It uses the decimal separator from the Windows locale instead of a '.', and uses the so-called list separator character as field separator instead of a comma. Which means that CSV files are not portable across systems with a different locale setting when using Excel. On my system I have the comma set as decimal separator and the semicolon as list separator character, so standard CSV files generated by programs other than Excel do not open in Excel correctly, and vice-versa.

    And then, of course, there is also the character encoding problem as described in the article. Excel uses/assumes Windows-1252 encoding while most other programs use UTF-8.
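
A minimal sketch of reading such a locale-flavoured export with Python's `csv` module, assuming semicolon field separators and decimal commas as described above. The sample data and the `parse_decimal_comma` helper are made up, and the helper deliberately ignores thousands separators.

```python
import csv
import io

# Made-up export from a locale with ';' as list separator and ',' as decimal.
excel_export = "name;price\r\nWidget;1,50\r\nGadget;12,25\r\n"


def parse_decimal_comma(text):
    """Hypothetical helper: parse '1,50' as 1.5 (no thousands separators)."""
    return float(text.replace(",", "."))


# A reader that assumes commas splits the number instead of the fields.
naive_rows = list(csv.reader(io.StringIO(excel_export)))

# Telling the reader about the real delimiter fixes the field split.
rows = list(csv.DictReader(io.StringIO(excel_export), delimiter=";"))
prices = {row["name"]: parse_decimal_comma(row["price"]) for row in rows}

print(naive_rows[1])
print(prices)
```

The delimiter and decimal convention have to be known or agreed out of band, since nothing in the file itself declares them.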

    • (Score: 1, Informative) by Anonymous Coward on Tuesday September 23 2014, @01:02PM

      by Anonymous Coward on Tuesday September 23 2014, @01:02PM (#97136)

      That's why I use tab-separated values instead. I've never come across a situation where I *needed* tabs to be preserved in the source data. That's not to say they couldn't exist, but commas, periods, and semicolons are used too frequently in various number, currency, and data formats to be useful as field separators in data. Yeah, columns don't always line up if you open the file in a text editor or something, but they don't when separated by these other separator characters either.

      • (Score: 3, Informative) by velex on Tuesday September 23 2014, @01:52PM

        by velex (2068) on Tuesday September 23 2014, @01:52PM (#97148) Journal

        Speaking of Excel and tab-delimited, my annoyance is that it puts quotes around values that contain commas. I believe it also escapes double quotes by putting another (""), but that doesn't come up as often as commas. It's easy enough to fix/work around, but annoying.
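
The quoting behaviour described here (quotes around some fields, embedded quotes doubled) is the usual CSV convention, so a CSV-aware parser with a tab delimiter undoes it, while a plain split on tabs leaves the quotes in the data. A small sketch with a made-up line:

```python
import csv
import io

# Made-up tab-delimited line with a quoted comma field and doubled quotes.
line = '"Smith, John"\t"said ""hi"""\t42\r\n'

naive = line.rstrip("\r\n").split("\t")
parsed = next(csv.reader(io.StringIO(line), delimiter="\t"))

print(naive)
print(parsed)
```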

    • (Score: 2, Informative) by francois.barbier on Tuesday September 23 2014, @06:22PM

      by francois.barbier (651) on Tuesday September 23 2014, @06:22PM (#97282)

      Have you ever tried opening a CSV file with the first line starting with "ID"?
      Like, it's not something I do a lot, exporting data to CSV with the first column being the primary key, right?
      Look here: http://support.microsoft.com/kb/215591 [microsoft.com]
      That's right, even if the file has a CSV extension (which is Microsoft's way of telling the content-type, remember?), it tries to open it as a SYLK file, whatever that is...
      And then it fails, lamentably.
      It could have then fallen back to CSV parsing, but NO!
      It won't open it!
      Excel is useless POS.

      • (Score: 1) by francois.barbier on Tuesday September 23 2014, @06:28PM

        by francois.barbier (651) on Tuesday September 23 2014, @06:28PM (#97288)

        Er...
        Wrong KB article, I should have used Bing, I know! http://support.microsoft.com/kb/323626 [microsoft.com]
        Also note:

        Applies to the following versions:
        - Microsoft Office Excel 2003
        - Microsoft Excel 2002
        - Microsoft Excel 2000 Standard
        - Microsoft Excel 97 Standard

        This will soon be bugging me for 20 years!
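
One way to sidestep this when generating the CSV yourself is to make sure the file never starts with an uppercase "ID". The sketch below assumes that those two leading characters are what triggers the SYLK detection; the function name is hypothetical, and lowercasing the first column name is just one possible workaround.

```python
def excel_safe_header(columns):
    """Build a CSV header line that does not begin with uppercase 'ID'.

    Assumption: Excel's SYLK detection is triggered by the file starting
    with the two characters 'I' and 'D'.
    """
    cols = list(columns)
    if cols and cols[0].startswith("ID"):
        cols[0] = cols[0].lower()
    return ",".join(cols)


print(excel_safe_header(["ID", "name", "dob"]))
print(excel_safe_header(["name", "ID"]))
```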

  • (Score: 2) by MrGuy on Tuesday September 23 2014, @12:21PM

    by MrGuy (1007) on Tuesday September 23 2014, @12:21PM (#97122)

    Does anyone have any interesting character encoding stories?

    I assure you, there is no such thing as an "interesting" character encoding story.

    • (Score: 2) by lhsi on Tuesday September 23 2014, @12:52PM

      by lhsi (711) on Tuesday September 23 2014, @12:52PM (#97134) Journal

      Should I have asked for horrendous character encoding stories instead?

  • (Score: 2) by zeigerpuppy on Tuesday September 23 2014, @02:30PM

    by zeigerpuppy (1298) on Tuesday September 23 2014, @02:30PM (#97167)

    Had an interesting time inserting Sanskrit into an English document the other day. Not too hard to get right with LaTeX.
    Also check out sharelatex, the authors generously open sourced it this year...