Stories
Slash Boxes
Comments

SoylentNews is people

SoylentNews is powered by your submissions, so send in your scoop. Only 16 submissions in the queue.
posted by janrinok on Friday August 26 2016, @08:13AM   Printer-friendly
from the not-so-bright-scientists dept.

Scientific literature often mis-names genes and boffins say Microsoft Excel is partly to blame.

"Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software," In a paper titled write Mark Ziemann, Yotam Eren and Assam El-OstaEmai of the Baker IDI Heart & Diabetes Institute in Australia in a paper titled Gene name errors are widespread in the scientific literature .

Among the things Excel does to gene names include changing "SEPT2", the name of a gene thought to have a role in proper formation of cell structure, to the date "2-Sep". The "MARCH1" gene becomes "1-Mar".

The paper notes that this is a problem that's been know for over a decade, but one which remains pervasive. The trio studied 35,175 Excel tables attached to 3,597 scientific papers published between 2005 and 2015 and found errors in "987 supplementary files from 704 published articles. Of the selected journals, the proportion of published articles with Excel files containing gene lists that are affected by gene name errors is 19.6 per cent."

It's not hard to change the default format of Excel cell to avoid changes of this sort: you can get it done in a click or three. Much of the problem in these papers is therefore between scientists' ears, rather than within Excel itself. The paper's silent on why genetic scientists, who The Register will assume are not short of intelligence, have been making Excel errors for years.

This article focuses on errors resulting from auto-correction of gene names; certainly other subject areas have suffered from similarly 'helpful' software. What hilarious and/or cringe-worthy 'corrections' have YOU seen?


Original Submission

 
This discussion has been archived. No new comments can be posted.
Display Options Threshold/Breakthrough Mark All as Read Mark All as Unread
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Insightful) by linuxrocks123 on Friday August 26 2016, @08:34AM

    by linuxrocks123 (2557) on Friday August 26 2016, @08:34AM (#393401) Journal

    And don't use LibreOffice Calc, either. They should have at least one person on their team that can code. Have that person write R, Python, FORTRAN, whatever to process the data and spit it out as a CSV. Then open that CSV in Excel for display only, or, if you have a really enlightened workflow, translate it directly into LaTeX tables.

    Spreadsheets should never have been used as anything but toys. For anything serious, you need either code to generate a CSV from your data source(s), or use a proper database, depending on what you're abusing spreadsheets for.

    Starting Score:    1  point
    Moderation   0  
       Insightful=1, Overrated=1, Total=2
    Extra 'Insightful' Modifier   0  
    Karma-Bonus Modifier   +1  

    Total Score:   2  
  • (Score: 4, Insightful) by shrewdsheep on Friday August 26 2016, @08:46AM

    by shrewdsheep (5215) Subscriber Badge on Friday August 26 2016, @08:46AM (#393404)

    Yes, spreadsheets totally suck in the scientific workflow. Nevertheless, the typical lab-oriented researcher has no other way of handling his or her data. However, your suggestion to work with CSVs sucks even more in my experience. Not only, does it not prevent the automatic conversions, you also start to depend on locale settings (excel uses ; as separator in a European locale as , seperates integer and fractional part of a number). What works best is to export directly to xls-files (coming from R) in my experience. The resulting files do have the right column types and will be guaranteed to work in every OS. To assure overall sanity, I also produce pdf-output (yes, by going through latex).

    • (Score: 2) by martyb on Friday August 26 2016, @11:29AM

      by martyb (76) on Friday August 26 2016, @11:29AM (#393432) Journal

      Comma-Separated Value (CSV) files have other heartaches for those who are not wary.

      In my experience, there is no single, definitive format for CSV files — each vendor seems to have their own take on things. This most frequently tortured me when dealing with how different products 'escaped' certain values. From memory: double quote ("), comma (,), and semi-colon (;).

      Oh, when you *really* mean that you want a double quote as part of the value in a cell, just enter it twice... or do you precede it with an escape character like slash (/)?

      Wikipedia explains CSV [wikipedia.org] it better than I can this early in the morning:

      In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

      The CSV file format is not standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line-breaks. CSV implementations may not handle such field data, or they may use quotation marks to surround the field. Quotation does not solve everything: some fields may need embedded quotation marks, so a CSV implementation may include escape characters or escape sequences. In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension, despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character.

      In retrospect, what was needed was an initial record in the file which defined the separator and quoting characters. Analogous to a shell script which starts with a bang-path to differentiate shell, bourne shell, bash, perl, etc. programs.

      At this stage, IMHO. possibly the best one could hope for is a standard test file which exhaustively expresses these possible misinterpretations and makes clear what is done in each case.

      --
      Wit is intellect, dancing. I'm too old to act my age. Life is too important to take myself seriously.
      • (Score: 2) by digitalaudiorock on Friday August 26 2016, @01:53PM

        by digitalaudiorock (688) on Friday August 26 2016, @01:53PM (#393472) Journal

        Comma-Separated Value (CSV) files have other heartaches for those who are not wary.

        Absolutely. Tab separated values without quotes are 1000 times better. I've never understood why a format like CSV, which depends on separators that are so likely to appear in data became so common.

        • (Score: 3, Insightful) by bzipitidoo on Friday August 26 2016, @03:19PM

          by bzipitidoo (4388) on Friday August 26 2016, @03:19PM (#393521) Journal

          > separators that are so likely to appear in data

          That's always the problem. Every time a new way to separate information was created, it became embedded, already used so that another method of separation had to be created to avoid confusion. Sometimes that leads to monstrosities such as HTML. Originally, writing did not have punctuation-- no commas, periods, semicolons and colons, and not even spaces. Yes, the space is considered punctuation.

          Heck, much ancient writing didn't even have vowels. Vowels are formally part of the alphabet, have been for centuries, but punctuation still is not, and children are not taught a list of punctuation symbols in the same manner as taught the alphabet and numerical digits. English speakers all know that the English alphabet has 26 letters, but can anyone say how many standard punctuation symbols there are off the top of their heads? Most people would overlook the space, despite that being by far the largest key on the typical keyboard. Or, more like wouldn't think of it as punctuation. Punctuation is just quietly slipped in as something of an afterthought while teaching writing. Much early writing uses a mid level dot ยท to separate most words but not all, as it was only a tool to reduce ambiguity and fairly often there was no ambiguity and so no need for a word separator.

          We have an ongoing debate between fixed and variable width fonts. I have yet to see a programming language that is written in a proportional font. Think what a mess that would make of Python especially. For separation schemes that depend on spacing, placing symbols on a grid is essential. Can get away with a proportional font for CSV. But for more sophisticated schemes for separating data elements, it sure is nice to be able to use spacing.

          • (Score: 2) by digitalaudiorock on Friday August 26 2016, @05:29PM

            by digitalaudiorock (688) on Friday August 26 2016, @05:29PM (#393587) Journal

            We have an ongoing debate between fixed and variable width fonts. I have yet to see a programming language that is written in a proportional font. Think what a mess that would make of Python especially. For separation schemes that depend on spacing, placing symbols on a grid is essential. Can get away with a proportional font for CSV. But for more sophisticated schemes for separating data elements, it sure is nice to be able to use spacing.

            Wow...I'm not sure I've ever read anything that's confused me more. A delimited text format is intended for data transport purposes and simply must be parsable by whatever uses it. What on earth do fonts have to do with any of that in any way?

            • (Score: 0) by Anonymous Coward on Friday August 26 2016, @08:52PM

              by Anonymous Coward on Friday August 26 2016, @08:52PM (#393679)

              I have no clue what fonts have to do with it but I was waiting for the GP to complain that RPG columns don't line up with variable width fonts. Yes, RPG - I went there because I'm ooooold school.

      • (Score: 1) by nitehawk214 on Friday August 26 2016, @02:31PM

        by nitehawk214 (1304) on Friday August 26 2016, @02:31PM (#393496)

        In my experience, there is no single, definitive format for CSV files.

        My current project is importing CSV with the following format.
        1. Blindly place double quotes around every string, whether it is needed or not.
        2. Do not escape double quotes inside the strings, screwing most parsers over
        3. Place random carriage returns in the quoted strings. Failing anything with extra double quotes.

        Of course the data inside the CSV is not well structured at all, anyhow.

        Opencsv [mvnrepository.com] does a pretty decent job on it, but a lot of lines get rejected.

        Fuck CSV.

        --
        "Don't you ever miss the days when you used to be nostalgic?" -Loiosh
      • (Score: 5, Informative) by jdccdevel on Friday August 26 2016, @05:06PM

        by jdccdevel (1329) on Friday August 26 2016, @05:06PM (#393572) Journal

        What bothers me is that ASCII even has characters for exactly this purpose! If people had just used the characters specified by the standards properly in their file format, we could have embedded any kind of readable text in a record (newlines, quotes, tabs, commas, whatever), without having to resort to escape sequences at all!

        From the ASCII table:
        0x1D => Group Separator (to indicate the end of a row)
        0x1E => Record Separator (to separate columns)

        Seriously, ASCII has been around for How Long? Why the hell are we always re-inventing the wheel, and with something that doesn't work as well?!

        How many man hours would have been saved if someone had put just a little more thought into their file format?

        </rant>

        Sorry about that, the article seems to have struck a nerve.

        • (Score: 0) by Anonymous Coward on Friday August 26 2016, @08:58PM

          by Anonymous Coward on Friday August 26 2016, @08:58PM (#393681)

          I use 1C (FS), 1D (GS), 1E (RS) and 1F (US) in my data all the time, especially in Ajax data transmissions. Like you said, that's why they're there.

    • (Score: 2) by linuxrocks123 on Friday August 26 2016, @05:20PM

      by linuxrocks123 (2557) on Friday August 26 2016, @05:20PM (#393581) Journal

      Choose whatever CSV format you want, and open the CSV file in LibreOffice to avoid automatic conversions if you must open the CSV in a spreadsheet to export for display. As long as your own scripts are consistent about the separators you use etc., you'll be fine.

      The Python CSV library has never given me trouble.

  • (Score: 3, Insightful) by takyon on Friday August 26 2016, @08:46AM

    by takyon (881) <takyonNO@SPAMsoylentnews.org> on Friday August 26 2016, @08:46AM (#393405) Journal

    They should have at least one person on their team that can code.

    That sounds like something that will require planning and cost more money. Why pay for expertise when you can roll the dice and be part of the 80%?

    --
    [SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
    • (Score: 0) by Anonymous Coward on Friday August 26 2016, @12:08PM

      by Anonymous Coward on Friday August 26 2016, @12:08PM (#393442)

      yeah. if you are familiar with the amount of funding going to researchers, you will not blame them for not paying programmers.
      furthermore, there are rules about how research grants are supposed to be used... for a whole bunch of categories of funds, you are not allowed to use them for paying anyone but a scientist.
      when i was at a big university in the US, the IT support could only be payed by the university, hence they were ridiculously overworked, and we couldn't get things done.
      to expect those people to also babysit all the scientists would have been kind of absurd (and they couldn't have done it anyway).

  • (Score: 0) by Anonymous Coward on Friday August 26 2016, @09:42AM

    by Anonymous Coward on Friday August 26 2016, @09:42AM (#393413)

    Spreadsheets should never have been used as anything but toys.

    They work fine for recording points in courses. I wouldn't consider that a toy application.

    However I agree that in science work, they have no place.

    • (Score: 0) by Anonymous Coward on Friday August 26 2016, @03:23PM

      by Anonymous Coward on Friday August 26 2016, @03:23PM (#393525)

      Spreadsheets are very useful for manually entering tables of data, and quickly displaying tables of data with colors and such, especially in situations where color-coding cells based upon their values provides valuable information. That stuff is quick to do, but is practically limited to a small amount of data that is easy to ingest by eye. However, doing anything more than just calculating sums, means, standard deviations, etc., it can be a very painful tool to use.

      I cringe when I see Excel plots, particularly when the author hasn't taken the time to change the very ugly default layout. I've seen refereed papers that have ugly default Excel plots where the plot was reduced in size, but the axes labels stay the same resulting in a plot that is predominantly axis and labels at the expense of the plot region.