Scientific literature often mis-names genes and boffins say Microsoft Excel is partly to blame.
"Automatic conversion of gene symbols to dates and floating-point numbers is a problematic feature of Excel software," In a paper titled write Mark Ziemann, Yotam Eren and Assam El-OstaEmai of the Baker IDI Heart & Diabetes Institute in Australia in a paper titled Gene name errors are widespread in the scientific literature .
Among the things Excel does to gene names include changing "SEPT2", the name of a gene thought to have a role in proper formation of cell structure, to the date "2-Sep". The "MARCH1" gene becomes "1-Mar".
The paper notes that this is a problem that's been know for over a decade, but one which remains pervasive. The trio studied 35,175 Excel tables attached to 3,597 scientific papers published between 2005 and 2015 and found errors in "987 supplementary files from 704 published articles. Of the selected journals, the proportion of published articles with Excel files containing gene lists that are affected by gene name errors is 19.6 per cent."
It's not hard to change the default format of Excel cell to avoid changes of this sort: you can get it done in a click or three. Much of the problem in these papers is therefore between scientists' ears, rather than within Excel itself. The paper's silent on why genetic scientists, who The Register will assume are not short of intelligence, have been making Excel errors for years.
This article focuses on errors resulting from auto-correction of gene names; certainly other subject areas have suffered from similarly 'helpful' software. What hilarious and/or cringe-worthy 'corrections' have YOU seen?
(Score: 3, Informative) by Anonymous Coward on Friday August 26 2016, @08:29AM
Excel can be stubborn: http://arstechnica.com/civis/viewtopic.php?t=282169 [arstechnica.com]
I've had similar experiences with Excel being stubborn - I've done stuff like select some cells/columns then set the format to General and yet Excel continues to treat the formulas in those cells as text.
So I wouldn't blame it all on the scientists.
(Score: 1) by nitehawk214 on Friday August 26 2016, @02:24PM
I would blame it on the scientists for using such a poor tool.
"Don't you ever miss the days when you used to be nostalgic?" -Loiosh
(Score: 2, Insightful) by linuxrocks123 on Friday August 26 2016, @08:34AM
And don't use LibreOffice Calc, either. They should have at least one person on their team that can code. Have that person write R, Python, FORTRAN, whatever to process the data and spit it out as a CSV. Then open that CSV in Excel for display only, or, if you have a really enlightened workflow, translate it directly into LaTeX tables.
Spreadsheets should never have been used as anything but toys. For anything serious, you need either code to generate a CSV from your data source(s), or use a proper database, depending on what you're abusing spreadsheets for.
(Score: 4, Insightful) by shrewdsheep on Friday August 26 2016, @08:46AM
Yes, spreadsheets totally suck in the scientific workflow. Nevertheless, the typical lab-oriented researcher has no other way of handling his or her data. However, your suggestion to work with CSVs sucks even more in my experience. Not only, does it not prevent the automatic conversions, you also start to depend on locale settings (excel uses ; as separator in a European locale as , seperates integer and fractional part of a number). What works best is to export directly to xls-files (coming from R) in my experience. The resulting files do have the right column types and will be guaranteed to work in every OS. To assure overall sanity, I also produce pdf-output (yes, by going through latex).
(Score: 2) by martyb on Friday August 26 2016, @11:29AM
Comma-Separated Value (CSV) files have other heartaches for those who are not wary.
In my experience, there is no single, definitive format for CSV files — each vendor seems to have their own take on things. This most frequently tortured me when dealing with how different products 'escaped' certain values. From memory: double quote ("), comma (,), and semi-colon (;).
Oh, when you *really* mean that you want a double quote as part of the value in a cell, just enter it twice... or do you precede it with an escape character like slash (/)?
Wikipedia explains CSV [wikipedia.org] it better than I can this early in the morning:
In retrospect, what was needed was an initial record in the file which defined the separator and quoting characters. Analogous to a shell script which starts with a bang-path to differentiate shell, bourne shell, bash, perl, etc. programs.
At this stage, IMHO. possibly the best one could hope for is a standard test file which exhaustively expresses these possible misinterpretations and makes clear what is done in each case.
Wit is intellect, dancing.
(Score: 2) by digitalaudiorock on Friday August 26 2016, @01:53PM
Absolutely. Tab separated values without quotes are 1000 times better. I've never understood why a format like CSV, which depends on separators that are so likely to appear in data became so common.
(Score: 3, Insightful) by bzipitidoo on Friday August 26 2016, @03:19PM
> separators that are so likely to appear in data
That's always the problem. Every time a new way to separate information was created, it became embedded, already used so that another method of separation had to be created to avoid confusion. Sometimes that leads to monstrosities such as HTML. Originally, writing did not have punctuation-- no commas, periods, semicolons and colons, and not even spaces. Yes, the space is considered punctuation.
Heck, much ancient writing didn't even have vowels. Vowels are formally part of the alphabet, have been for centuries, but punctuation still is not, and children are not taught a list of punctuation symbols in the same manner as taught the alphabet and numerical digits. English speakers all know that the English alphabet has 26 letters, but can anyone say how many standard punctuation symbols there are off the top of their heads? Most people would overlook the space, despite that being by far the largest key on the typical keyboard. Or, more like wouldn't think of it as punctuation. Punctuation is just quietly slipped in as something of an afterthought while teaching writing. Much early writing uses a mid level dot · to separate most words but not all, as it was only a tool to reduce ambiguity and fairly often there was no ambiguity and so no need for a word separator.
We have an ongoing debate between fixed and variable width fonts. I have yet to see a programming language that is written in a proportional font. Think what a mess that would make of Python especially. For separation schemes that depend on spacing, placing symbols on a grid is essential. Can get away with a proportional font for CSV. But for more sophisticated schemes for separating data elements, it sure is nice to be able to use spacing.
(Score: 2) by digitalaudiorock on Friday August 26 2016, @05:29PM
Wow...I'm not sure I've ever read anything that's confused me more. A delimited text format is intended for data transport purposes and simply must be parsable by whatever uses it. What on earth do fonts have to do with any of that in any way?
(Score: 0) by Anonymous Coward on Friday August 26 2016, @08:52PM
I have no clue what fonts have to do with it but I was waiting for the GP to complain that RPG columns don't line up with variable width fonts. Yes, RPG - I went there because I'm ooooold school.
(Score: 1) by nitehawk214 on Friday August 26 2016, @02:31PM
In my experience, there is no single, definitive format for CSV files.
My current project is importing CSV with the following format.
1. Blindly place double quotes around every string, whether it is needed or not.
2. Do not escape double quotes inside the strings, screwing most parsers over
3. Place random carriage returns in the quoted strings. Failing anything with extra double quotes.
Of course the data inside the CSV is not well structured at all, anyhow.
Opencsv [mvnrepository.com] does a pretty decent job on it, but a lot of lines get rejected.
Fuck CSV.
"Don't you ever miss the days when you used to be nostalgic?" -Loiosh
(Score: 5, Informative) by jdccdevel on Friday August 26 2016, @05:06PM
What bothers me is that ASCII even has characters for exactly this purpose! If people had just used the characters specified by the standards properly in their file format, we could have embedded any kind of readable text in a record (newlines, quotes, tabs, commas, whatever), without having to resort to escape sequences at all!
From the ASCII table:
0x1D => Group Separator (to indicate the end of a row)
0x1E => Record Separator (to separate columns)
Seriously, ASCII has been around for How Long? Why the hell are we always re-inventing the wheel, and with something that doesn't work as well?!
How many man hours would have been saved if someone had put just a little more thought into their file format?
</rant>
Sorry about that, the article seems to have struck a nerve.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @08:58PM
I use 1C (FS), 1D (GS), 1E (RS) and 1F (US) in my data all the time, especially in Ajax data transmissions. Like you said, that's why they're there.
(Score: 2) by linuxrocks123 on Friday August 26 2016, @05:20PM
Choose whatever CSV format you want, and open the CSV file in LibreOffice to avoid automatic conversions if you must open the CSV in a spreadsheet to export for display. As long as your own scripts are consistent about the separators you use etc., you'll be fine.
The Python CSV library has never given me trouble.
(Score: 3, Insightful) by takyon on Friday August 26 2016, @08:46AM
That sounds like something that will require planning and cost more money. Why pay for expertise when you can roll the dice and be part of the 80%?
[SIG] 10/28/2017: Soylent Upgrade v14 [soylentnews.org]
(Score: 0) by Anonymous Coward on Friday August 26 2016, @12:08PM
yeah. if you are familiar with the amount of funding going to researchers, you will not blame them for not paying programmers.
furthermore, there are rules about how research grants are supposed to be used... for a whole bunch of categories of funds, you are not allowed to use them for paying anyone but a scientist.
when i was at a big university in the US, the IT support could only be payed by the university, hence they were ridiculously overworked, and we couldn't get things done.
to expect those people to also babysit all the scientists would have been kind of absurd (and they couldn't have done it anyway).
(Score: 0) by Anonymous Coward on Friday August 26 2016, @09:42AM
They work fine for recording points in courses. I wouldn't consider that a toy application.
However I agree that in science work, they have no place.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @03:23PM
Spreadsheets are very useful for manually entering tables of data, and quickly displaying tables of data with colors and such, especially in situations where color-coding cells based upon their values provides valuable information. That stuff is quick to do, but is practically limited to a small amount of data that is easy to ingest by eye. However, doing anything more than just calculating sums, means, standard deviations, etc., it can be a very painful tool to use.
I cringe when I see Excel plots, particularly when the author hasn't taken the time to change the very ugly default layout. I've seen refereed papers that have ugly default Excel plots where the plot was reduced in size, but the axes labels stay the same resulting in a plot that is predominantly axis and labels at the expense of the plot region.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @08:52AM
Isn't extinguishing genes going a bit too far, Microsoft?
(Score: 2) by shrewdsheep on Friday August 26 2016, @12:12PM
You mean they first embraced: https://blogs.technet.microsoft.com/windowsserver/2015/05/06/microsoft-loves-linux/ [microsoft.com]
Then extended by means of mutation: https://qph.ec.quoracdn.net/main-qimg-042e163675346e42d9d5c4a3b220d37c-c?convert_to_webp=true [quoracdn.net]
And are now trying to extinghuish the very genes they targeted?
Wait, that was the wrong order this time, or is it the new extinguish-extend-embrace tactic [the power is in the old bugs]?
(Score: 0) by Anonymous Coward on Friday August 26 2016, @09:16AM
Even I have sent an email alerting someone about this issue. I usually don't even look at these gene csvs. I am sure it is widespread, but nothing new unless you thought people doing bio research weren't relying on emailing around closed source spreadsheets.
(Score: 5, Informative) by zocalo on Friday August 26 2016, @09:50AM
If you are going to pick a given tool for a task, you really ought to make sure that you know to make the tool accomplish the task. If you're importing pre-formatted data from elsewhere, e.g. a CSV, which seems the most likely case in Excel; right click on the top-left corner of the sheet, "Format Cells", "Text", "OK" - four mouse clicks and it's a non-issue. Hardly a "power user function" either. PEBKAC.
UNIX? They're not even circumcised! Savages!
(Score: 0) by Anonymous Coward on Friday August 26 2016, @09:56AM
Also, catching these 'errors' is rather trivial. Apply filter on the column with data and then see the filter value list. Any auto conversions as stated in the article are quickly spotted.
(Score: 1, Informative) by Anonymous Coward on Friday August 26 2016, @10:06AM
You can also disable all these auto corrections in the settings. Ditto with LibreOffice (which does it's own nonsense of this sort).
(Score: 3, Insightful) by janrinok on Friday August 26 2016, @02:03PM
So, you are suggesting that they do exactly what it says you should do in TFS? Well, who would have thought that....?
Blaming the messenger when it's a user problem - er no, I think TFS laid the blame quite squarely on the shoulders of the scientists using Excel, I even wrote an appropriate dept to make it obvious: 'not-so-bright-scientists dept'.
I suspect someone didn't read TFS closely enough :-)
(Score: 1) by kurenai.tsubasa on Friday August 26 2016, @03:23PM
Did like the dept line :)
I would heap the blame on the march towards “user friendliness.” User friendly is in the eyes of the user. It turns out that Excel really isn't that user friendly. Except if it came out of the box in a way that scientists and others who need to work with data in a rigorous manner may find more user friendly, legions of PHBs and accountants would blot out the sun with their irritated, angry helpdesk requests. “Why doesn't this stupid thing see that SEPT2 is when the next pay period is over?! What kind of autistic dweeb wrote this?!”
One would hope that the Everybody Can Code! thing would educate people the Excel isn't the only way and often isn't the best way to work with and present data, but one would hope for too much.
I think a lot of it is mostly starts from the irrational fear of the command line. I'm certain Google, Microsoft, Apple et al do have a vested interest in the idea that it's all magick powered by waldos under the hood. (Sure, one could argue a command prompt still doesn't constitute “under the hood,” since it isn't really.) code.org doesn't really do anything to dispel that notion afaict, but I digress.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @04:11PM
(Score: 0) by Anonymous Coward on Friday August 26 2016, @07:11PM
I don't blame the users when these kind of helpful features pop up unnanounced. This automatic data formatting has been a bane of mine as well, and it is something that gets foisted upon the user and get enabled by default. You also have to catch the error, which isn't always obvious when think that it is reasonable that before, when you entered your data, the cell ended up holding the data you typed, but now it works differently. Where I got bit in the ass with this was when entering dates, it was helping me out by reformatting it for me by assuming I wanted DD/MM/YY when I was typing in MM/DD/YY. I'm very sympathetic to users. Sure, you can go in and change this default behavior, once you found out about it, but what about the other helpful features that are enabled by default that you haven't discovered yet? The default on these programs is "do exactly what I tell you, not do what you think I mean" and let the user decide which features to enable.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @07:14PM
I meant the default should be do as I want. Left that part out in haste.
(Score: 5, Insightful) by pTamok on Friday August 26 2016, @09:57AM
It seems each generation is condemned to making the same mistakes.
When I was doing scientific programming three decades ago, it was very, very clear that data should be separate from programs. Data was input using text files, the format of which was documented extremely precisely, and your program read that file putting variables and constants into the correct structures in the program. Room for ambiguity was ruthlessly removed. This was when the language of choice for scientific programming was FORTRAN-77, not C. You validated your programs with test data files designed to check edge cases and oddities. You also scrubbed your data, looking for typos.
In my view Excel is not a suitable tool for serious data analysis. Better, more rigorous, testable tools exist. Working with Excel is like working with someone with sloppy bench habits in a lab: not cleaning up after spills, by-passing safety procedures, not keeping an accurate log-book. You can get away with that for a while, but eventually something blows up in your face - sometimes literally. If you can't be bothered to use the right tool for the job, you shouldn't be doing the job. It is a bad workman who blames his tools. The problems are not Excel's fault, but the fault of whoever chose Excel as the inadequate tool.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @10:41AM
Just because the tool might not be "proper" for the job in that one case, does not mean that the tool can't have the same failure in other, more proper, jobs.
(Score: 3, Insightful) by MostCynical on Friday August 26 2016, @09:59AM
the data *could* have been analyzed without problems, then dumped into excel for publication.
Some journals insist on "standard" formats (ie: microsoft-compatible), rather than "raw" data.
Racing to compete re-writes and submit on time, most researchers will confirm attachments are in the right format and open, not that they are 100% accurate copies of the data.
"I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
(Score: 2) by looorg on Friday August 26 2016, @11:12AM
I'm always amazed for all the different things people use Excel for. It's a great program for making some simple spreadsheets and stuff but it becomes horrible after that as far as I am concerned. Pages filled with formulas and generally all around bad programming of referring to frames/cells and doing things to them and then just spitting out the data someplace else. That said the program has clearly gotten functions for it. So the problem isn't so much the program but the people.
That said having a proper statistics software package isn't exactly any guarantee for proper stats usage. The most common fault I can think of it typing/scaling your data - everything just gets put down as being ratio or interval data even tho it's really nominal data. The program doesn't mind. The program doesn't care. The program doesn't know what your data is. It will happily compute it and do what you ask. But just what is the mean of male and female? or the average month of the year ... It's not the software package that makes or breaks the data - it's the idiot behind the keyboard. Users are just horribly sloppy with all their data and the processing of said data.
(Score: 2, Informative) by FatPhil on Friday August 26 2016, @11:19AM
Maybe there needs to be an assistant in Excel that would detect that what you're typing in might be better represented as a database, and pop up a little help balloon to inform you of this thinko?
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 4, Insightful) by WillR on Friday August 26 2016, @01:27PM
"spreadsheet easy, database hard"
IME it's more "spreadsheet installed by default, database requires weeks of fighting with Mordac the IT preventer"
(Score: 2) by FatPhil on Friday August 26 2016, @01:38PM
For the last 10 years I've worked at companies where they give you a laptop and say "install whatever OS you like, just make sure you're good at adminning it". (Not true, the current one has said "chose whichever laptop you like, and then ...".) I appreciate that puts me in a very tiny minority.
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: 2) by Arik on Friday August 26 2016, @02:45PM
Most people don't stop at the beginning of the project and really think about what tool would be best for it. They grab whatever they are familiar with and start hacking. So we have millions of people using spreadsheets to keep their databases and then complaining when the results are less than stellar.
If laughter is the best medicine, who are the best doctors?
(Score: 2) by FatPhil on Friday August 26 2016, @04:40PM
I also think it tries to display a date as 0.545454545 rather than 6/11.
Sturgeon was an optimist.
Great minds discuss ideas; average minds discuss events; small minds discuss people; the smallest discuss themselves
(Score: -1, Offtopic) by Anonymous Coward on Friday August 26 2016, @11:24AM
ttp://bbs.progrider.org/prog/read/1471949120
http://bbs.progrider.org/lounge/read/1471946796 [progrider.org]
Try some libre vidya
Tired of being tired, but don't want to de-bed?
Try something new (to you):
Info:
https://lgdb.org/game/chaosesque-anthology [lgdb.org] [lgdb.org]
http://www.moddb.com/games/chaosesqueanthology [moddb.com] [moddb.com]
https://sourceforge.net/projects/chaosesqueanthology/ [sourceforge.net]
Direct download:
http://downloads.sourceforge.net/project/chaosesqueanthology/Rel_93/XonoticChaosEsqueAnthologyREL93.iso?r=&ts=1471771440&use_mirror=liquidtelecom [sourceforge.net]
(Score: 0) by Anonymous Coward on Friday August 26 2016, @08:23PM
Parent post is more of his serial abuse. [soylentnews.org]
Soylentils: In order to call it to the attention of TPTB, **don't** mod it Offtopic. [archive.li]
(The S/N comment engine still strips out instances of %22 in URLs.) 8-(
...and to the site staffers:
Just how many instances of repeated abuse do there have to be before an AC's IP address is banned?
-- OriginalOwner_ [soylentnews.org]
(Score: 2) by janrinok on Saturday August 27 2016, @09:12AM
I would hope that we try not to censor anyone - just browse at level 0 or 1 and such things just disappear because the community has already down-modded them. But, we can ban if we have to.
I only have to reboot my router in a certain way and my ISP will allocate me a new IP - banning a specific IP is of limited value, and banning a block brings its own heap of problems. I'm sure that paulej72 will have a whole list of good reasons why we operate as we do, but that is not my part of ship, so to speak...
(Score: 0) by Anonymous Coward on Friday August 26 2016, @11:53AM
I used it yesterday to edit a CSV file and it started shifting data into neighboring cells for no reason. No, I didn't report the bug. Fix your POS software, Microsoft.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @01:57PM
Probably line breaks and tabs in your data it is interpreting literally instead of cell data.
(Score: 1) by three_sheets on Friday August 26 2016, @05:10PM
No line breaks. Maybe tabs. It was saved as CSV from an XLS file.
(Score: 2) by EvilSS on Friday August 26 2016, @05:27PM
Or commas. I run into this all the time, where a program offers to save data into a CSV, where commas are in the data. Open it up and you end up with random cell shifts.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @09:04PM
Yup. My jazz station uses MICROS~1's app (and inappropriate separator characters) to drive their Most Recently Played list [kkjz.org], so
Artist
Lambert, Hendricks, Ross
Becomes
Artist | Tune | Disk
Lambert | Hendricks | Ross
.
Not directly related, but amusing to me:
On their Most Recently Played list, my local Rock station (must be owned by Mormons or something) does funny things with the name of a Rolling Stones song. [archive.li]
Elton John's "The Bitch Is Back" gets a similar treatment as does the name of the band Moby Dick.
-- OriginalOwner_ [soylentnews.org]
(Score: 0) by Anonymous Coward on Friday August 26 2016, @12:18PM
It would seem to me that anybody going to the trouble of writing a scientific article would go to the trouble of checking their work.
Excel is both a terrible and wonderful tool.
Blaming it is just an excuse for being lazy.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @02:06PM
What hilarious and/or cringe-worthy 'corrections' have YOU seen?
Different localisations of spreadsheet programs use point or commas for decimals. Entering data in one version does not guarantee it will be properly read by the spreadsheet of a colleague (university, with colleagues having laptops in their own localisation). Even importing generated data from devices are sometimes tricky in this respect.
(Score: 2) by jdccdevel on Friday August 26 2016, @05:43PM
Saw this recently on a bearing manufacturer's website...
The ratings for the bearing were specified in both Pounds Force and Newtons. Whoever did the conversion from Newtons to Pounds used the wrong localization settings, using the comma as a decimal separator instead of a thousands marker.
It was really funny seeing a bearing rated for something like 53000 newtons, but only 12 Pounds!!!
(Score: 1) by three_sheets on Friday August 26 2016, @05:19PM
20 % refers to those papers with Excel data, not all papers. But it goes to show that Excel is not an appropriate tool for science. Every time I see an obvious Excel-generated plot, I cringe. Scientists should be smarter than that and have more pride in their work. To paraphrase: If the data is worth publishing it's worth analyzing well.
(Score: 0) by Anonymous Coward on Friday August 26 2016, @09:26PM
More specifically, schools that teach Science should put an emphasis on their students becoming proficient in a FOSS language (such as "R") intended to do these tasks.
...then again, M$ seems to be attempting to do an E,E,E on that.
They "R" Coming: More Microsoft Money for the Linux Foundation [techrights.org] July 2015
Introduction to Microsoft’s Growing Role in the R Project Through the Linux Foundation and E.E.E. [techrights.org] January 2016
-- OriginalOwner_ [soylentnews.org]
(Score: 2) by MichaelDavidCrawford on Friday August 26 2016, @11:40PM
A reproducible bug involving linked spreadsheets led Working Software to overdraw its checking account by $4,000.00.
Yes I Have No Bananas. [gofundme.com]
(Score: 2) by MichaelDavidCrawford on Friday August 26 2016, @11:45PM
In Word 5 for the Macintosh, if you spellchecked the word "childcare" it was reported as misspelled with "kidnaper" suggested as the replacement.
Yes I Have No Bananas. [gofundme.com]