from the dwarf-gold-fever-infecting-standards dept.
A group of Cambridge (UK) computer scientists has set a new gold standard for openness and reproducibility in research by sharing the more than 200 GB of data and 20,000 lines of code behind their latest results: an unprecedented degree of openness in a peer-reviewed publication. The researchers hope that this new gold standard will be adopted by other fields, increasing the reliability of research results, especially for work which is publicly funded.
The researchers are presenting their results at a talk today at the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI) in Oakland, California.
In recent years there's been a great deal of discussion about so-called 'open access' publications: the idea that research publications, particularly those funded by public money, should be made publicly available.
Computer science has embraced open access more than many disciplines, with some publishers sub-licensing publications and allowing authors to publish them in open archives. However, as more and more corporations publish their research in academic journals, and as academics find themselves in a 'publish or perish' culture, the reliability of research results has come into question.
http://phys.org/news/2015-05-gold-standard.html
[Also Covered By]: http://www.eurekalert.org/pub_releases/2015-05/uoc-ngs043015.php
[Source]: http://www.cam.ac.uk/research/news/new-gold-standard-established-for-open-and-reproducible-research
(Score: 3, Interesting) by zeigerpuppy on Tuesday May 05 2015, @01:09AM
There are some excellent resources for reproducible research,
particularly this book: https://www.crcpress.com/product/isbn/9781466572843 [crcpress.com]
(Score: 3, Insightful) by bradley13 on Tuesday May 05 2015, @06:18AM
I don't get it either. Back when I was active in research, I published my code and data online, for anyone to download. IIRC my entire research group at UT Austin did the same. Science is all about reproducible results; in computer science, reproducing results requires access to the underlying code and data.
Publishing your code and data has been voluntary up to now. The only thing that needs to change: all serious journals need to require full disclosure and put the links right in the articles. Referees should check that the disclosed information is available and complete.
One problem remains: people move around a lot, and leave a trail of data tied to old accounts that are eventually deleted, or at least no longer maintained. For published articles, a copy of the code and data should be hosted by the conference or by the publishers, so that it remains available long-term. Hey, the journals would actually be providing a tangible service, in return for their crazy prices!
Everyone is somebody else's weirdo.
(Score: 0) by Anonymous Coward on Tuesday May 05 2015, @01:19AM
Liars. You're all liars. All of you liars.
(Score: 0) by Anonymous Coward on Tuesday May 05 2015, @01:23AM
Reliable until they find the critical bug in those 20k lines of code!
(Score: 0) by Anonymous Coward on Tuesday May 05 2015, @05:55AM
That is exactly what it means.
(Score: 2) by Non Sequor on Tuesday May 05 2015, @01:54AM
Analysis of reproducibility and data sharing in scientific studies.
(For all I know that's the title. I didn't see an obvious link to the actual paper and the press release published by phys.org and Eurekalert doesn't mention the topic of the work. I have a suspicion that the university press relations staff asked the researchers to say something interesting about their work and what they came up with was that they released more data and code than any paper they could remember off the tops of their heads.)
Write your congressman. Tell him he sucks.
(Score: 2) by frojack on Tuesday May 05 2015, @02:59AM
The bottom link eventually states that they were studying data center efficiency.
All in all, I'm not impressed. All that data is probably non-transferable to any other data center, and any code written is probably spreadsheets (and 20,000 lines isn't much anyway). So to me it's not clear that any actual science was done, but even if there was some, the data dump is useless to anyone else.
I question the value of all that data and software, not only in this case, but in most fields of study. The point of re-doing any study is seldom to repeat it step by step hoping to find something different, or to chance upon an error. 100 repetitions of a flawed protocol are not useful. Just about nobody is going to reuse software developed for one study. (I've seen some scientists' software, and it wasn't pretty.)
Rather, you want to confirm the outcome, or refute the findings. If you develop your own study, and you confirm the result, that confirmation is much more valuable than repeating the prior study exactly, right down to using their software. If your results don't match then you can look for differences, or bugs in your design, or their design.
Clearly you can't repeat every study, due to the costs. In those cases, forensic examination of their data and their procedures, and even their software might be warranted.
But surely not in an efficiency study of one data center.
No, you are mistaken. I've always had this sig.
(Score: 2) by kaszz on Tuesday May 05 2015, @02:47AM
How have they solved the distribution of 200 GB of data?
(Score: 0) by Anonymous Coward on Tuesday May 05 2015, @03:03AM
We have this thing called the Internet.
(Score: 2) by Gravis on Tuesday May 05 2015, @03:15AM
no, somebody else did: bittorrent.
(Score: 2) by frojack on Tuesday May 05 2015, @03:44AM
So, about 25 full-length 1080p HD movies' worth of useless data? Seriously, who the hell is going to re-seed that?
No, you are mistaken. I've always had this sig.
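Whichever transport ends up carrying the 200 GB (torrents, plain HTTP, or institutional mirrors), large open-data releases are commonly accompanied by a checksum manifest so downloaders can verify what they fetched. A minimal sketch in Python of generating one in the format `sha256sum -c` accepts (the directory and file names here are hypothetical, not taken from the paper):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MB chunks, so a huge
    release never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_dir: Path, manifest: Path) -> int:
    """Write 'hash  relative/path' lines, the layout `sha256sum -c`
    understands. Returns the number of files hashed."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    with manifest.open("w") as out:
        for p in files:
            out.write(f"{sha256sum(p)}  {p.relative_to(data_dir)}\n")
    return len(files)

if __name__ == "__main__":
    # Hypothetical demo layout: one small data file, manifest written
    # outside the data directory so it doesn't hash itself.
    demo = Path("demo_data")
    demo.mkdir(exist_ok=True)
    (demo / "results.csv").write_bytes(b"trial,latency\n1,42\n")
    n = write_manifest(demo, Path("SHA256SUMS"))
    print(f"hashed {n} file(s)")
```

Downloaders would then run `sha256sum -c SHA256SUMS` inside the data directory to confirm nothing was corrupted or dropped in transit.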
(Score: 3, Informative) by WhiteSpade on Tuesday May 05 2015, @08:24AM
I work in a neuroscience research lab, and we distribute our data (~400GB worth) using quite a few different methods [studyforrest.org], including git-annex [branchable.com]. Others in my lab are working to take git-annex one step further with datalad [github.com] (terrible name, I know) to make it easier for scientists to discover, use, manage, and share their data.
On the non-distribution side, we publish all our code, all our data, hell even the paper [f1000research.com] for the above data is open source on GitHub [github.com]. The makefile generates the paper with all figures.
Our tools are all open source. Any tools we write (such as pymvpa [pymvpa.org]) are released as open source, and we do all our collection and analysis on Neurodebian [debian.net], where we package and distribute neuroscience software for Debian/*buntu. You can say we believe in open science. ;-)
I'm glad to see others publishing like this. It is unfortunately all too rare, but it's slowly becoming more and more common.
---Alex