
posted by cmn32480 on Friday June 19 2015, @06:47AM
from the big-data-little-analysis dept.

Dramatic increases in data science education coupled with robust evidence-based data analysis practices could stop the scientific research reproducibility and replication crisis before the issue permanently damages science's credibility, asserts Roger D. Peng in an article in the newly released issue of Significance magazine.

"Much the same way that epidemiologist John Snow helped end a London cholera epidemic by convincing officials to remove the handle of an infected water pump, we have an opportunity to attack the crisis of scientific reproducibility at its source," wrote Peng, who is associate professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health.

In his article titled "The Reproducibility Crisis in Science"—published in the June issue of Significance, a statistics-focused, public-oriented magazine published jointly by the American Statistical Association (ASA) and Royal Statistical Society—Peng attributes the crisis to the explosion in the amount of data available to researchers and their comparative lack of analytical skills necessary to find meaning in the data.

"Data follow us everywhere, and analyzing them has become essential for all kinds of decision-making. Yet, while our ability to generate data has grown dramatically, our ability to understand them has not developed at the same rate," he wrote.

This analytics shortcoming has led to some significant "public failings of reproducibility," as Peng describes them, across a range of scientific disciplines, including cancer genomics, clinical medicine and economics.

The original article came from phys.org.

[Related]: Big Data - Overload


Original Submission

 
This discussion has been archived. No new comments can be posted.
  • (Score: 0) by Anonymous Coward on Friday June 19 2015, @08:21AM (#198158)

    The alternate hypothesis is the one you want to check.

    Why? The null hypothesis is the specific one. The alternative is so vague as to be useless... It could mean anything. Once again this was dealt with long ago by Paul Meehl:

    If I tell you that Meehl’s theory of climate predicts that it will rain sometime next April, and this turns out to be the case, you will not be much impressed with my “predictive success.” Nor will you be impressed if I predict more rain in April than in May, even showing three asterisks (for p < .001) in my t-test table! If I predict from my theory that it will rain on 7 of the 30 days of April, and it rains on exactly 7, you might perk up your ears a bit, but still you would be inclined to think of this as a “lucky coincidence.” But suppose that I specify which 7 days in April it will rain and ring the bell; then you will start getting seriously interested in Meehl’s meteorological conjectures. Finally, if I tell you that on April 4th it will rain 1.7 inches (.66 cm), and on April 9th, 2.3 inches (.90 cm) and so forth, and get seven of these correct within reasonable tolerance, you will begin to think that Meehl’s theory must have a lot going for it. You may believe that Meehl’s theory of the weather, like all theories, is, when taken literally, false, since probably all theories are false in the eyes of God, but you will at least say, to use Popper’s language, that it is beginning to look as if Meehl’s theory has considerable verisimilitude, that is, “truth-likeness.”

    Meehl, P. E. (1978). Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology. Journal of Consulting and Clinical Psychology, 46, 806-834. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf [psu.edu]
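
    To make Meehl's point concrete, here is a quick simulation (my own sketch, not from his paper): a vague directional prediction like "more rain in April than in May" can earn its three asterisks while saying almost nothing.

        # Hypothetical daily rainfall (mm) for many observed Aprils and Mays;
        # April is assumed to be only slightly wetter on average.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        april = rng.gamma(shape=2.0, scale=1.10, size=3000)
        may = rng.gamma(shape=2.0, scale=1.00, size=3000)

        t, p = stats.ttest_ind(april, may, equal_var=False)
        print(f"t = {t:.2f}, p = {p:.2e}")  # typically p << .001

        # The "prediction" confirmed here is only a sign (April > May); a theory
        # that said which days it rains, or how much, would be taking a far
        # greater risk and would be far better corroborated if it succeeded.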

  • (Score: 0) by Anonymous Coward on Friday June 19 2015, @10:44AM (#198182)

    Why?

    If you didn't know why, then there would be no point in checking it, would there?

    I mean, nobody comes up with hypotheses by throwing dice or something. Or at least if someone does, I'd see that as a good reason not to trust him.

    But of course how to come up with that hypothesis is not a subject of a statistics class. So as far as statistics is concerned that hypothesis is arbitrary.

    It's just like how Newtonian mechanics will happily let you calculate the trajectory of a moon at any given position and speed. Whether it makes sense to calculate that trajectory is not a question of Newtonian mechanics. As far as Newtonian mechanics is concerned, that position and speed are arbitrary.

    Of course in practice you'll not insert arbitrary values, but values of the real moon, or values of a satellite that you want to put there. But that's not the business of Newtonian mechanics, that's the business of whatever field uses Newtonian mechanics for its calculations. An astronomer will be interested in the orbit of an existing object, a space agency may be more interested in the possible orbits of a satellite, and an artillery expert couldn't care less about orbits, but will want to calculate trajectories of projectiles on earth. All of them use Newtonian mechanics, but for each of them, what values make sense to put into it and what values don't is different.
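
    (A rough sketch of my own, not something from the article, of that point: the same few lines of mechanics serve all three of them, and only the inputs differ.)

        # Semi-implicit Euler integration of a point mass under Earth's gravity.
        # The mechanics accepts *any* initial position and velocity; whether they
        # describe a moon, a planned satellite, or a shell is not its business.
        import numpy as np

        G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
        M = 5.972e24    # mass of Earth, kg

        def trajectory(r0, v0, dt=1.0, steps=6000):
            """Positions of a body starting at r0 (m) with velocity v0 (m/s)."""
            r, v = np.array(r0, float), np.array(v0, float)
            path = [r.copy()]
            for _ in range(steps):
                a = -G * M * r / np.linalg.norm(r) ** 3
                v = v + a * dt
                r = r + v * dt
                path.append(r.copy())
            return np.array(path)

        orbit = trajectory([7.0e6, 0.0], [0.0, 7.5e3])      # satellite-like orbit
        shell = trajectory([6.371e6, 0.0], [100.0, 400.0])  # suborbital lob
        print(orbit.shape, shell.shape)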

    With statistics, it's the same. Statistics doesn't care where your hypotheses or your data come from. Whether a certain hypothesis makes sense is not a matter of statistics; it is a matter of whatever field the statistics is applied to. Asking statistics to answer that question is wrong, and therefore asking for that question to be answered in a statistics course is wrong.

    And as for your quote: if the theory tells me, for every month of several years, exactly the number of days it rains in that month, I'll certainly take the theory very seriously even if it cannot tell me the exact dates of the rain.

    Yes, a prediction for just one month is not convincing. But that's not because it's statistics; it's because it's insufficient data.

    • (Score: 0) by Anonymous Coward on Friday June 19 2015, @02:57PM (#198260)

      I meant: why is the alternative hypothesis the one I am interested in? Why not set the null hypothesis to that? That would make much more sense, but I was taught to do the opposite.

      • (Score: 0) by Anonymous Coward on Friday June 19 2015, @09:34PM (#198441)

        I meant: why is the alternative hypothesis the one I am interested in?

        Because it's the one you want to test.

        Why not set the null hypothesis to that?

        Because then it wouldn't be a "null" hypothesis. The "null" hypothesis is the hypothesis that the variable or whatever it is you're looking at has no effect, while the "alternate" hypothesis (the other hypothesis, the non-null one) is the hypothesis that the variable or whatever you're looking at does have an effect.

        Your teacher must have been terrible if you don't even understand what a null hypothesis is supposed to be.
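
        To spell that out with a minimal sketch (mine, not the commenter's), using a coin-flip example:

            # H0 (the null): the coin is fair, heads probability = 0.5 -- "no effect".
            # H1 (the alternate): the coin is not fair -- "some effect", left vague.
            from scipy import stats

            heads, flips = 62, 100
            result = stats.binomtest(heads, flips, p=0.5)
            print(f"p-value under H0: {result.pvalue:.3f}")

            # A small p-value is evidence against H0; it does not say which bias
            # the coin has, only that "no bias" looks implausible -- which is
            # exactly the vagueness the earlier poster was complaining about.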

        • (Score: 0) by Anonymous Coward on Friday June 19 2015, @10:03PM (#198451)

          The null hypothesis is the hypothesis to be nullified. It can be anything. In practice it has usually come to be "no effect, no correlation" (this is called the "nil null" hypothesis), and that is the problem. The null hypothesis is nearly always false in that case; the only time it isn't is when you are looking for differences regarding something that does not exist (e.g. ESP). It is easily proved that two groups of people did not come from the same hypothetical infinite distribution: the infinite hypothetical distribution is not real, therefore the two groups could not have been sampled from it.

          If you mess up the experiment, the data analysis, or the data entry, that will exaggerate this. If there are any differences at baseline, they will exaggerate the magnitude of the deviation from the nil null hypothesis. If there are non-causative correlations in play, they will exaggerate it too. The nil null is false to begin with, and literally anything else will magnify that. Rejecting the nil null hypothesis therefore contains no useful information; if you think it does, you are confused (a short simulation at the end of this comment makes the point concrete). This idea has caused mass confusion, just as Ronald Fisher predicted it would. And he is partly to blame, for popularizing the p-value!

          "We are quite in danger of sending highly trained and highly intelligent young men out into the world with tables of erroneous numbers under their arms, and with a dense fog in the place where their brains ought to be. In this century, of course, they will be working on guided missiles and advising the medical profession on the control of disease, and there is no limit to the extent to which they could impede every sort of national effort."

            Fisher, R. A. (1958). "The Nature of Probability" (PDF). Centennial Review 2: 261–274. http://www.york.ac.uk/depts/maths/histstat/fisher272.pdf [york.ac.uk]

          The guy who came up with this nil null idea is likely the same one who invented the ACT. It appears he was writing an introductory statistics textbook for educators in the late 1930s and conflated two different approaches to statistics (Neyman-Pearson's and Fisher's). I recommend anyone interested in science put some effort into learning about this.
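
          Here is the promised simulation (my own sketch, assuming nothing beyond the argument above) of why the nil null is practically always rejectable: hold a trivially small true difference fixed and watch what happens as the sample grows.

              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(1)
              tiny_shift = 0.02  # a difference nobody would care about, in SD units

              for n in (100, 10_000, 1_000_000):
                  a = rng.normal(0.0, 1.0, n)
                  b = rng.normal(tiny_shift, 1.0, n)
                  p = stats.ttest_ind(a, b).pvalue
                  print(f"n = {n:>9,}  p = {p:.3g}")

              # With n large enough the nil null is rejected, as it would be for
              # almost any two real-world groups -- which is why rejecting it,
              # by itself, carries no useful information.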