
posted by martyb on Wednesday July 26 2017, @10:39AM
from the probably-a-good-idea dept.

Statistician Valen Johnson and 71 other researchers have proposed a redefinition of statistical significance in order to cut down on irreproducible results, especially those in the biomedical sciences. They propose "to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005" in a preprint article that will be published in an upcoming issue of Nature Human Behavior:

A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005.

Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article [open, DOI: 10.17605/OSF.IO/MKY9J] [DX] on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."

But other scientists reject the idea of any absolute threshold for significance. And some biomedical researchers worry the approach could needlessly drive up the costs of drug trials. "I can't be very enthusiastic about it," says biostatistician Stephen Senn of the Luxembourg Institute of Health in Strassen. "I don't think they've really worked out the practical implications of what they're talking about."

They have proposed a P-value of 0.005 because it corresponds to Bayes factors between approximately 14 and 26 in favor of H1 (the alternative hypothesis), indicating "substantial" to "strong" evidence, and because it would reduce the false positive rate to levels they have judged to be reasonable "in many fields".
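
For a rough sense of where the "approximately 14 and 26" range comes from, the well-known Sellke-Bayarri-Berger upper bound on the Bayes factor implied by a p-value, BF <= -1/(e * p * ln p), can be computed directly. This is only a back-of-the-envelope sketch; the paper itself derives its range from specific priors on the alternative hypothesis.

import math

def bayes_factor_bound(p):
    """Sellke-Bayarri-Berger upper bound on the evidence for H1 over H0
    implied by a p-value (valid for p < 1/e)."""
    return -1.0 / (math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor at most ~{bayes_factor_bound(p):.1f}")
# p = 0.05:  at most ~2.5  (weak evidence)
# p = 0.005: at most ~13.9 (roughly the low end of the paper's 14-26 range)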

Is this good enough? Is it a good start?

OSF project page. If you have trouble downloading the PDF, use this link.


Original Submission

 
  • (Score: 3, Informative) by cafebabe on Wednesday July 26 2017, @11:27AM (11 children)

    by cafebabe (894) on Wednesday July 26 2017, @11:27AM (#544588) Journal

    There is an O(n^2) problem in many statistical studies. Automation means that an increasing number of variables can be collected and Big Data allows a retrospective trawl through more variables. This data is then exhaustively cross-correlated and then, dammit, *something* of significance is found because the mantra is "publish or perish". Having any generally agreed threshold of significance will simultaneously discourage original research and encourage meta-studies and suchlike. Raising the threshold discourages new avenues of research and increases noise.
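
    To illustrate the scaling (a toy simulation on pure noise, with made-up variable counts): the number of pairwise tests grows quadratically, and roughly 5% of them will clear p < 0.05 by chance alone.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)   # fixed seed, pure-noise data
    n_samples = 100

    for n_vars in (10, 20, 40):                      # doubling the number of variables...
        data = rng.normal(size=(n_samples, n_vars))  # ...of mutually unrelated noise
        p_values = []
        for i in range(n_vars):
            for j in range(i + 1, n_vars):           # every pairwise cross-correlation
                _, p = stats.pearsonr(data[:, i], data[:, j])
                p_values.append(p)
        hits = sum(p < 0.05 for p in p_values)
        print(f"{n_vars:2d} variables: {len(p_values):3d} pairs, "
              f"{hits:2d} at p < 0.05, smallest p = {min(p_values):.4f}")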

    --
    1702845791×2
  • (Score: 3, Insightful) by Virindi on Wednesday July 26 2017, @11:44AM (5 children)

    by Virindi (3484) on Wednesday July 26 2017, @11:44AM (#544598)

    And worse, working with preexisting data makes it very tempting to fit your hypothesis to the data. Same with the general disdain for "our hypothesis was disproved" papers.

    That's the real problem that needs to be addressed and changing p-values does not directly address it. The probability of SOME pattern appearing in random noise is high, and people are picking their theory to fit that pattern. Then they are using statistical methods based on a "formulate hypothesis"->"gather data"->"check against hypothesis" model. Big Data is the worst for this, it seems.

    • (Score: 2) by FakeBeldin on Wednesday July 26 2017, @12:26PM (3 children)

      by FakeBeldin (3360) on Wednesday July 26 2017, @12:26PM (#544605) Journal

      The model you quote seems apt.
      My worry is that nowadays it more often seems to be:

      gather data -> formulate hypothesis -> investigate data -> adapt hypothesis to investigation -> check hypothesis -> publish

      Validation sets are too often used to formulate the hypothesis.
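
      A minimal sketch of the alternative (hypothetical data and split sizes): formulate the hypothesis on an exploration half, and spend the held-out half exactly once, on the final confirmatory test.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)
      data = rng.normal(size=(200, 2))            # stand-in for collected observations

      explore, confirm = data[:100], data[100:]   # split once, up front

      # Exploration half: dig around freely, settle on ONE hypothesis.
      r_explore, _ = stats.pearsonr(explore[:, 0], explore[:, 1])
      print(f"exploratory correlation: {r_explore:+.3f}")

      # Confirmation half: touched exactly once, only for the pre-specified test.
      r_confirm, p_confirm = stats.pearsonr(confirm[:, 0], confirm[:, 1])
      print(f"confirmatory correlation: {r_confirm:+.3f}, p = {p_confirm:.3f}")
      # Adapting the hypothesis after peeking at the confirmation half turns it
      # back into an exploration set and voids the reported p-value.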

      • (Score: 1) by Virindi on Wednesday July 26 2017, @02:19PM

        by Virindi (3484) on Wednesday July 26 2017, @02:19PM (#544644)

        Yep that's what I was saying :)

        It's lazy mode.

        Then of course there is the whole other category of "models which we can't properly test so we just rely on care and the authors being at a good institution", which is a similar problem.

      • (Score: 2) by cafebabe on Wednesday July 26 2017, @02:41PM (1 child)

        by cafebabe (894) on Wednesday July 26 2017, @02:41PM (#544653) Journal

        It would be an improvement if multiple theories were proposed and theories which didn't fit were discarded. This may appear less honed but tweaking a hypothesis prior to publication is akin to one of Rudyard Kipling's Just So Stories. Science should have predictive power and be falsifiable. If there is nothing to predict and nothing to falsify then it isn't science.

        --
        1702845791×2
        • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @03:58PM

          by Anonymous Coward on Wednesday July 26 2017, @03:58PM (#544694)

          It would be an improvement if multiple theories were proposed and theories which didn't fit were discarded.

          Improvement? Without that you have no science.

    • (Score: 2) by maxwell demon on Wednesday July 26 2017, @07:07PM

      by maxwell demon (1608) on Wednesday July 26 2017, @07:07PM (#544804) Journal

      And worse, working with preexisting data makes it very tempting to fit your hypothesis to the data. Same with the general disdain for "our hypothesis was disproved" papers.

      On the other hand, you do want some means against "we invent a wild hypothesis just in order to promptly disprove it". You don't want articles like:

      Watching Doctor Who does not cause broken legs

      Are you more likely to break your leg if you regularly watch Doctor Who? Comparing the number of fractures among watchers of Doctor Who with those among watchers of Star Trek or Babylon 5 showed no correlation. The comparison between Star Trek and Babylon 5 watchers was inconclusive; more research is required.

      --
      The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @02:26PM (3 children)

    by Anonymous Coward on Wednesday July 26 2017, @02:26PM (#544647)

    Not really: increasing the amount of data you're crunching doesn't guarantee better results if the underlying data is crap. For example, it doesn't matter how many shoe sizes you've collected if you're trying to determine the kind of paintings somebody likes. The two things are effectively unrelated, so you're not going to get a meaningful result. It gets even worse when you start combining more and more things.

    The data sciences are getting to be a cargo cult where companies keep collecting more and more data hoping to figure out what to do with it, but not paying attention to other issues like contamination.

    Raising the threshold reduces the noise because it means you need a stronger correlation before something gets reported. Yes, it does somewhat discourage new research, but let's be honest about the way people have used new research to justify all sorts of things, only to find out that it was a fluke or a mistake. You can still do new research; the problem is that, like replication studies, it's not sexy, so it can be a challenge to get funding for it, even though it's a terribly important part of the process.

    The other thing this does is slow the pace of advancement somewhat, since we'd need to be more certain than the current recommended value requires. But, let's be honest, for the most part we're at a point where we can afford to slow research down in order to get results that are an order of magnitude more reliable. What we can't afford is a pile of unreliable science that we can't even be sure is right.
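
    For a rough sense of the extra cost (a back-of-the-envelope sketch using the standard two-sided z-test sample-size formula, ignoring the specifics of any particular trial design): keeping 80% power while dropping alpha from 0.05 to 0.005 needs roughly 70% larger samples.

    from scipy.stats import norm

    def relative_sample_size(alpha, power=0.8):
        """Two-sided z-test sample size, up to a constant, for a fixed effect size:
        n is proportional to (z_{alpha/2} + z_{power})^2."""
        return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

    ratio = relative_sample_size(0.005) / relative_sample_size(0.05)
    print(f"n(alpha=0.005) / n(alpha=0.05) ~ {ratio:.2f}")   # ~1.70, i.e. about 70% more subjects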

    • (Score: 2) by cafebabe on Wednesday July 26 2017, @03:31PM (2 children)

      by cafebabe (894) on Wednesday July 26 2017, @03:31PM (#544675) Journal

      The two things are effectively completely dissimilar and as such, you're not going to get a meaningful result. It gets even worse when you start combining more and more things.

      Someone may have to correct my figures but, as I understand it, accuracy is proportional to the square root of the number of samples. So, doubling sample quality requires quadrupling the number of samples. (Workload increases by a factor of four to gain one additional bit of accuracy.) Improving accuracy by a factor of 10 requires more than three quadruplings of the sample data. With far less effort (and cost), it is easier to collect more variables. Cross-correlations may be completely random but opportunities to find a pattern are O(n^2). If any correlation meets an arbitrary standard then it is a positive result to publish even if it cannot be replicated.
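
      A quick numeric check of that scaling, using the standard error of the mean as a stand-in for 'accuracy':

      import math

      sigma = 1.0                                  # per-sample noise, arbitrary units
      for n in (100, 400, 1600, 6400, 10000):
          sem = sigma / math.sqrt(n)               # standard error of the mean ~ 1/sqrt(n)
          print(f"n = {n:5d}: standard error ~ {sem:.4f}")
      # Each quadrupling of n halves the error; a 10x improvement needs 100x the samples,
      # which falls between three quadruplings (64x) and four (256x).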

      --
      1702845791×2
      • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @03:40PM

        by Anonymous Coward on Wednesday July 26 2017, @03:40PM (#544681)

        You're talking about precision, not accuracy.

        If your input data are, for some reason, skewed to give a misleading result, then a larger data set will not improve accuracy. You will, with greater precision, zero in on your skewed answer.

      • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @05:43PM

        by Anonymous Coward on Wednesday July 26 2017, @05:43PM (#544758)

        If you have biased sampling, data that's not applicable, or just plain weird data, adding more of it won't help.

        You have to have a decent model and decent data to have any hope of making a meaningful conclusion. The Stanford Prison Experiment never replicated because they randomly found more psychopaths than normal. The study was fine, but adding more data points would only help if they weren't selecting from a population with an abnormal number of psychopaths. Otherwise they'd get the same results with more decimal places.

  • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @07:25AM

    by Anonymous Coward on Thursday July 27 2017, @07:25AM (#545035)

    One thing you're ignoring: the birthday paradox completely disappears once you attach a specific date to the matter in advance, which is what the research would culminate in. And a low p-value makes it increasingly apparent when less-than-ethical researchers are just retrofitting science to data. A lower p threshold not only makes it more likely that the published data is valid, but also makes it easier to weed out bad apples who are not abiding by the scientific process.

    Taken to extremes, if you had a p threshold of 0.000001 or whatever, then it would be almost a certainty either that whatever's published reflects a real, pervasive effect, OR that the individual publishing said data went from data to hypothesis rather than vice versa. You're completely right that this discourages 'original research', but what that translates to in reality is that it discourages people from seeking to confirm their biases with correlations. This is an enormous problem in the social sciences. To a lesser degree it's even a problem in genetic research (and medicine), which are still stuck in the realm of correlations that fail with a healthy degree of regularity. Reducing the viability of this sort of research is a good thing.
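
    To put a number on the 'more likely to be valid' point (purely illustrative assumptions: 1-in-10 prior odds that a tested hypothesis is true, 80% power):

    def false_positive_share(alpha, power=0.8, prior_odds=0.1):
        """Fraction of 'significant' findings that are false positives, given the
        test's power and the prior odds that a tested effect is real."""
        true_hits = power * prior_odds
        false_hits = alpha
        return false_hits / (false_hits + true_hits)

    for alpha in (0.05, 0.005):
        share = false_positive_share(alpha)
        print(f"alpha = {alpha}: ~{share:.0%} of 'significant' results would be false positives")
    # alpha = 0.05:  ~38%
    # alpha = 0.005: ~6%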