posted by martyb on Wednesday July 26 2017, @10:39AM   Printer-friendly
from the probably-a-good-idea dept.

Statistician Valen Johnson and 71 other researchers have proposed a redefinition of statistical significance in order to cut down on irreproducible results, especially those in the biomedical sciences. They propose "to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005" in a preprint article that will be published in an upcoming issue of Nature Human Behavior:

A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005.

Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article [open, DOI: 10.17605/OSF.IO/MKY9J] [DX] on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."

But other scientists reject the idea of any absolute threshold for significance. And some biomedical researchers worry the approach could needlessly drive up the costs of drug trials. "I can't be very enthusiastic about it," says biostatistician Stephen Senn of the Luxembourg Institute of Health in Strassen. "I don't think they've really worked out the practical implications of what they're talking about."

They have proposed a P-value of 0.005 because it corresponds to Bayes factors between approximately 14 and 26 in favor of H1 (the alternative hypothesis), indicating "substantial" to "strong" evidence, and because it would reduce the false positive rate to levels they have judged to be reasonable "in many fields".
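As a rough illustration of how a P-value translates into a Bayes factor, the sketch below (not taken from the paper) uses the well-known Sellke–Bayarri–Berger upper bound BF ≤ 1/(-e·p·ln p): at p = 0.005 the bound is about 14, matching the low end of the range the authors cite, while at the traditional p = 0.05 it is only about 2.5.

```python
import math

def bayes_factor_upper_bound(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor for H1 over H0,
    valid for p < 1/e: BF <= 1 / (-e * p * ln p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound only holds for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor upper bound ~ {bayes_factor_upper_bound(p):.1f}")
# p = 0.05:  ~2.5
# p = 0.005: ~13.9
```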

Is this good enough? Is it a good start?

OSF project page. If you have trouble downloading the PDF, use this link.


Original Submission

 
  • (Score: 5, Insightful) by melikamp on Wednesday July 26 2017, @05:44PM (2 children)


    In software development we call these things Magic Numbers.

    Indeeedy. Let's take TFA apart, shall we. Here are some things (most) authors admit are not addressed in their proposal:

    The proposal does not address multiple hypothesis testing, P-hacking, publication bias, low power, or other biases (e.g., confounding, selective reporting, measurement error), which are arguably the bigger problems.

    And more:

    Changing the significance threshold is a distraction from the real solution, which is to replace null hypothesis significance testing (and bright-line thresholds) with more focus on effect sizes and confidence intervals, treating the P-value as a continuous measure, and/or a Bayesian method.

    Emphasis courtesy of yours truly.

    I believe these are all legitimate concerns which must be voiced alongside a proposal like this. If the authors hadn't acknowledged them, I'd simply be laughing TFA out of the room. Still, I believe they are turning a blind eye to a giant fallacy that is being rammed through. It's not just that the proposal will do nothing to help with publication bias; it will quite probably make it worse.

    The authors do seem to disagree on a range of issues, but they have one thing in common: they all drink out of the bullshit fountain known as the frequentist interpretation. It's not even that they all share the belief (at least some of the authors seem to come from the Bayesian camp), but they all seem to be OK with letting frequentists off the hook.

    Failing to reject the null hypothesis does not mean accepting the null hypothesis.

    Indeed, when one is a strong believer in frequentist dogma, it doesn't. Here's an example, for those who are not familiar with the issue at hand. Suppose we have two drugs on the shelf, X and Y. Both drugs treat the same condition, say, acne. Both went through similar clinical efficacy trials and were found about equal in effect. However, drug X, unlike drug Y, also went through another comprehensive study, which controlled for dozens of variables and yielded a confidence interval for an increase in blood pressure. So let's say this latter study produced a 99.9% confidence interval (a higher confidence level than the 99.5% implied by the 0.005 threshold they are proposing) covering zero (meaning no change in blood pressure), and ended with a P-value of 0.5 and the frequentist conclusion "the data does not provide sufficient evidence to conclude that X increases blood pressure". According to the authors, this result should fail the "golden standard of significance", and I will discuss what that means shortly.

    For now, imagine yourself having high blood pressure problems. You are standing in the drug store, choosing between X and Y. Of course, if one drug doesn't work, you will try the other. But are you going to choose X or Y first? And if you say X, like any normal person who understands statistics would, you have just tacitly admitted that the frequentist interpretation is utter bullshit. You can't keep claiming with a straight face that the result was statistically insignificant if it more or less tied your hands with respect to your actions in the drug store. It had a huge sample size, variable control up the wazoo, and an amazing confidence level by drug standards. This should be considered as statistically significant as it gets.
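    To make the drug example concrete, here is a quick Python sketch with invented numbers (purely illustrative, not from any real trial): a large study whose 99.9% confidence interval is narrow and straddles zero, and whose P-value is nowhere near any threshold. By the frequentist nomenclature it is "inconclusive"; in practice it is strong evidence that any blood-pressure effect is tiny.

    ```python
    import math
    from scipy import stats

    # Hypothetical summary statistics for the drug-X blood-pressure study
    n = 20_000          # participants
    mean_change = 0.05  # mmHg, observed mean change in blood pressure
    sd = 10.0           # mmHg, sample standard deviation

    se = sd / math.sqrt(n)               # standard error of the mean
    z = mean_change / se                 # test statistic for H0: no change
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided P-value

    z_crit = stats.norm.ppf(1 - 0.001 / 2)  # critical value for a 99.9% CI
    ci = (mean_change - z_crit * se, mean_change + z_crit * se)

    print(f"P-value = {p_value:.2f}")                     # ~0.48: fails any threshold
    print(f"99.9% CI = ({ci[0]:.2f}, {ci[1]:.2f}) mmHg")  # ~(-0.18, 0.28), covers zero
    ```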

    So let's bring this back and see how the reality stacks up with TFA's proposal. What was the goal, anyway?

    The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on "statistically significant" findings.

    Is that what we are trying to fix? Non-reproducibility? What does the P-value even have to do with that? If we want fewer duds, that is, studies yielding either a Type I or a Type II error, we just need to increase sample sizes, as the authors themselves say, by roughly 70%. We can make 0.005 the gold standard significance level, for example. Then statisticians will have a choice: either (1) bump the sample size and produce the same number of more significant (and hence harder to knock down) null rejections, or (2) leave the sample size the same and produce fewer null rejections and more "inconclusive" studies. Either way, what they usually regard as reproducibility of a "demonstrated effect" will go up. The only "downside" of this approach is that more studies will fail to reject the null, and as we know, these kinds of studies are three times harder to publish [wikipedia.org], which brings us to the last twist of this rant.
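    That 70% figure is easy to sanity-check with a textbook power calculation (a sketch under standard assumptions, two-sided z test at 80% power, not code from TFA): the required sample size scales with (z_alpha/2 + z_beta)^2, and shrinking alpha from 0.05 to 0.005 inflates that factor by roughly 1.7.

    ```python
    from scipy import stats

    def sample_size_factor(alpha, power=0.80):
        """(z_{alpha/2} + z_beta)^2, proportional to required n for a two-sided z test."""
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_beta = stats.norm.ppf(power)
        return (z_alpha + z_beta) ** 2

    increase = sample_size_factor(0.005) / sample_size_factor(0.05) - 1
    print(f"required sample size grows by ~{100 * increase:.0f}%")  # ~70%
    ```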

    The P-value fetish, IMHO, will backfire bigly, since the root of the problem is not the exact value of the "magic number" 0.05, such as it being too high. The root of the evil is that the frequentist sect keeps calling some null-rejecting results "statistically significant" while throwing all others onto the "inconclusive" heap, and this nomenclature, IMHO, is one of the major factors driving publication bias.

    The "golden standard" for the P-value, if we need to set it at all, should be 1.0. Every study is statistically significant, how can it not be? What exactly prevents us from "accepting the null" besides a certain almost-religious conviction? Why can't we call an "inconclusive study" such as the XY example above "statistically significant"? We sure as hell act as if it is statistically significant when our health and our money are on the line.

    Obviously some studies are more significant than others, and we can argue till the Sun burns out on what metrics we should use to compare the relative "statistical significance", or how we can score the "statistical significance" on a continuous scale. Regardless, passing a value judgment on a study based simply on where the confidence interval has landed accomplishes nothing or worse.

    And while I do not for a moment suspect the authors of acting with an ulterior motive, I can't help thinking that all these professional researchers are banding together simply because they want a publication environment with higher stakes than the current one. It is ludicrous not to notice the relationship between regarding some studies as "significant" and others as merely "suggestive", both pure value judgments, and the chances of a study being published, and yet the authors do not seem to care about that. Think about what their proposal is actually likely to do:

    (1) It will become harder (more expensive) to create a "significant" result, making publication easier for big name researchers affiliated with big name places.

    (2) It will become harder (more risky) to knock down a "detected effect", thus discouraging reproduction of studies. Of course, not bothering to reproduce will likely decrease the number of discredited results, but then we could just as well never try to reproduce anything. The authors tout a 50% reduction in false positives: a curious presumption in a world where the pressure to reject will go up, publication bias will increase, and few will ever dare to repeat a study. Again, the big shots benefit by virtue of being able to churn out studies which may well be total bunk but are too expensive to duplicate, saving them from embarrassment.

    (3) Nothing at all will be done to discourage people from regarding null rejection as better than non-rejection (which should rightly be called "confirmation", if not "acceptance"). The frequentist interpretation is bunk, and it's time to stop taking it for granted or regarding it as legitimate. The new P-value threshold of 0.005, just like any other magic number besides 1.0, will simply provide more validation for the garbage philosophy.

  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @06:03PM


    The "golden standard" for the P-value, if we need to set it at all, should be 1.0.

    As someone who believes NHST is the most destructive meme science has ever encountered (far more dangerous than any type of religious objection to science), I actually love this idea.

  • (Score: 2) by cafebabe on Wednesday July 26 2017, @09:57PM


    I considered the possibility that they might be trolling with an absurd reduction for the purpose of getting the bad practice abandoned. Now I'm concerned that established players are entrenching bad practice for their own benefit. I believe there was a Mark Twain quote of the form "There was never a gathering of professionals for the betterment of mankind."
