
posted by martyb on Wednesday July 26 2017, @10:39AM
from the probably-a-good-idea dept.

Statistician Valen Johnson and 71 other researchers have proposed a redefinition of statistical significance in order to cut down on irreproducible results, especially those in the biomedical sciences. They propose "to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005" in a preprint article that will be published in an upcoming issue of Nature Human Behaviour:

A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005.

Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article [open, DOI: 10.17605/OSF.IO/MKY9J] [DX] on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."

But other scientists reject the idea of any absolute threshold for significance. And some biomedical researchers worry the approach could needlessly drive up the costs of drug trials. "I can't be very enthusiastic about it," says biostatistician Stephen Senn of the Luxembourg Institute of Health in Strassen. "I don't think they've really worked out the practical implications of what they're talking about."

They have proposed a P-value of 0.005 because it corresponds to Bayes factors between approximately 14 and 26 in favor of H1 (the alternative hypothesis), indicating "substantial" to "strong" evidence, and because it would reduce the false positive rate to levels they have judged to be reasonable "in many fields".
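
For readers who want to see where a number like 14 comes from: one standard upper bound on the Bayes factor in favor of H1 (due to Sellke, Bayarri and Berger) is 1/(-e·p·ln p), which takes two lines of R to evaluate. This is a rough sanity check of the quoted range, not the paper's exact derivation:

p <- 0.005
# Sellke-Bayarri-Berger upper bound on the Bayes factor for H1 (valid for p < 1/e)
bf_bound <- 1 / (-exp(1) * p * log(p))
bf_bound   # ~13.9, matching the low end of the quoted 14-26 range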

Is this good enough? Is it a good start?

OSF project page. If you have trouble downloading the PDF, use this link.


Original Submission

Related Stories

Justify Your Alpha: A Response to "Redefine Statistical Significance"

Psychologist Daniël Lakens disagrees with a proposal to redefine statistical significance to require a 0.005 p-value, and has crowdsourced an alternative set of recommendations with 87 co-authors:

Psychologist Daniël Lakens of Eindhoven University of Technology in the Netherlands is known for speaking his mind, and after he read an article titled "Redefine Statistical Significance" on 22 July 2017, Lakens didn't pull any punches: "Very disappointed such a large group of smart people would give such horribly bad advice," he tweeted.

In the paper, posted on the preprint server PsyArXiv, 70 prominent scientists argued in favor of lowering a widely used threshold for statistical significance in experimental studies: The so-called p-value should be below 0.005 instead of the accepted 0.05, as a way to reduce the rate of false positive findings and improve the reproducibility of science. Lakens, 37, thought it was a disastrous idea. A lower α, or significance level, would require much bigger sample sizes, making many studies impossible. Besides, he says, "Why prescribe a single p-value, when science is so diverse?"

Lakens and others will soon publish their own paper to propose an alternative; it was accepted on Monday by Nature Human Behaviour, which published the original paper proposing a lower threshold in September 2017. The content won't come as a big surprise—a preprint has been up on PsyArXiv for 4 months—but the paper is unique for the way it came about: from 100 scientists around the world, from big names to Ph.D. students, and even a few nonacademics writing and editing in a Google document for 2 months.

Lakens says he wanted to make the initiative as democratic as possible: "I just allowed anyone who wanted to join and did not approach any famous scientists."


Original Submission

  • (Score: 2, Touché) by Anonymous Coward on Wednesday July 26 2017, @10:48AM (4 children)

    by Anonymous Coward on Wednesday July 26 2017, @10:48AM (#544576)

    In software development we call these things Magic Numbers. Since when has science devolved into numerology? Can't we teach statistics instead, rather than toying with the Magic Number?

    In the first place, why are you using p-values? Or is that the only statistical hammer in the bio-medical toolbox? What is your population? What is your sample size? Is it randomised? What is the distribution? How accurate is your instrumentation? Do you have a control? Is it double blind?

    Lowering the p-value threshold? What about false negatives?

    Here have a clue stick.

    • (Score: -1, Offtopic) by Anonymous Coward on Wednesday July 26 2017, @11:20AM

      by Anonymous Coward on Wednesday July 26 2017, @11:20AM (#544586)

      Here have a clue stick.

      May I have a glue one? Unlike the pee-value, I know at least what a glue stick is good for.

    • (Score: 5, Insightful) by melikamp on Wednesday July 26 2017, @05:44PM (2 children)

      by melikamp (1886) on Wednesday July 26 2017, @05:44PM (#544760) Journal

      In software development we call these things Magic Numbers.

      Indeeedy. Let's take TFA apart, shall we? Here are some things (most) authors admit are not addressed in their proposal:

      The proposal does not address multiple hypothesis testing, P-hacking, publication bias, low power, or other biases (e.g., confounding, selective reporting, measurement error), which are arguably the bigger problems.

      And more:

      Changing the significance threshold is a distraction from the real solution, which is to replace null hypothesis significance testing (and bright-line thresholds) with more focus on effect sizes and confidence intervals, treating the P-value as a continuous measure, and/or a Bayesian method.

      Emphasis courtesy of yours truly.

      I believe these are all legitimate concerns which must be voiced alongside a proposal like that. If they didn't acknowledge them, I'd be simply laughing TFA out of the room. Still, I believe the authors are turning a blind eye to a giant fallacy which is being rammed through. It's not just that the proposal will do nothing to help with publication bias; it will quite probably make it worse.

      The authors do seem to disagree on a range of issues, but they do have one thing in common: they all drink out of the bullshit fountain known as the frequentist interpretation. And it's not even that they share in the belief (at least some of the authors seem to come from the Bayesian camp), but they all seem to be OK with letting frequentists off the hook.

      Failing to reject the null hypothesis does not mean accepting the null hypothesis.

      Indeed, when one is a strong believer in frequentist dogma, it doesn't. Here's an example, for those who are not familiar with the issue at hand. Suppose we have two drugs on the shelf, X and Y. Both drugs treat the same condition, say, acne. Both drugs went through similar clinical efficacy trials, and were found about equal in effect. However, drug X, unlike drug Y, also went through another comprehensive study, which controlled for dozens of variables and yielded a confidence interval for an increase in blood pressure. So let's say this latter study produced a 99.9% confidence interval (which is better than the 99.5% threshold they are proposing) covering zero (meaning no change in blood pressure), and ended with a P-value of 0.5 and a frequentist conclusion "the data does not provide sufficient evidence to conclude that X increases blood pressure". According to the authors, this result should fail the "golden standard of significance", and I will discuss what that means shortly.

      For now, imagine yourself having high blood pressure problems. You are standing in the drug store, choosing between X and Y. Of course if one drug doesn't work, you will try the other. But are you going to choose X or Y first? And if you say X, like any normal person who understands statistics would, you just tacitly admitted that the frequentist interpretation is utter bullshit. You can't keep claiming with a straight face that the result was statistically insignificant if it more or less tied your hands with respect to your actions in the drug store. It had a huge sample size, variable control up the wazoo, and an amazing confidence level by drug standards. This should be considered as statistically significant as it gets.
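
      To make the thought experiment concrete, here is a minimal R sketch (simulated numbers, not real trial data) of what such a "non-significant" result looks like:

      set.seed(1)
      # simulated change in blood pressure for a large, well-controlled study of drug X:
      # the true effect is zero and the measurement noise has sd = 10 mmHg
      bp_change <- rnorm(5000, mean = 0, sd = 10)
      t.test(bp_change, conf.level = 0.999)
      # typical output: a 99.9% confidence interval of roughly (-0.5, 0.5) mmHg covering zero,
      # and a large p-value: "no evidence of an increase", yet the interval itself pins down
      # how small any increase could possibly be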

      So let's bring this back and see how the reality stacks up with TFA's proposal. What was the goal, anyway?

      The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on "statistically significant" findings.

      Is that what we are trying to fix? Non-reproducibility? What does the P-value even have to do with that? If we want fewer duds, that is, studies yielding either a type 1 or a type 2 error, we just need to increase sample sizes, like the authors themselves say, by roughly 70%. We can make 0.005 a gold standard of the significance level, for example. Then statisticians will have a choice: either (1) bump the sample size and produce the same number of more significant (and hence harder to knock down) null rejections, or (2) leave the sample size the same and produce fewer null rejections and more "inconclusive" studies. Either way, what they usually regard as reproducibility of a "demonstrated effect" will go up. The only "downside" of this approach is that more studies will fail to reject the null, and as we know, these kinds of studies are three times harder to publish [wikipedia.org], which brings us to the last twist of this rant.
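
      The roughly-70% figure is easy to reproduce with a standard power calculation; a quick R check using an arbitrary effect size of 0.3 SD and 80% power for a generic two-sample t-test:

      n05  <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05,  power = 0.8)$n
      n005 <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.005, power = 0.8)$n
      c(n05, n005, ratio = n005 / n05)   # the ratio comes out near 1.7, i.e. ~70% more subjects per group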

      The P-value fetish, IMHO, will backfire bigly, since the root of the problem is not the exact value of the "magic number" 0.05 (e.g., that it's too high). The root of the evil is that the frequentist sect keeps calling some null-rejecting results "statistically significant", while all others are thrown into the "inconclusive" heap, and this nomenclature, IMHO, is one of the major factors driving the publication bias.

      The "golden standard" for the P-value, if we need to set it at all, should be 1.0. Every study is statistically significant, how can it not be? What exactly prevents us from "accepting the null" besides a certain almost-religious conviction? Why can't we call an "inconclusive study" such as the XY example above "statistically significant"? We sure as hell act as if it is statistically significant when our health and our money are on the line.

      Obviously some studies are more significant than others, and we can argue till the Sun burns out on what metrics we should use to compare the relative "statistical significance", or how we can score the "statistical significance" on a continuous scale. Regardless, passing a value judgment on a study based simply on where the confidence interval has landed accomplishes nothing or worse.

      And while I do not for a moment suspect the authors of acting with an ulterior motive, I can't help but think all these professional researchers are banding together simply because they want a publication environment with higher stakes than now. It is ludicrous not to notice the relationship between regarding some studies as "significant" and others as "suggestive", both pure value judgments, and the chances of a study being published, and yet the authors do not seem to care about that. Think about what their proposal is actually likely to do:

      (1) It will become harder (more expensive) to create a "significant" result, making publication easier for big name researchers affiliated with big name places.

      (2) It will become harder (more risky) to knock down a "detected effect", thus discouraging reproduction of studies. Of course, not bothering to reproduce will likely decrease the number of discredited results, but then we could just as well never try to reproduce anything. The authors tout a 50% reduction in false positives: a curious presumption for a world where the pressure to reject will go up, the publication bias will increase, and few will ever dare to repeat a study. Again, big shots benefit by virtue of being able to churn out studies which may well be total bunk but are too expensive to duplicate, saving them from embarrassment.

      (3) Nothing at all will be done to discourage people from regarding null rejection as better than non-rejection (which should rightly be called "confirmation", if not "acceptance"). The frequentist interpretation is bunk, and it's time to stop taking it for granted or regarding it as legitimate. The new P-value threshold of 0.005, just like any other magic number besides 1.0, will simply provide more validation to the garbage philosophy.

      • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @06:03PM

        by Anonymous Coward on Wednesday July 26 2017, @06:03PM (#544768)

        The "golden standard" for the P-value, if we need to set it at all, should be 1.0.

        As someone who believes NHST is the most destructive meme science has ever encountered (far more dangerous than any type of religious objection to science), I actually love this idea.

      • (Score: 2) by cafebabe on Wednesday July 26 2017, @09:57PM

        by cafebabe (894) on Wednesday July 26 2017, @09:57PM (#544885) Journal

        I considered the possibility that they might be trolling with an absurd reduction for the purpose of getting bad practice disbanded. Now I'm concerned that established players are entrenching bad practice for their own benefit. I believe there was a Mark Twain quote of the form "There was never a gathering of professionals for the betterment of mankind."

        --
        Enjoy life. Enjoy Ainol. [wikipedia.org]
  • (Score: 3, Informative) by cafebabe on Wednesday July 26 2017, @11:27AM (11 children)

    by cafebabe (894) on Wednesday July 26 2017, @11:27AM (#544588) Journal

    There is an O(n^2) problem in many statistical studies. Automation means that an increasing number of variables can be collected and Big Data allows a retrospective trawl through more variables. This data is then exhaustively cross-correlated and then, dammit, *something* of significance is found because the mantra is "publish or perish". Having any generally agreed threshold of significance will simultaneously discourage original research and encourage meta-studies and suchlike. Raising the threshold discourages new avenues of research and increases noise.
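
    A toy R simulation of that effect (pure noise, no real study): with 20 unrelated variables there are 190 pairwise tests, and several of them will clear p < 0.05 by chance alone.

    set.seed(42)
    d <- matrix(rnorm(100 * 20), nrow = 100)   # 100 subjects, 20 variables of pure noise
    # p-values for all choose(20, 2) = 190 pairwise correlation tests
    pvals <- combn(20, 2, function(ij) cor.test(d[, ij[1]], d[, ij[2]])$p.value)
    sum(pvals < 0.05)    # around 190 * 0.05, i.e. 9-10 "significant" correlations expected
    sum(pvals < 0.005)   # the stricter threshold still lets roughly one through on average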

    --
    Enjoy life. Enjoy Ainol. [wikipedia.org]
    • (Score: 3, Insightful) by Virindi on Wednesday July 26 2017, @11:44AM (5 children)

      by Virindi (3484) on Wednesday July 26 2017, @11:44AM (#544598)

      And worse, working with preexisting data makes it very tempting to fit your hypothesis to the data. Same with the general disdain for "our hypothesis was disproved" papers.

      That's the real problem that needs to be addressed and changing p-values does not directly address it. The probability of SOME pattern appearing in random noise is high, and people are picking their theory to fit that pattern. Then they are using statistical methods based on a "formulate hypothesis"->"gather data"->"check against hypothesis" model. Big Data is the worst for this, it seems.

      • (Score: 2) by FakeBeldin on Wednesday July 26 2017, @12:26PM (3 children)

        by FakeBeldin (3360) on Wednesday July 26 2017, @12:26PM (#544605) Journal

        The model you quote seems apt.
        My worry is that nowadays it more often seems to be:

        gather data -> formulate hypothesis -> investigate data -> adapt hypothesis to investigation -> check hypothesis -> publish

        Validation sets are too often used to formulate the hypothesis.
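
        A minimal R sketch of that failure mode (all noise, hypothetical variables): pick the "best" predictor from the data you already have and its p-value looks publishable; test the same variable on fresh data and the effect evaporates.

        set.seed(7)
        n <- 100; k <- 50
        X <- matrix(rnorm(n * k), n, k)   # 50 candidate predictors, all pure noise
        y <- rnorm(n)                     # outcome, also pure noise
        p_all <- apply(X, 2, function(x) cor.test(x, y)$p.value)
        best  <- which.min(p_all)         # "formulate the hypothesis from the data"
        p_all[best]                       # usually < 0.05 despite there being no real effect
        # honest check on new data drawn from the same (null) population
        X2 <- matrix(rnorm(n * k), n, k); y2 <- rnorm(n)
        cor.test(X2[, best], y2)$p.value  # typically unremarkable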

        • (Score: 1) by Virindi on Wednesday July 26 2017, @02:19PM

          by Virindi (3484) on Wednesday July 26 2017, @02:19PM (#544644)

          Yep that's what I was saying :)

          It's lazy mode.

          Then of course there is the whole other category of "models which we can't properly test so we just rely on care and the authors being at a good institution", which is a similar problem.

        • (Score: 2) by cafebabe on Wednesday July 26 2017, @02:41PM (1 child)

          by cafebabe (894) on Wednesday July 26 2017, @02:41PM (#544653) Journal

          It would be an improvement if multiple theories were proposed and theories which didn't fit were discarded. This may appear less honed but tweaking a hypothesis prior to publication is akin to one of Rudyard Kipling's Just So Stories. Science should have predictive power and be falsifiable. If there is nothing to predict and nothing to falsify then it isn't science.

          --
          Enjoy life. Enjoy Ainol. [wikipedia.org]
          • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @03:58PM

            by Anonymous Coward on Wednesday July 26 2017, @03:58PM (#544694)

            It would be an improvement if multiple theories were proposed and theories which didn't fit were discarded.

            Improvement? Without that you have no science.

      • (Score: 2) by maxwell demon on Wednesday July 26 2017, @07:07PM

        by maxwell demon (1608) Subscriber Badge on Wednesday July 26 2017, @07:07PM (#544804) Journal

        And worse, working with preexisting data makes it very tempting to fit your hypothesis to the data. Same with the general disdain for "our hypothesis was disproved" papers.

        On the other hand, you do want some means against "we invent a wild hypothesis just in order to promptly disprove it". You don't want articles like:

        Watching Doctor Who does not cause broken legs

        Are you more likely to break your leg if you regularly watch Doctor Who? Comparing the number of fractures among watchers of Doctor Who versus watchers of Star Trek or Babylon 5 showed no correlation. The comparison between Star Trek and Babylon 5 watchers is inconclusive; more research is required.

        --
        The Tao of math: The numbers you can count are not the real numbers.
    • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @02:26PM (3 children)

      by Anonymous Coward on Wednesday July 26 2017, @02:26PM (#544647)

      Not really, increasing the amount of data that you're crunching doesn't guarantee better results if it's crap data to begin with. For example, it doesn't really matter how many shoe sizes you've collected if you're trying to determine the kind of paintings somebody likes. The two things are effectively completely dissimilar and as such, you're not going to get a meaningful result. It gets even worse when you start combining more and more things.

      The data sciences are getting to be a cargo cult where companies keep collecting more and more data hoping to figure out what to do with it, but not paying attention to other issues like contamination.

      Raising the threshold reduces the noise because it means you need a stronger correlation before something is reported on. Yes, it does somewhat discourage new research, but let's be honest about the way that people have used new research to justify all sorts of things, only to find out that it was a fluke or a mistake. You can still do new research; the problem is that, like replication experiments, it's not sexy, so it can be a challenge to get funding for it, even though it's a terribly important part of the process.

      The other thing this does is somewhat slow the speed of advancement, as we need to be more sure than with the current recommended value. But, let's be honest, for the most part we're at a point where we can afford to slow research down in order to get results that are an order of magnitude more reliable. What we can't particularly afford is a bunch of unreliable science that we're not even sure is right.

      • (Score: 2) by cafebabe on Wednesday July 26 2017, @03:31PM (2 children)

        by cafebabe (894) on Wednesday July 26 2017, @03:31PM (#544675) Journal

        The two things are effectively completely dissimilar and as such, you're not going to get a meaningful result. It gets even worse when you start combining more and more things.

        Someone may have to correct my figures but, as I understand, accuracy is proportional to the square root of the number of samples. So, doubling sample quality requires quadrupling the number of samples. (Workload increases by a factor of four to gain one additional bit of accuracy.) To improve accuracy by a factor of 10 requires more than three quadruplings of sample data. With far less effort (and cost), it is easier to collect more variables. Cross-correlation may be completely random but opportunities to find a pattern are O(n^2). If any correlation meets an arbitrary standard then it is a positive result to publish even if it cannot be replicated.
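
        A small R sketch (arbitrary numbers) of the scaling being described: the spread of a sample mean goes as 1/sqrt(n), so quadrupling n only halves it.

        set.seed(3)
        # empirical standard error of a sample mean at different n
        se <- function(n) sd(replicate(2000, mean(rnorm(n))))
        round(c(se(100), se(400), se(1600)), 4)   # roughly 0.10, 0.05, 0.025: each 4x in n halves the SE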

        --
        Enjoy life. Enjoy Ainol. [wikipedia.org]
        • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @03:40PM

          by Anonymous Coward on Wednesday July 26 2017, @03:40PM (#544681)

          You're talking about precision, not accuracy.

          If your input data are, for some reason, skewed to give a misleading result, then a larger data set will not improve accuracy. You will, with greater precision, zero in on your skewed answer.

        • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @05:43PM

          by Anonymous Coward on Wednesday July 26 2017, @05:43PM (#544758)

          If you have biased sampling, data that's not applicable, or just plain weird data, adding more won't help.

          You have to have a decent model and decent data to have any hope of making a meaningful conclusion. The Stanford Prison Experiment never replicated because they randomly found more psychopaths than normal. The study was fine, but adding more data points would only help if they weren't selecting from a population with an abnormal number of psychopaths. Otherwise they'd get the same results with more decimal places.

    • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @07:25AM

      by Anonymous Coward on Thursday July 27 2017, @07:25AM (#545035)

      One thing you're ignoring: the birthday paradox completely disappears once you attach a date to the matter, which is what the research would culminate in. And a low p-value makes it increasingly apparent when less-than-ethical researchers are just retrofitting science to data. A lower p threshold not only makes it more likely that the produced data is valid, but makes it easier to weed out bad apples who are not abiding by the scientific process.

      Taken to extremes, if you had a p threshold of 0.000001 or whatever, then it would be almost a certainty that whatever's published is completely ubiquitous, OR that the individual publishing said data went from data to hypothesis rather than vice versa. You're completely right that this discourages 'original research', but what that translates to in reality is that it discourages people from seeking to confirm their biases with correlations. This is an enormous problem in the social sciences. To a lesser degree it's even a problem in genetic research (and medicine), which are still stuck in the realm of correlations that fail with a healthy degree of regularity. Reducing the viability of this sort of research is a good thing.

  • (Score: 2) by RamiK on Wednesday July 26 2017, @11:38AM (3 children)

    by RamiK (1813) on Wednesday July 26 2017, @11:38AM (#544594)

    If anything, it will only make things worse, as seeking cures instead of treatments will become even harder to financially justify. Moreover, it will shift the risk-profit equation for hiding negative findings and inflating positive results in the wrong direction: Already, failing an experiment at human trials means over $20 million down the drain. But by making things more expensive by requiring 10 times the sample size, the risks for lying will stay the same while the potential losses would only increase.

    Overall, the only reasonable solutions I've heard so far involve requiring the disclosure of research funding and the results of all failed experiments regardless of NDAs.

    --
    compiling...
    • (Score: 1, Informative) by Anonymous Coward on Wednesday July 26 2017, @02:32PM

      by Anonymous Coward on Wednesday July 26 2017, @02:32PM (#544650)

      I don't see a problem with that at all. Vioxx alone was approved and wound up costing the company that produced it billions of dollars because of the various lawsuits. Same goes for Johnson & Johnson's talcum powder products.

      Companies somehow find the money for the research, so we might as well make them improve their standards to the point where they work. Ultimately, calling it a failure when a trial produces a different result than you expected is ignorant. That's not a failure, that's science functioning as it's supposed to. Failure is taking that data and repackaging it for use as a new paper rather than as a new hypothesis to make predictions based on and test.

      Medical research is crap research for other reasons. Changing from 5% to 0.5% isn't going to make it that much harder to conduct studies. Medical research is going to be crap because there are few longitudinal studies that look at the effects of drugs and treatments over the long term and it's considered unethical to withhold treatments that are believed effective. The result is that our understanding of whether or not treatments work and if they're even safe is extremely limited.

    • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @07:41AM (1 child)

      by Anonymous Coward on Thursday July 27 2017, @07:41AM (#545044)

      THIS IS THE WHOLE POINT. By the time you get to human testing, if you're not extremely confident you can hit a 99.5% threshold, then you shouldn't be testing that product.

      I'm fine with it encouraging falsification. The nice thing is that the tighter the threshold for significance becomes, the more evident any falsification becomes, which can result in actions against companies. E.g. imagine the threshold for significance was 0.000000001. It's reasonably safe to say that anything that hits that threshold is either genuine, or the 'scientists' behind it faked their data. The current 95% leaves a lot of room for plausible deniability.

      In my opinion professional medical research companies are not a great thing. Human longevity has been largely improved by relatively simple things like better hygienic habits and food cleanliness, and then some medicines that could hit pretty much any threshold of significance, like penicillin, anesthetics, and certain high-reliability, low-side-effect vaccines like smallpox. And then there's perhaps the biggest reason for the increase in longevity - peace. War kills people not just in war, but in the disruption of civil order, with things like food production and distribution. We live in an era of unprecedented peace, relative to the past.

      • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @08:12AM

        by Anonymous Coward on Thursday July 27 2017, @08:12AM (#545064)

        E.g. imagine the threshold for significance was 0.000000001. It's reasonably safe to say that anything that hits that threshold is either genuine, or the 'scientists' behind it faked their data.

        The problem isn't whether they detected something "genuine"; it is whether they are detecting something anyone should care about. For example you can download a database of p-values from here: https://github.com/jtleek/tidypvals [github.com] then sort the database by pvalue:

        require(tidypvals)                 # Jeff Leek's package of published p-values (github.com/jtleek/tidypvals)

        allp = allp[order(allp$pvalue),]   # sort the data frame of p-values, smallest first
        allp = allp[allp$pvalue > 0,]      # drop zero/censored entries

        head(allp, 20)                     # the 20 smallest reported p-values

        Here is the top hit; it is a totally meaningless p-value because it "detects" that if you plug different info into different equations they won't give you exactly the same answer:

        Table 4 shows the McFadden’s and McKelvey and Zavoina pseudo-r2 values for the empty and full models with and without distance to TSS for suggestively and significantly trait-associated SNPs. The logistic regression model without the distance to TSS for the significantly associated SNPs explained 11-25% of the observed variance, which was an increase of 4-11% when compared to the empty model, which only included the effects of the genotyping arrays. An ANOVA test, using a chi-squared test, showed the difference between the two models to be significant (Deviance = 1501.00, P-value = 3.13 × 10^-309).

        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3600003/ [nih.gov]

        BTW, that p-value (3.13e-309) is much smaller than your 1e-9 (~12k p-values smaller than that are in the database). The vast majority of p-values are meaningless like this; there is no reason to care at all what their value is.

  • (Score: 2) by FakeBeldin on Wednesday July 26 2017, @12:31PM (3 children)

    by FakeBeldin (3360) on Wednesday July 26 2017, @12:31PM (#544606) Journal

    (There Ain't No Such Thing As A Free Lunch.)

    If it was as cheap and easy to get p < 0.005 as it is to get p < 0.05, researchers would be doing it already.
    Getting this level of certainty requires more of something. More tests, more subjects, more time... more.
    That means that smaller, simpler studies will become unpublishable.

    How about applying what the first AC in this post basically remarked:
    - use confidence metrics appropriate to the study,
    - judge significance as appropriate for the study and the confidence metrics used.

    • (Score: 0, Disagree) by Anonymous Coward on Wednesday July 26 2017, @03:07PM (1 child)

      by Anonymous Coward on Wednesday July 26 2017, @03:07PM (#544665)

      Smaller, simpler studies should be unpublishable. By definition, smaller and simpler = less accurate.

      • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @12:05PM

        by Anonymous Coward on Thursday July 27 2017, @12:05PM (#545115)

        I'm going to make your mod rating a little bit smaller.

    • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @07:55PM

      by Anonymous Coward on Wednesday July 26 2017, @07:55PM (#544828)

      As I understand p-values, 0.05 means a 5% chance that you are publishing rubbish.

      So if we make the assumption that only 1% of hypotheses are actually correct, and that only positive results are published (the null hypothesis doesn't get published), this means that only 1 out of 6 studies we see published is actually a real result.

      This is a bigger issue than small studies being pushed out.
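
      That back-of-the-envelope figure checks out under its (strong) implicit assumptions of 100% power and no other bias; a quick R version, also showing what the proposed 0.005 threshold would do:

      ppv <- function(alpha, prior = 0.01, power = 1) {
        true_pos  <- prior * power          # true hypotheses that yield a positive result
        false_pos <- (1 - prior) * alpha    # false hypotheses that slip through anyway
        true_pos / (true_pos + false_pos)   # share of positive results that are real
      }
      ppv(0.05)    # ~0.17, i.e. roughly 1 in 6
      ppv(0.005)   # ~0.67, roughly 2 in 3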

  • (Score: 2) by looorg on Wednesday July 26 2017, @01:27PM (2 children)

    by looorg (578) on Wednesday July 26 2017, @01:27PM (#544617)

    If this gets accepted as the new standard, I'm afraid that the new methodology will just include a lot more data-massaging or p-value hacking: gathering, (re-)defining and manipulating your data to reach the golden standard of a "proper" result at a p-value of 0.005. After all, it will be cheaper than doing actual proper studies. I gather they will just run pre-studies to find out the limits and distribution of the sample, and then for the real thing they'll just stuff as many data points as possible into the "good" limits/range as they possibly can; all other data points will be discarded as irrelevant to the study and not counted in the total number of observations. This will naturally be different from field to field. It might make more sense for drug and medical research than for other fields that use the same p-standard.
    I do wonder if this will have any kind of impact on the pamphlet they include in the little box of medicine. The side-effects part is some really scary reading; sure, I'd like some massive headaches, potential strokes and a high risk of anal leakage with my pills. Wait, wasn't this what the pill was supposed to cure?

    Then there is the question of what you do with all the old "bad research": if a p of 0.005 is the new standard, will it invalidate all previous research at p < 0.05? Invalidate in the sense of: will it be usable, trusted and quoted? Nobody is going to redo all of it to find out. The data for older research might not even exist anymore in any relevant form. If anything, I wish they would just take the opportunity to move away from this kind of thinking or p-validation altogether, where things are written up and then just concluded with a p < 0.0x, and because of that it must all be significant, true and great.

    • (Score: 1) by Virindi on Wednesday July 26 2017, @02:22PM (1 child)

      by Virindi (3484) on Wednesday July 26 2017, @02:22PM (#544646)

      If only the full raw data was included with every paper published...

      Or at least archived somewhere where anyone can get it. But no, raw data has to be locked away and forgotten. We only care about results!

      • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @08:35AM

        by Anonymous Coward on Thursday July 27 2017, @08:35AM (#545072)

        If you're seeking your own funding, there's no time or reward for looking at someone else's data. Ditto for checking reproducibility. You need NOVEL SHINY WHIZZBANG to get grants. The university does not care about your results or your science. Bring in the grant money, because the Dean of Science and the Associate Dean of Ethics in Science and the Sub-Dean of Scientific Regulation and the Associate Principal of Science and Society need your 54% overheads to pay for their offices and staff.

  • (Score: 2) by moondrake on Wednesday July 26 2017, @02:27PM (5 children)

    by moondrake (2658) on Wednesday July 26 2017, @02:27PM (#544648)

    I have only skimmed over the paper so far, but it seems completely unworkable to me. I was surprised to see so many biologists on the author list. Any biologist with a decent understanding of statistics understands that it is simply not feasible to reach that level of significance for many real effects. I also think that statistics sometimes makes too many idealistic assumptions about biological, i.e. real, populations and our ability to take samples from them.

    Let me give a simple example of the kind of problems we are already dealing with (with the very lenient p < 0.05): growth rate is often dependent on some metabolism. You can measure the speed of this metabolism by various methods. Now we give an inhibitor of this metabolism, but at such low concentrations that although the effect was 20%, the inhibition is too small compared to the natural variation between individuals to be significant. Yet, after 30 days of exponential growth with a non-significant difference in metabolism, the treated individuals are significantly different in size, by 200%.

    But wait, you say: let's just measure more individuals to decrease our standard error. But unfortunately, for many real experiments, doing it with that many samples would mean introducing all kinds of problems (it cannot be done with the same material, on the same day, by the same person, or with the same organisms). You can work around some of these, but not all (the numbers needed are staggering). Instead, you try to do different experiments, all pointing to the same thing, and discuss that it is likely that an effect exists, even though your p is 0.1. Well, screw that.

    And sometimes, where it is possible to use a ridiculous amount of data, you end up with all kinds of things that are significantly different (a 0.0000001% difference can be significant, given enough data), without actually being relevant.
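
    That last point is easy to demonstrate; a tiny R sketch (made-up effect size) of a negligible difference becoming "significant" once n is large enough:

    # a 0.001-SD difference is scientifically negligible, but the two-sample
    # z-statistic sqrt(n/2)*d grows with n, so the p-value eventually collapses
    d <- 0.001
    for (n in c(1e4, 1e6, 1e8)) {
      z <- sqrt(n / 2) * d
      cat(sprintf("n = %.0e per group: p = %.3g\n", n, 2 * pnorm(-z)))
    }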

    I see more benefit in making people understand what a p-value means, or in talking about likelihood ratios, than in forcing the field into a definition of significance where nothing of interest is significant anymore.

    • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @04:01PM (4 children)

      by Anonymous Coward on Wednesday July 26 2017, @04:01PM (#544696)

      It's bizarre that you seem to understand that "significance" is a function of sample size, but still think there is some point to determining it. All it measures is how much effort you are willing/able to put forth to collect data in support of your idea (i.e. it measures the strength of prior belief about whether an "effect" is positive/negative). The entire thing is pointless.

      • (Score: 2) by moondrake on Wednesday July 26 2017, @07:07PM (3 children)

        by moondrake (2658) on Wednesday July 26 2017, @07:07PM (#544803)

        Is it bizarre that I think that? Is it not exactly what the people in this paper are proposing?

        I agree that there is no point to it, but you won't be a very successful scientist with that attitude.

        • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @10:49PM (2 children)

          by Anonymous Coward on Wednesday July 26 2017, @10:49PM (#544906)

          Yes, it is bizarre that the authors think this as well. In reading their paper, though, it sounds like many of them actually don't want NHST around either. They figure maybe this will cut down on BS somehow by making it slightly harder for those who don't know what they are doing.

          Also, I quit medical research literally for this reason. It was too depressing and pointless. They said it was too complicated, but if you do a good job (come up with a "real" model for what is going on and test that), they just ask if the group comparison was statistically significant anyway. It is a waste of your career to do medical research right now.

          • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @08:43AM (1 child)

            by Anonymous Coward on Thursday July 27 2017, @08:43AM (#545076)

            You're not wrong.

            One disturbing thing I am noticing is tons of shitty shitty studies that weakly confirm older studies but don't cite them. Instead they cite 40 articles from 2015 and later written by their fellow countrymen. It feels like a "yellow" washing of science. The number of publications and citations swamps the literature and pads resumes. Publication count was always slightly dodgy, but citation count used to be slightly reliable. Now citation counts are becoming garbage - because this new generation of "patriotic" scientists cite their own countrymen at 10x the rate of others.

            There's barely anything worth reading and yet more and more of it being published.

            • (Score: 0) by Anonymous Coward on Thursday July 27 2017, @03:55PM

              by Anonymous Coward on Thursday July 27 2017, @03:55PM (#545231)

              Meh, I've seen the same BS from all cultures. Some are just more sophisticated about producing the junk than others due to steps they memorized in school.

  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @03:41PM

    by Anonymous Coward on Wednesday July 26 2017, @03:41PM (#544683)

    Yes but what is the P value of the pool?

  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @04:14PM

    by Anonymous Coward on Wednesday July 26 2017, @04:14PM (#544701)

    They still don't understand what a p-value means. All that is going on here is the vague collective opinion that only a certain percent of experiments should be "successful", so that it seems like something productive has been done.

    1) More money is going towards research X
    2) Researchers in X field can collect more data
    3) Expected value of p is lowered (p is determined by sample size)
    4) Alpha (the "significance" cutoff) lowered to match new expected value of p

    Fields with more data have lower alpha (eg particle physics alpha = 3e-7), fields with less data have higher alpha (eg rare disease they use 0.1). This is adjusted over time (with some lag) so that the "right amount" of studies get significance.

  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @04:47PM

    by Anonymous Coward on Wednesday July 26 2017, @04:47PM (#544729)
  • (Score: 0) by Anonymous Coward on Wednesday July 26 2017, @04:48PM

    by Anonymous Coward on Wednesday July 26 2017, @04:48PM (#544732)

    OSF project page. If you have trouble downloading the PDF, use this link.

    When I went to the OSF page I got a warning:

    You are about to log in to the site "sentry.cos.io" with the username "14a8f28b817b4c21bb535ff68c7b5828", but the website does not require authentication. This may be an attempt to trick you.

    Is "sentry.cos.io" the site you want to visit?

    Yes, this is due to too much javascript on that page. Actually I am not a fan of the design of this site at all; it looks like generic corporate design: https://osf.io/ [osf.io]
