Psychologist Daniël Lakens disagrees with a proposal to redefine statistical significance to require a 0.005 p-value, and has crowdsourced an alternative set of recommendations with 87 co-authors:
Psychologist Daniël Lakens of Eindhoven University of Technology in the Netherlands is known for speaking his mind, and after he read an article titled "Redefine Statistical Significance" on 22 July 2017, Lakens didn't pull any punches: "Very disappointed such a large group of smart people would give such horribly bad advice," he tweeted.
In the paper, posted on the preprint server PsyArXiv, 70 prominent scientists argued in favor of lowering a widely used threshold for statistical significance in experimental studies: The so-called p-value should be below 0.005 instead of the accepted 0.05, as a way to reduce the rate of false positive findings and improve the reproducibility of science. Lakens, 37, thought it was a disastrous idea. A lower α, or significance level, would require much bigger sample sizes, making many studies impossible. Besides, he says, "Why prescribe a single p-value, when science is so diverse?"
Lakens and others will soon publish their own paper to propose an alternative; it was accepted on Monday by Nature Human Behaviour, which published the original paper proposing a lower threshold in September 2017. The content won't come as a big surprise—a preprint has been up on PsyArXiv for 4 months—but the paper is unique for the way it came about: from 100 scientists around the world, from big names to Ph.D. students, and even a few nonacademics writing and editing in a Google document for 2 months.
Lakens says he wanted to make the initiative as democratic as possible: "I just allowed anyone who wanted to join and did not approach any famous scientists."
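The sample-size objection in the quoted article (a lower α requires bigger samples) can be put into rough numbers. The sketch below is only an illustration under stated assumptions: a two-sided, two-sample comparison, the normal approximation to the power calculation, 80% power, and a standardized effect size of 0.5. None of those values come from the article, and the function name `n_per_group` is made up for this example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, power, effect_size):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    Uses the textbook normal approximation
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    where d is the standardized effect size (an assumed input).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed illustration values: 80% power, effect size d = 0.5.
n_05 = n_per_group(alpha=0.05, power=0.80, effect_size=0.5)
n_005 = n_per_group(alpha=0.005, power=0.80, effect_size=0.5)

print("per-group n at alpha = 0.05: ", n_05)
print("per-group n at alpha = 0.005:", n_005)
print("ratio:", round(n_005 / n_05, 2))
```

Under this approximation the ratio does not depend on the assumed effect size, only on α and the chosen power, so the relative cost of the stricter threshold is the same for small and large effects; the absolute numbers are what change.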
Statistician Valen Johnson and 71 other researchers have proposed a redefinition of statistical significance in order to cut down on irreproducible results, especially those in the biomedical sciences. They propose "to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005" in a preprint article that will be published in an upcoming issue of Nature Human Behaviour:
A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005.
Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.
"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article [DOI: 10.17605/OSF.IO/MKY9J] on PsyArXiv and is slated for an upcoming issue of Nature Human Behaviour. "It seemed like this was something that was doable and easy, and had worked in other fields."
But other scientists reject the idea of any absolute threshold for significance. And some biomedical researchers worry the approach could needlessly drive up the costs of drug trials. "I can't be very enthusiastic about it," says biostatistician Stephen Senn of the Luxembourg Institute of Health in Strassen. "I don't think they've really worked out the practical implications of what they're talking about."
They have proposed a P-value of 0.005 because it corresponds to Bayes factors between approximately 14 and 26 in favor of H1 (the alternative hypothesis), indicating "substantial" to "strong" evidence, and because it would reduce the false positive rate to levels they have judged to be reasonable "in many fields".
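To see why a stricter threshold lowers the false positive rate among "significant" findings, here is a minimal sketch of the standard Bayes-rule calculation. The prior odds (1:10 that a tested effect is real) and the 80% power are assumptions chosen for illustration, not figures taken from the summary above, and `false_positive_risk` is a hypothetical helper name.

```python
def false_positive_risk(alpha, power, prior_odds_h1):
    """Share of significant results that are false positives.

    alpha:         significance threshold (chance of a significant result when H0 is true)
    power:         chance of a significant result when H1 is true
    prior_odds_h1: assumed prior odds of H1 versus H0 (0.1 means 1:10)
    """
    p_h1 = prior_odds_h1 / (1 + prior_odds_h1)  # prior probability the effect is real
    p_h0 = 1 - p_h1                             # prior probability of no effect
    false_pos = alpha * p_h0                    # significant, but H0 is true
    true_pos = power * p_h1                     # significant, and H1 is true
    return false_pos / (false_pos + true_pos)


# Assumed illustration values: prior odds 1:10, power 80%.
for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha, power=0.80, prior_odds_h1=0.1)
    print(f"alpha = {alpha}: false positive risk = {risk:.1%}")
```

The result is sensitive to the assumed prior odds, which differ between fields; that sensitivity is one reason a single threshold for all of science is contested.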
Is this good enough? Is it a good start?
OSF project page. If you have trouble downloading the PDF, use this link.
(Score: 1, Informative) by Anonymous Coward on Sunday January 28, @09:08PM (3 children)
Stats are a super delicate beast. But if you had the quantitative mind to work proper stats, you wouldn't be doing psych/social science in the first place, would you?
(Score: 2) by MichaelDavidCrawford on Monday January 29, @12:05AM (2 children)
Without a doubt psych and social will one day be quantitative
Consider the origin of chemistry
(Score: 0) by Anonymous Coward on Monday January 29, @12:23AM (1 child)
Origin of chemistry is cooking. What's your point?
(Score: 2) by MichaelDavidCrawford on Monday January 29, @01:21AM
(Score: 3, Interesting) by Anonymous Coward on Sunday January 28, @09:57PM
I've seen them all, PhD students, Post-Docs, P.I.s and Professors failing at statistics... When asked about it you often get the answer "but researcher x is also doing it like that". The same here... the p-value isn't that important. Statistics is a tool, it is a process to analyse your data, to get to know it, see where the errors are in your data and see if you could apply methods to get around those flaws. The statistical tests are done during that analysis, not a result of it. They, at first, have to convince YOU as a scientist to see if the collected data is usable and trust them enough to support your hypothesis. The calculated p-value gives YOU a clue about how much to trust that data, not as a cut-off point for accepting it or not.
(Score: 3, Funny) by Anonymous Coward on Sunday January 28, @10:02PM (1 child)
Are 100 scientists from around the world a large enough sample size?
(Score: 5, Informative) by AthanasiusKircher on Sunday January 28, @10:33PM
From page 15 of the preprint:
This all sounds eminently reasonable. Focus on a single statistical parameter is never a good thing to determine whether a result is meaningful. People can p-hack at 0.005 just as they have at 0.05. If you think that's harder, you haven't thought about how easy it is in psychology when you're measuring a bunch of parameters and now just have to find a few more interesting ways to create combinations of data points that are hackable. I've seen plenty of studies which have claimed p thresholds of 0.005 or 0.001 or even more, but it's clear upon closer examination that they got those results through a combination of stuff like p-hacking, bad data collection, bad interpretation, biasing (either conscious or unconscious) the experimental design or calculation of results, etc.
So, focusing on a variety of stats that may or may not have particular relevance to a particular situation is good. Even better is the call for review of statistical standards BEFORE data collection. If you set your thresholds for significance or whatever, outline exactly how you plan to collect and manipulate the data, etc. IN ADVANCE, it's a lot harder (short of outright fabrication of data) to "massage" things to find something of supposed "significance." (And please note that a lot of this is likely done unintentionally: people just don't have an intuitive sense of how different stats or types of analysis may suddenly alter the thresholds or ease with which they can appear to have an interesting result.)
Sure, it's hard to set up these sorts of things for thorough statistical review in advance for an exploratory study where you're not quite sure what you may find. But in that case, you can be honest about how vague the results are -- and then follow-up studies with more rigor can be designed if some preliminary finding seems worthwhile. The point is being transparent about how the statistical standards were created for a particular study and how they were then applied to the data and interpreted.
(Score: 2) by opinionated_science on Sunday January 28, @11:34PM (3 children)
When used properly, statistics are a beautiful tool to observe the world around us.
But you must factor in the sample size, and the balance of probabilities that data is correct.
If you don't know intimately what Bayes' theorem or the central limit theorem describes, quit trying to comment now - that is the ground floor in analysis...
(Score: 2) by deadstick on Monday January 29, @12:12AM
Statistics is like dynamite. Use it properly, and you can move mountains. Use it improperly, and the mountain will come down on you.
(Score: 3, Interesting) by FatPhil on Monday January 29, @12:27AM
(Score: 3, Interesting) by requerdanos on Monday January 29, @12:29AM
While based on solid information, this isn't necessarily good advice.
There are disciplines involved here other than probability theory, even though probability theory is at the root of what's going on.
Specifically:
Reporters (from "credentialed journalist" all the way to "dude I have this great science blog") form a group that needs desperately to understand how to read a scientific study and interface with, be the recipients of, the information calculated through probability theory.
Random idiots (from "I am pretty smart, and I like to weigh information carefully" all the way to "wow I better forward this clickbait to everyone just in case it's true") form another group, a further step removed, that need to learn how to call bullshit on the Reporters who say "New Study: Green Jelly Beans Linked To Acne [explainxkcd.com]" instead of blindly parroting what they say*.
Also involved are everyone else (all over the spectrum) who might be affected, which includes just about everyone.
I want to hear quality comments** from people who represent them, and from people who have useful advice for them.
----------
*Which has resulted in large numbers of people believing headlines, in sequence, of "New Study: Coffee Bad For You," "New Study: Coffee Not Bad For You After All," "New Study: Coffee Bad For You," "New Study: Coffee Not Bad For You After All," "New Study: Coffee Bad For You," "New Study: Coffee Not Bad For You After All," "New Study: Coffee Bad For You," "New Study: Coffee Not Bad For You After All." Despite it being impossible for two opposite oversimplifications to be true in the same universe.
** No. If you have to ask, then that would not be a quality comment. Thank you.
(Score: 2) by Entropy on Monday January 29, @01:27AM
So someone couldn't find a statistically significant relationship between something that they really wanted to, and they try to re-define what statistically significant is instead of just accepting what they found. That's how I read that load of crap, anyway.