I’ve been pointing out for years now that scientists simply don’t have the statistical mastery required to back up the statistics-based results that they’ve been publishing:
The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01 — usually interpreted as ‘very significant’. Publication in a high-impact journal seemed within Motyl’s grasp.
But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the P value came out as 0.59 — not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl’s dreams of youthful fame.
It turned out that the problem was not in the data or in Motyl’s analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. “P values are not doing their job, because they can’t,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.
For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.
Of course, if one considers the nonsensical hypotheses many of these scientists are attempting to statistically test, it is abundantly clear that they also lack a sufficient mastery of basic logic.
And notice that it is an economist who is a critic of the unreliable methods being used by the scientists. That’s not a coincidence. Economists are some of the biggest skeptics in the academic world, mostly because they see their models failing almost as soon as they are constructed. The fact is that the scientists quite literally have no idea what they’re talking about:
For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher’s bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.
But while the rivals feuded — Neyman called some of Fisher’s work mathematically “worse than useless”; Fisher called Neyman’s approach “childish” and “horrifying [for] intellectual freedom in the west” — other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher’s easy-to-calculate P value into Neyman and Pearson’s reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as ‘statistically significant’, for example. “The P value was never meant to be used the way it’s used today,” says Goodman.
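To see how a P value can swing from 0.01 to 0.59 without anything being wrong with the data or the analysis, here is a minimal simulation sketch. It is not Motyl's data or method: the effect size, group size, and number of replications are made-up assumptions, the function names are hypothetical, and the P value comes from a simple large-sample z-test on the difference in means. It reruns the same hypothetical experiment many times and collects the P value from each run.

```python
import math
import random
import statistics


def two_sided_p(sample_a, sample_b):
    """Two-sided P value from a large-sample z-test on the difference in means."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(
        statistics.variance(sample_a) / n_a + statistics.variance(sample_b) / n_b
    )
    z = (statistics.fmean(sample_a) - statistics.fmean(sample_b)) / se
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def replicate_p_values(effect, n_per_group, n_replications, seed=0):
    """Rerun the same two-group experiment and return the P value of each run.

    `effect` is the assumed true difference in means, in standard deviations.
    All parameter values used below are illustrative assumptions.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_replications):
        group_a = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        p_values.append(two_sided_p(group_a, group_b))
    return p_values


if __name__ == "__main__":
    # Assumed scenario: a modest real effect and 50 subjects per group.
    ps = replicate_p_values(effect=0.3, n_per_group=50, n_replications=1000)
    significant = sum(p < 0.05 for p in ps) / len(ps)
    print(f"smallest P value: {min(ps):.4f}")
    print(f"largest P value:  {max(ps):.4f}")
    print(f"share below 0.05: {significant:.2%}")

    # Same experiment with no real effect at all.
    null_ps = replicate_p_values(effect=0.0, n_per_group=50, n_replications=1000)
    false_alarms = sum(p < 0.05 for p in null_ps) / len(null_ps)
    print(f"share below 0.05 with no effect: {false_alarms:.2%}")
```

Under these assumptions, identical experiments on the same real effect produce P values scattered across most of the range from 0 to 1, and only a minority of runs clear the 0.05 bar. With no effect at all, about one run in twenty still clears it, which is what the 0.05 threshold means by construction. A single P value on its own therefore says little about what a replication will show.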