Scientists don’t understand statistics

Which is why it’s good that hundreds of them are signing on to an effort to abandon the concept of “statistical significance”:

Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us…. The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature.
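For the curious, the “our calculation” P values quoted above can be reproduced, at least approximately, by back-calculating the standard error of the log risk ratio from each reported 95% confidence interval and applying a normal approximation. The sketch below is mine, not the authors’ code: the function name p_from_rr_ci is made up, scipy is an assumed dependency, and the normal-approximation method is an assumption about how the arithmetic was done.

```python
# Sketch (assumption: normal approximation on the log scale) of recovering an
# approximate two-sided P value from a risk ratio and its 95% confidence interval.
import math
from scipy.stats import norm

def p_from_rr_ci(rr, lo, hi):
    """Approximate two-sided P value for H0: RR = 1, given a 95% CI (lo, hi)."""
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE of log(RR) implied by the CI
    z = math.log(rr) / se                            # test statistic against RR = 1
    return 2 * norm.sf(abs(z))

print(p_from_rr_ci(1.2, 0.97, 1.48))  # ~0.09   (the "non-significant" study)
print(p_from_rr_ci(1.2, 1.09, 1.33))  # ~0.0003 (the earlier "significant" study)
```

Run as written, this gives roughly 0.09 and 0.0003, matching the quoted values, and it makes the point concrete: both studies observed the same risk ratio of 1.2; only the width of the interval, and hence the P value, differs.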

It’s important to remember that most scientists have no more training in statistics than any other college graduate. And even if they did sit through an extra class or two devoted to the subject, that doesn’t mean they are any good at it.