## The Statistical Significance of Differences in Statistical Significance

### April 12, 2012

Continuing my new practice of linking to and attempting to summarise statistics papers, here is a short piece by Andrew Gelman and Hal Stern, “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant” (*The American Statistician*, 2006):

If I understand things aright, Gelman and Stern make the following point: that the emphasis on statistical significance in the reporting of results in the social sciences can lead to a misleadingly firm line being drawn between statistically significant and non-statistically significant results – which itself misrepresents the statistical significance of differences between results.

For example: if we test the same hypothesis against two different samples, and find a statistically significant result for one but not for the other, this may lead us to draw a strong distinction between the two samples. One yields statistical significance and the other does not – what difference could be clearer? Nevertheless, this does not itself indicate any statistically significant difference between the samples. If one test yields p = 0.0499 (significant at the 0.05 level) and the other yields p = 0.0501 (not significant), we have probably not discovered a dramatic difference between them. The actual difference between the samples is presumably tiny – yet because the difference in p value happens to straddle our chosen significance level, it can easily be reified, while equally large, or larger, differences between other samples are ignored.

This is intuitive enough – but the same point can apply even when the differences in p value are very substantial. Gelman and Stern write:

> Consider two independent studies with effect estimates and standard errors of 25 ± 10 and 10 ± 10. The first study is statistically significant at the 1% level, and the second is not at all statistically significant, being only one standard error away from 0. Thus, it would be tempting to conclude that there is a large difference between the two studies. In fact, however, the difference is not even close to being statistically significant: the estimated difference is 15, with a standard error of … 14.

> Additional problems arise when comparing estimates with different levels of information. Suppose in our example that there is a third independent study with much larger sample size that yields an effect estimate of 2.5 with standard error of 1.0. This third study attains the same significance level as the first study, yet the difference between the two is itself also significant. Both find a positive effect but with much different magnitudes. Does the third study replicate the first study? If we restrict attention only to judgments of significance we might say yes, but if we think about the effect being estimated we would say no, as noted by Utts (1991). In fact, the third study finds an effect size much closer to that of the second study, but now because of the sample size it attains significance.
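The arithmetic in the quoted examples is easy to check. A minimal sketch in Python, using the normal approximation (an erf-based two-sided p-value); the estimates and standard errors are those from the quote:

```python
from math import erf, sqrt

def two_sided_p(estimate, se):
    """Two-sided p-value for estimate/se under a normal approximation."""
    z = abs(estimate / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Study 1: 25 ± 10 -> z = 2.5, p ≈ 0.012 (significant)
# Study 2: 10 ± 10 -> z = 1.0, p ≈ 0.32  (not significant)
# Their difference: 15, SE = sqrt(10**2 + 10**2) ≈ 14.1
p_diff_12 = two_sided_p(25 - 10, sqrt(10**2 + 10**2))

# Study 3: 2.5 ± 1.0 -> z = 2.5, the same significance level as study 1,
# yet the study-1-vs-study-3 difference IS itself significant:
p_diff_13 = two_sided_p(25 - 2.5, sqrt(10**2 + 1**2))

print(round(p_diff_12, 3))  # ≈ 0.289: "not even close" to significant
print(round(p_diff_13, 3))  # ≈ 0.025: below 0.05
```

So two studies that differ sharply in significance need not differ significantly (studies 1 and 2), while two studies with identical significance levels can differ significantly in what they estimate (studies 1 and 3).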

In a blog post that references this paper, Gelman writes:

> I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.

I don’t yet know what Gelman means by this latter clause, or what alternative approaches he endorses.

April 13, 2012 at 6:30 pm

I couldn’t get the links to work but anyhow, here’s another one. Say you’ve collected data on 10 variables and you’re trying to figure out if they correlate with each other. You do a series of pairwise contrasts for the 20 variables — a with b, a with c, etc. — and find that 10 of these contrasts yield correlations that are significant at the p < .05 level — not bad. However… 20 variables yield (20 × 19)/2 = 190 pairwise contrasts. You’ve set the significance level at .05, meaning that you’re interested in correlations only if there is less than a 5% chance of their having occurred through random variation in the sample. How many of the 190 pairwise contrasts are likely to show spurious correlations in your sample? The expected number is np by linearity of expectation (Boole’s inequality gives the related bound on the chance of at least one), which in this case is .05 × 190 = 9.5. So your 10 statistically significant correlations are just about what you’d expect from random variation in the sample. There are recognized corrections that compensate for multiple comparisons; it would be an interesting analysis to see how many studies reporting significant findings in some field of study actually perform the adjustment.
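A quick simulation bears this out. A stdlib-only sketch (the Fisher z-transform as the approximate per-pair test; the sample size of 100 observations is an arbitrary assumption):

```python
import random
from math import atanh, erf, sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def fisher_p(r, n):
    """Approximate two-sided p-value for a correlation r on n observations:
    under the null, atanh(r) * sqrt(n - 3) is roughly standard normal."""
    z = abs(atanh(r)) * sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(1)
n_obs, n_vars = 100, 20
# 20 genuinely independent variables: every "significant" pair is spurious.
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

pairs = hits = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        pairs += 1
        if fisher_p(pearson(data[i], data[j]), n_obs) < 0.05:
            hits += 1

print(pairs, hits)  # 190 pairs; hits typically in the vicinity of 190 × .05 = 9.5
```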

April 13, 2012 at 6:32 pm

I meant to say that you’ve collected data on 20 variables.

April 14, 2012 at 5:46 am

Yes, exactly – the dreaded “I did 1000 t tests and nearly 50 of them were significant (p < 0.05)! I’m going to publish those results (and barely even mention the others)!”

One of the many things I don’t know but should is how the various post-hoc tests one can run after a statistically significant ANOVA result (for example) actually work – the mechanism by which (and thus the extent to which…) the different tests compensate for the type I error problem. Without that knowledge I can’t judge for myself whether, or to what extent, a given test is appropriate in a given circumstance (one relies instead on others’ rules of thumb…). There is obviously a large literature on these issues (which are, after all, pretty basic, and which extend beyond post-hoc tests) – I just haven’t read it. Maybe this is something I could look at in future posts.
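For what it’s worth, the simplest of those corrections fit in a few lines. A sketch (not any particular package’s API) of the Bonferroni rule, which follows directly from Boole’s inequality, and Holm’s step-down variant, which gives the same guarantee while never rejecting fewer hypotheses:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i only when p_i < alpha / m. By Boole's inequality the
    family-wise error rate is at most alpha."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down: compare the k-th smallest p-value to
    alpha / (m - k); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] < alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

print(bonferroni([0.001, 0.02, 0.2]))  # [True, False, False]
print(holm([0.001, 0.02, 0.2]))        # [True, True, False]
```

The example p-values are made up to show the difference: Holm rejects the second hypothesis (0.02 < 0.05/2) where Bonferroni does not (0.02 > 0.05/3).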

Not quite the same issue, but the same family of problem: have you seen this xkcd strip?

http://xkcd.com/882/

April 14, 2012 at 12:52 pm

I don’t much care for green jellybeans anyhow.

April 14, 2012 at 1:16 pm

🙂