Continuing my new practice of linking to and attempting to summarise statistics papers, here is a short piece by Andrew Gelman and Hal Stern:
Gelman, Andrew and Stern, Hal, ‘The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant’, The American Statistician, November 2006, Vol. 60, No. 4, 328-331 [pdf]
If I understand things aright, Gelman and Stern make the following point: that the emphasis on statistical significance in the reporting of results in the social sciences can lead to a misleadingly firm line being drawn between statistically significant and non-statistically significant results – which itself misrepresents the statistical significance of differences between results.
For example: if we are testing the same hypothesis against two different samples, and find a statistically significant result for one but not for another, this may lead us to draw a strong distinction between our two samples. One yields statistical significance and another does not – what difference could be clearer? Nevertheless, this does not itself indicate any statistically significant difference between our samples. If one test yields statistical significance at p = 0.0499, and another test does not yield statistical significance, at p = 0.0501, we have probably not discovered a dramatic difference between them. The actual difference between our samples is presumably tiny – yet because the difference in p value happens to bridge our choice of significance level, this difference can easily be reified, when equally large, or larger, differences between other samples are ignored.
This is intuitive enough – but the same point can apply even when the differences in p value are very substantial. Gelman and Stern write:
Consider two independent studies with effect estimates and standard errors of 25 ± 10 and 10 ± 10. The first study is statistically significant at the 1% level, and the second is not at all statistically significant, being only one standard error away from 0. Thus, it would be tempting to conclude that there is a large difference between the two studies. In fact, however, the difference is not even close to being statistically significant: the estimated difference is 15, with a standard error of … 14.
Additional problems arise when comparing estimates with different levels of information. Suppose in our example that there is a third independent study with much larger sample size that yields an effect estimate of 2.5 with standard error of 1.0. This third study attains the same significance level as the first study, yet the difference between the two is itself also significant. Both find a positive effect but with much different magnitudes. Does the third study replicate the first study? If we restrict attention only to judgments of significance we might say yes, but if we think about the effect being estimated we would say no, as noted by Utts (1991). In fact, the third study finds an effect size much closer to that of the second study, but now because of the sample size it attains significance.
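To check the arithmetic in the passage quoted above, here is a minimal sketch in Python. It assumes the estimates are independent and approximately normal, so that the standard error of a difference is the square root of the sum of the squared standard errors, and it uses two-sided p-values. The function names are my own, not anything from the paper.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare(est_a, se_a, est_b, se_b):
    """Difference between two independent estimates, with its SE, z and p."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # independence assumed
    z = diff / se_diff
    return diff, se_diff, z, two_sided_p(z)

# The three studies from the Gelman and Stern example: (estimate, standard error)
studies = {"first": (25.0, 10.0), "second": (10.0, 10.0), "third": (2.5, 1.0)}

for name, (est, se) in studies.items():
    z = est / se
    print(f"{name}: z = {z:.2f}, p = {two_sided_p(z):.3f}")

for a, b in (("first", "second"), ("first", "third")):
    diff, se_diff, z, p = compare(*studies[a], *studies[b])
    print(f"{a} vs {b}: diff = {diff:.1f}, se = {se_diff:.1f}, z = {z:.2f}, p = {p:.3f}")
```

On these numbers the first and second studies differ by 15 with a standard error of about 14.1 (z of roughly 1.06), while the first and third differ by 22.5 with a standard error of about 10.05 (z of roughly 2.24), which matches the quoted claims.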
In a blog post that references this paper, Gelman writes:
I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.
I don’t yet know what Gelman means by this latter clause, or what alternative approaches he endorses.
April 9, 2012
There was a piece in the Guardian recently with the headline “Religious people are more likely to be leftwing, says thinktank Demos”
new research suggests… people with faith are far more likely to take left-of-centre positions on a range of issues… The report found that 55% of people with faith placed themselves on the left of politics, compared with 40% who placed themselves on the right.
The figures given here are unhelpful. The relevant comparison is of course not the percentage of people with faith who identify as left, versus the percentage of people with faith who identify as right – but, rather, the political positions of those with faith compared to the political positions of those without.
So – let’s look at the report – specifically figure 7. “The social and political views of people who belong to religious organisations and those who do not, in western European countries and the UK”
The cluster of bars C indicates that 55% of people in the UK who belong to a religious organisation (not “people with faith” as the article says, but so it goes) place themselves on the left in politics. So far so good. What about people in the UK who do not belong to a religious organisation – what percentage of this group places themselves on the left? Well, the chart is a bit hard to read, but we can go to Appendix B and look at table 17a to find that it’s 62%. [Chi-square p = 0.0125]
I.e. those who belong to religious organisations in the UK are on average considerably less likely to identify as left of centre than those who don’t. The headline is precisely wrong – it should read “Religious people are more likely to be rightwing…”
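The comparison that matters here is between the two groups, and that is what a chi-square test on a 2×2 table checks. A rough sketch follows. I'm not reproducing the report's raw counts in this post, so the counts below are purely hypothetical, chosen only so that the proportions come out at 55% and 62%; the p-value they give will not match the one quoted above, which depends on the real sample sizes.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
           left   not left
    grp 1    a       b
    grp 2    c       d
    Returns (statistic, p-value) with 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# HYPOTHETICAL counts, for illustration only (not taken from the Demos report):
# 110 of 200 members on the left (55%), 310 of 500 non-members on the left (62%)
chi2, p = chi_square_2x2(110, 90, 310, 190)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

The same seven-point gap can be significant or not depending on how many people are in each group, which is one more reason the raw percentages in the article tell us so little.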
There’s plenty else wrong with the report and its coverage, but that’ll do. Fucking Demos.
[ PDF of the report here: http://www.demos.co.uk/files/Faithful_citizens_-_web.pdf?1333839181 ]
April 9, 2012
Another good piece on common misuses of statistics (full details at the bottom of the post) – this one demonstrating (among other things) that listening to different types of music will change your age:
Using the same method as in Study 1, we asked 20 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and their father’s age. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.
The gag here, of course, is that if you have enough data, and you analyse it in enough different ways, you’ll be able to find a statistically significant result almost anywhere. The authors of the paper reproduce this same passage later, with some additional phrases added to give a fuller account of the data collection and analysis process:
Using the same method as in Study 1, we asked 20 34 University of Pennsylvania undergraduates to listen only to either “When I’m Sixty-Four” by The Beatles or “Kalimba” or “Hot Potato” by the Wiggles. We conducted our analyses after every session of approximately 10 participants; we did not decide in advance when to terminate data collection. Then, in an ostensibly unrelated task, they indicated only their birth date (mm/dd/yyyy) and how old they felt, how much they would enjoy eating at a diner, the square root of 100, their agreement with “computers are complicated machines,” their father’s age, their mother’s age, whether they would take advantage of an early-bird special, their political orientation, which of four Canadian quarterbacks they believed won an award, how often they refer to the past as “the good old days,” and their gender. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040. Without controlling for father’s age, the age difference was smaller and did not reach significance (Ms = 20.3 and 21.2, respectively), F(1, 18) = 1.01, p = .33.
The authors dub this sort of problem “researcher degrees of freedom”. It is a form of data mining.
In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?
It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
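The last sentence of that passage can be made concrete under a simplifying assumption. If the analyses were fully independent, each with a 5% false-positive rate, the chance of at least one false positive among k of them would be 1 - (1 - 0.05)^k. Real analyses of the same dataset are usually correlated, so treat this as an illustration of the direction of the problem, not an exact figure. The function name is my own.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests,
    each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"{k:>2} analyses: {familywise_error_rate(k):.3f}")
# Prints 0.050, 0.226, 0.401 and 0.642 respectively
```

So with twenty independent shots at the data, a spurious “significant” result somewhere is more likely than not.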
The authors propose a set of guidelines for researchers to follow that will limit “researcher degrees of freedom”:
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article
2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification.
3. Authors must list all variables collected in a study.
4. Authors must report all experimental conditions, including failed manipulations.
5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.
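Guideline 1 is aimed at the practice of checking results after each batch of participants and stopping when significance appears. Here is a small simulation sketch of why that matters. The setup is my own and not from the paper: two groups drawn from the same normal distribution (so any “effect” is a false positive), a z-test with known variance, and a look after every 10 participants per group up to 50.

```python
import math
import random

def z_stat(a, b):
    """z-statistic for a difference in means, equal group sizes, known variance 1."""
    n = len(a)
    diff = sum(a) / n - sum(b) / n
    return diff / math.sqrt(2.0 / n)

def false_positive_rate(looks, n_sims=2000, seed=0):
    """Share of simulated null studies declared significant at any look.
    `looks` lists the per-group sample sizes at which the data are tested,
    in increasing order; the study stops at the first significant look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a, b = [], []
        for n in looks:
            while len(a) < n:  # collect data up to the next look
                a.append(rng.gauss(0, 1))
                b.append(rng.gauss(0, 1))
            if abs(z_stat(a, b)) > 1.96:  # nominal 5% two-sided threshold
                hits += 1
                break
    return hits / n_sims

fixed = false_positive_rate([50])
peeking = false_positive_rate([10, 20, 30, 40, 50])
print(f"fixed sample size: {fixed:.3f}")
print(f"peeking every 10:  {peeking:.3f}")
```

With a stopping rule fixed in advance the false-positive rate should sit near the nominal 5%; with repeated looks it comes out noticeably higher, even though every individual test uses the same threshold.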
These solutions are oriented towards psychology, and many of them relate to the data collection/creation process and its reporting. I don’t know how one might effectively limit “researcher degrees of freedom” in a discipline like economics, where often the data is already public, and the “researcher degrees of freedom” can lie in analytic choices alone.
Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” Psychological Science, XX(X) 1–8, 2011 [pdf!]
April 9, 2012
Browsing around on Google Scholar I ran across this accessible paper – which seems excellent, to my eyes – on the use and misuse of regression analysis. It’s focused on the use of the technique in criminology, but its claims apply more broadly.
Berk distinguishes between three different levels of regression analysis:
Level I: descriptive – simply identifying patterns in the data. No broader inferential or causal claims made. This can always be justified.
Level II: inferential – estimating parameters of a population, hypothesis testing, use of confidence intervals, etc. This can be justified if the data has been generated by probability sampling. If the data has not been generated by probability sampling, level II analysis is “difficult to justify” (485).
[Berk gives several types of justification that could be offered in this scenario: 1) Treating the data as the population (i.e. falling back to descriptive statistics); 2) Making the case that the data can be treated as if it were a probability sample (“rarely credible in practice”); 3) Treating the data as a random sample from an imaginary ‘superpopulation’ (“even more difficult to justify than inferences from the as-if strategy”); 4) Making use of a model of how the data was generated (risky, because the model might be wrong).]
Level III: causal – estimating causal relationships between variables in the population. “Perhaps too simply put, before the data are analyzed, one must have a causal model that is nearly right” (481). But: “It is very difficult to find empirical research demonstrably based on nearly right models.” (482)
Berk concludes that: “With rare exceptions, regression analyses of observational data are best undertaken at Level I. With proper sampling, a Level II analysis can be helpful.” Level III is very difficult to justify. Unfortunately: “The daunting part is getting the analysis past criminology gatekeepers. Reviewers and journal editors typically equate proper statistical practice with Level III.” (486)
April 9, 2012
As readers of the blog (if such there be) will know, my current ongoing project (or, I suppose, sub-project) is a document on the implications of the work of Robert Brandom for social theory. I’m keen to get that document done, but it’s on the back-burner for now, while work and life take well-deserved priority. Still, things churn away in the brain cell, and I’ve been thinking a bit about what the next thing to do is, once the Brandom document is complete.
My master-plan for the overall project has (as, again, readers may conceivably recall) six broad stages. The first three of these were, in order:
1) Social-theoretic foundations [the Brandom document]
2) History of capitalism
3) Analysis of value theory
(The remaining stages were, more or less, variations on ‘do political economy’)
I’d planned to move on to my (very brief!) history of capitalism once the Brandom document was done. My idea was that social theory and economic history were the broad areas of study that need to inform an adequate economics, but that are under-represented in current economics education – so I thought I’d get some basic grounding in these areas, before approaching economics itself.
My worry: if I do this, I’ll never get to the economics :-P. So: I’ve had a rethink. I still intend to write a history of capitalism, but I’m now seeing this as something to do in the interstices of my other studies.
This leaves the question: what to move on to once the Brandom document is complete?
The way I currently see it, contemporary economics has two broad areas of technical expertise:
First, statistical analysis of economic data.
Second, modelling of economic structures.
And, of course (third, if you like), drawing connections between them.
Obviously each of these areas has its formal and contentful aspects – the formal being simply how to do statistical analysis or modelling; the contentful being analysis of actual economic data, or discussion of actual economic models.
I want to start getting to grips with these technical areas of economics sooner rather than later. My current background concern is: how. I’m considering taking a higher degree in statistics – or just autodidacting my way through the terrain, as usual. Studying statistics will have the advantage of having real-world application (i.e. I can use it on the job market, which is important); and it has the further advantage that I have a lot of respect for orthodox statistical theory and practice (whereas I have very little respect for orthodox economic models and modelling) – so I’m less likely to flame out of these studies in the kind of bitter rage that motivates this blog as a space to do heterodox intellectual work. The downside is that statistics still isn’t economics proper – so it’s still a postponement of (what I regard as) the core of my project.
Just putting this up to externalise and help along the thought-process, really. Also to explain a shift in content on the blog – I’m still not blogging properly again, but I aim to start putting up statistics-related content; this is why.
I hope folks out there are well.