By the Australian Bureau of Statistics’ measure, life expectancy at birth for Australians is 78.5 years for men and 82.4 years for women (figures for 2005-2007).

These aren’t precisely the same as the World Bank’s figures for the same period (which are slightly higher) – but they’re pretty close. (The World Bank figures for 2006, which I’m assuming is the best comparison year for this data, are 78.7 for men, 83.5 for women, and 81 overall).

These figures are outstanding: they put Australia above the average life expectancy for high income countries (which is 76.2 for men, 81.9 for women, and 79 overall). Australia is in the top ten countries in the world for life expectancy.

But these numbers don’t capture demographic differentials within Australia. Notably, Indigenous Australians have markedly lower life expectancies than those for the country overall.

In 2006-7, the life expectancy at birth for Indigenous Australians was 67.2 for men, and 72.9 for women. (Weighted by the relative size of the male and female populations, I make that an overall life expectancy of 70.06.)

If Indigenous Australians were their own country, the life expectancy of that country would fall between those of Guatemala and Morocco.

Now looking just at the Indigenous Australian population of the Northern Territory, life expectancy for men is 61.5; for women 69.2. Again weighted by male and female population, I get an average life expectancy for this population overall of 65.41.
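The weighting step can be sketched in a few lines of Python. The population shares behind the 70.06 and 65.41 figures are not given in this post, so the sketch instead backs out the male share that each overall figure implies. The function names are purely illustrative.

```python
def weighted_life_expectancy(male_le, female_le, male_share):
    """Population-weighted average of male and female life expectancy."""
    return male_share * male_le + (1.0 - male_share) * female_le


def implied_male_share(male_le, female_le, overall_le):
    """Male population share implied by a stated overall figure."""
    return (female_le - overall_le) / (female_le - male_le)


# Figures quoted in the post: (men, women, overall).
all_indigenous = (67.2, 72.9, 70.06)
northern_territory = (61.5, 69.2, 65.41)

for label, (men, women, overall) in [
    ("All Indigenous Australians", all_indigenous),
    ("Northern Territory", northern_territory),
]:
    share = implied_male_share(men, women, overall)
    check = weighted_life_expectancy(men, women, share)
    print(f"{label}: implied male share {share:.3f}, weighted mean {check:.2f}")
```

Both implied shares come out just under one half, which is what one would expect if the weights are close to an even split between men and women.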

That figure is less than the average life expectancy for the world. If the Northern Territory’s Indigenous Australian population were its own country, that country would have a life expectancy just higher than Bhutan, and a bit lower than the Solomon Islands.

This doesn’t seem to be a defensible state of affairs.

NHS Cuts by Region

May 23, 2012

Éoin Clarke’s blog has a post on the uneven geographical distribution of NHS cuts. He writes:

The wealthiest, and dare I say it Toriest, parts of England have actually experienced no job losses. The South East of England has actually grown its NHS workforce since the May General Election, while the North West of England alone has experienced more than 6,500 job losses.

His post includes a chart. Clarke’s chart shows absolute figures – I thought I’d make my own version of it, showing percentage change. This doesn’t make any real difference to the story, but here it is anyway. Note that these figures are Hospital and Community Health Service staff, excluding primary care staff – lots of NHS employment isn’t captured.


The mathematics of inferential statistics is based on the logic of random sampling: the inferences we make in inferential statistics work on the assumption that the data we are inferring from is randomly sampled from the population we are inferring to – that every member of the population has an equal chance of ending up in our dataset. Obviously this usually isn’t the case; but that’s the assumption, and the further our actual sampling practice deviates from that ideal situation, the less likely our inferences are to have any validity.

In much inferential statistics, the population we are sampling from is an actual population of cases, which could in principle be observed directly if we only had the money, time, staff, access, etc. etc. Here the ideal situation is to create a sampling frame that lists all the cases in the population, randomly select a subset of cases from the sampling frame, and then collect data from those cases we’ve selected. In practice, of course, most data collection doesn’t work this way – instead researchers pick a convenience sample of some kind (sometimes lazily, sometimes unavoidably) and then try to make the argument that this sampling method is unlikely to be strongly biased in any relevant way.
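As a rough illustration of that ideal procedure, here is a sketch in Python. The sampling frame, its size and the sample size are all invented for the example; nothing here comes from a real study.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n cases without replacement; every case is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: one ID for every case in the population.
frame = [f"case-{i:04d}" for i in range(1, 1001)]

# Randomly select the subset of cases from which data would then be collected.
selected = simple_random_sample(frame, n=50, seed=1)
print(selected[:5])
```

A convenience sample, by contrast, has no step like `rng.sample`: the cases are whichever ones happened to be reachable, which is why the argument about bias has to be made separately.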

Sometimes, however, the population from which we draw our sample is not an actual population of cases that happen for contingent practical reasons to be beyond the reach of observation. Sometimes the population from which we draw our sample is a purely theoretical entity – a population of possible circumstances, from which actuality has drawn, or realised, one specific instance. Thus our actual historical present is a ‘sample’ from a ‘population’ of possible realities, and the generalisation we aim to make from our sample is a generalisation to the space of possibilities, rather than simply to some aspect of crass and meagre fact.

When we make claims that are predictive of future events, not merely of future observations of present events, we are, tacitly or overtly, engaged in this endeavour. To predict the future is to select one possible reality out of a space of possibilities, and to attribute a likelihood to this prediction is to engage in the statistical practice of assigning probability figures to a range of estimates of underlying population parameters – or, equivalently, to give probability figures to a range of estimates of future sample statistics ‘drawn from’ that underlying population. I may try to articulate this point with more precision in a future post – I’d like to spend more time on Bayesian vs. frequentist approaches to probability. And there is, of course, a ‘metaphysical’ question as to whether such a ‘population’ ‘really exists’, or whether the ‘samples’ themselves are the only reality, and the ‘population’ a speculative theoretical entity derived from our experience of those samples. Functionally, however, these stances are identical: and by my pragmatist lights, to note such functional equivalence is to collapse the two possibilities together for most theoretical purposes.

When we speak of universal natural laws, then, we are stating that a given fact – the law in question – will be true in the entire range of possible worlds that might, in the future, be actualised in reality. (Whether this ‘possibility’ should be understood in ontological or epistemological terms is beside the point). For some, it is the role of science to make such predictions: on this erroneous stance, science attempts to identify universal features of reality, and any uncertainty that accrues to scientific results is the uncertainty of epistemological weakness, rather than ontological variation. There is, for example, a video of Richard Feynman making fun of social science for its inability to formulate universal laws of history.

To take this attitude is to misunderstand the nature not just of social science, but of science in general. Science is not characterised by a quest for certainty or for permanence, but is rather characterised by an ongoing collective process of hypothesis formation and assessment, based on specific collectively accepted evidentiary standards. The conclusions of science cannot be certain, because they must always be vulnerable to refutation in the light of empirical evidence and the application of community norms of argument. Similarly, the phenomena examined by science need not be necessary, or even ongoing. A scientific endeavour can be entirely descriptive, of the most local and variable phenomena imaginable, so long as the process of description is subject to the appropriate communal evidentiary norms. It can, similarly, be explanatory without being predictive, for we can analyse the causes of the phenomena we observe without being able reliably to predict those causes’ future impacts and interactions. The set of phenomena regarding which long-term or even short-term reliably predictive hypotheses can be formed is smaller than the set of phenomena that can be studied empirically using the relevant community norms of hypothesis formation and assessment.

The social sciences often approach this limit case of the purely descriptive. Social reality is enormously variegated – and often there is little in the way of testable general claims that can be taken from a study of any given social phenomenon. But prediction is nevertheless sometimes the goal of social science. When the social sciences aim to study social phenomena, the ‘laws’ they aspire to uncover are always local and limited in scope – and when we form a hypothesis, this hypothesis applies within a certain local limit and no further. Where to draw the line – where to locate this limit – is a qualitative question that the community of social scientists must always bear in mind, but the existence of this limit in no way renders the endeavour ‘unscientific’.

When we make a social-scientific prediction, then, we are making a claim about what future reality will be drawn from the space of possibility. We do not know the scope of this space – nor do we have any reason to regard the principle of selection as random or unbiased – indeed, we have strong reasons to believe the contrary. Further, the nature of social reality is such that we can and do aspire to intervene in this selection – to attempt to influence what possibilities are realised. As social scientists we sometimes aim to predict what outcomes will be drawn from this space of possibilities – and such a prediction can only be made within the framework of a broader, historically informed judgement of the narrower space, within the space of possibilities, that we aspire to model.

But we should also be aware of other, unrealised but potentially realisable social possibilities, beyond the set of possibilities we are modelling at any given moment. Part of the function of the scrupulous social scientist is to describe this space of possibilities itself – to describe not just regularities, but also the possible variety from within which those local regularities are drawn. We cannot know the limits to the space of possibilities – no sampling frame of possible societies exists. But we can explore what the ‘samples’ themselves – existing and historical societies and behaviours – tell us about the scope of that hypothetical space.

This latter task is where social science intersects with political practice. The understanding of the likely behaviour of social reality is important for political practice – but so too is a sense of the larger space of possibilities from which our own past and present societies have been drawn, and from which alternative futures could be drawn, or made, if we only had the political ability to do so.

Continuing my new practice of linking to and attempting to summarise statistics papers, here is a short piece by Andrew Gelman and Hal Stern:

Gelman, Andrew and Stern, Hal, ‘The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant’, The American Statistician, November 2006, Vol. 60, No. 4 328-331 [pdf]

If I understand things aright, Gelman and Stern make the following point: that the emphasis on statistical significance in the reporting of results in the social sciences can lead to a misleadingly firm line being drawn between statistically significant and non-statistically significant results – which itself misrepresents the statistical significance of differences between results.

For example: if we are testing the same hypothesis against two different samples, and find a statistically significant result for one but not for another, this may lead us to draw a strong distinction between our two samples. One yields statistical significance and another does not – what difference could be clearer? Nevertheless, this does not itself indicate any statistically significant difference between our samples. If one test yields statistical significance at p = 0.0499, and another test does not yield statistical significance, at p = 0.0501, we have probably not discovered a dramatic difference between them. The actual difference between our samples is presumably tiny – yet because the difference in p value happens to bridge our choice of significance level, this difference can easily be reified, when equally large, or larger, differences between other samples are ignored.

This is intuitive enough – but the same point can apply even when the differences in p value are very substantial. Gelman and Stern write:

Consider two independent studies with effect estimates and standard errors of 25 ± 10 and 10 ± 10. The first study is statistically significant at the 1% level, and the second is not at all statistically significant, being only one standard error away from 0. Thus, it would be tempting to conclude that there is a large difference between the two studies. In fact, however, the difference is not even close to being statistically significant: the estimated difference is 15, with a standard error of … 14.

Additional problems arise when comparing estimates with different levels of information. Suppose in our example that there is a third independent study with much larger sample size that yields an effect estimate of 2.5 with standard error of 1.0. This third study attains the same significance level as the first study, yet the difference between the two is itself also significant. Both find a positive effect but with much different magnitudes. Does the third study replicate the first study? If we restrict attention only to judgments of significance we might say yes, but if we think about the effect being estimated we would say no, as noted by Utts (1991). In fact, the third study finds an effect size much closer to that of the second study, but now because of the sample size it attains significance.
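The arithmetic in the quoted example can be checked with a short Python sketch. It assumes a normal approximation and two-sided p-values, and it treats the three studies as independent (as the quote does); the helper names are my own.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for estimate / std_error under a normal approximation."""
    z = abs(estimate) / std_error
    return 2.0 * (1.0 - NormalDist().cdf(z))


def difference(est_a, se_a, est_b, se_b):
    """Estimate and standard error of the difference of two independent estimates."""
    return est_a - est_b, sqrt(se_a ** 2 + se_b ** 2)


# (effect estimate, standard error) for the three studies in the quote.
studies = {"first": (25.0, 10.0), "second": (10.0, 10.0), "third": (2.5, 1.0)}

for name, (est, se) in studies.items():
    print(f"{name}: z = {est / se:.2f}, p = {two_sided_p(est, se):.3f}")

for a, b in [("first", "second"), ("first", "third"), ("second", "third")]:
    d, se = difference(*studies[a], *studies[b])
    print(f"{a} minus {b}: {d:.1f} ± {se:.1f}, p = {two_sided_p(d, se):.3f}")
```

On these assumptions the first and third studies have identical z-scores, the difference between the first and second studies falls well short of conventional significance, and the difference between the first and third studies clears the 5% level.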

In a blog post that references this paper, Gelman writes:

I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.

I don’t yet know what Gelman means by this latter clause, or what alternative approaches he endorses.

There was a piece in the Guardian recently with the headline “Religious people are more likely to be leftwing, says thinktank Demos”

new research suggests… people with faith are far more likely to take left-of-centre positions on a range of issues… The report found that 55% of people with faith placed themselves on the left of politics, compared with 40% who placed themselves on the right.

The figures given here are unhelpful. The relevant comparison is of course not the percentage of people with faith who identify as left, versus the percentage of people with faith who identify as right – but, rather, the political positions of those with faith compared to the political positions of those without.

So – let’s look at the report – specifically figure 7. “The social and political views of people who belong to religious organisations and those who do not, in western European countries and the UK”

The cluster of bars C indicates that 55% of people in the UK who belong to a religious organisation (not “people with faith” as the article says, but so it goes) place themselves on the left in politics. So far so good. What about people in the UK who do not belong to a religious organisation – what percentage of this group places themselves on the left? Well, the chart is a bit hard to read, but we can go to Appendix B and look at table 17a to find that it’s 62%. [Chi-square p = 0.0125]

I.e. those who belong to religious organisations in the UK are on average considerably less likely to identify as left of centre than those who don’t. The headline is precisely wrong – it should read “Religious people are more likely to be rightwing…”

There’s plenty else wrong with the report and its coverage, but that’ll do. Fucking Demos.

[ PDF of the report here: http://www.demos.co.uk/files/Faithful_citizens_-_web.pdf?1333839181 ]

Another good piece on common misuses of statistics (full details at the bottom of the post) – this one demonstrating (among other things) that listening to different types of music will change your age:

Using the same method as in Study 1, we asked 20 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and their father’s age. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040

The gag here, of course, is that if you have enough data, and you analyse it in enough different ways, you’ll be able to find a statistically significant result almost anywhere. The authors of the paper reproduce this same passage later, with some additional phrases added to give a fuller account of the data collection and analysis process:

Using the same method as in Study 1, we asked 34 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba” or “Hot Potato” by the Wiggles. We conducted our analyses after every session of approximately 10 participants; we did not decide in advance when to terminate data collection. Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and how old they felt, how much they would enjoy eating at a diner, the square root of 100, their agreement with “computers are complicated machines,” their father’s age, their mother’s age, whether they would take advantage of an early-bird special, their political orientation, which of four Canadian quarterbacks they believed won an award, how often they refer to the past as “the good old days,” and their gender. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040. Without controlling for father’s age, the age difference was smaller and did not reach significance (Ms = 20.3 and 21.2, respectively), F(1, 18) = 1.01, p = .33.

The authors dub this sort of problem “researcher degrees of freedom”. It is a form of data mining.

In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
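To get a feel for how quickly that likelihood grows, here is a minimal sketch. It assumes the analyses are independent and that there is no real effect in any of them. Alternative analyses of the same dataset are usually correlated, so the exact figures will differ in practice; the function name is my own.

```python
def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of num_analyses independent tests
    is a false positive at significance level alpha, when no effect exists."""
    return 1.0 - (1.0 - alpha) ** num_analyses


for k in (1, 2, 5, 10, 20):
    print(f"{k:>2} analyses: {chance_of_false_positive(k):.3f}")
```

Under these simplifying assumptions, ten independent tries at the 5% level already give roughly a 40% chance of at least one spurious “significant” result.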

The authors propose a set of guidelines for researchers to follow that will limit “researcher degrees of freedom”:

1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.

2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification.

3. Authors must list all variables collected in a study.

4. Authors must report all experimental conditions, including failed manipulations.

5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.

6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

These solutions are oriented towards psychology, and many of them relate to the data collection/creation process and its reporting. I don’t know how one might effectively limit “researcher degrees of freedom” in a discipline like economics, where often the data is already public, and the “researcher degrees of freedom” can lie in analytic choices alone.

Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” Psychological Science, XX(X) 1–8, 2011 [pdf!]

Problems with Regression

April 9, 2012

Browsing around on Google Scholar I ran across this accessible paper – which seems excellent, to my eyes – on the use and misuse of regression analysis. It’s focused on the use of the technique in criminology, but its claims apply more broadly.

Berk, Richard, “What You Can and Can’t Properly Do with Regression”, Journal of Quantitative Criminology, Volume 26, Number 4, 2010, pp. 481-487

Berk distinguishes between three different levels of regression analysis:

Level I: descriptive – simply identifying patterns in the data. No broader inferential or causal claims made. This can always be justified.

Level II: inferential – estimating parameters of a population, hypothesis testing, use of confidence intervals, etc. This can be justified if the data has been generated by probability sampling. If the data has not been generated by probability sampling, level II analysis is “difficult to justify” (485).

[Berk gives several types of justification that could be offered in this scenario: 1) Treating the data as the population (i.e. falling back to descriptive statistics); 2) Making the case that the data can be treated as if it were a probability sample (“rarely credible in practice”); 3) Treating the data as a random sample from an imaginary ‘superpopulation’ (“even more difficult to justify than inferences from the as-if strategy”); 4) Making use of a model of how the data was generated (risky, because the model might be wrong).]

Level III: causal – estimating causal relationships between variables in the population. “Perhaps too simply put, before the data are analyzed, one must have a causal model that is nearly right” (481). But: “It is very difficult to find empirical research demonstrably based on nearly right models.” (482)

Berk concludes that: “With rare exceptions, regression analyses of observational data are best undertaken at Level I. With proper sampling, a Level II analysis can be helpful.” Level III is very difficult to justify. Unfortunately: “The daunting part is getting the analysis past criminology gatekeepers. Reviewers and journal editors typically equate proper statistical practice with Level III.” (486)
