## Problems with Regression

### April 9, 2012

Browsing around on Google Scholar I ran across this accessible paper – which seems excellent, to my eyes – on the use and misuse of regression analysis. It’s focused on the use of the technique in criminology, but its claims apply more broadly.

Berk distinguishes between three different levels of regression analysis:

Level I: descriptive – simply identifying patterns in the data. No broader inferential or causal claims made. This can always be justified.

Level II: inferential – estimating parameters of a population, hypothesis testing, use of confidence intervals, etc. This can be justified if the data has been generated by probability sampling. If the data has not been generated by probability sampling, level II analysis is “difficult to justify” (485).

[Berk gives several types of justification that could be offered in this scenario: 1) Treating the data as the population (i.e. falling back to descriptive statistics); 2) Making the case that the data can be treated as if it were a probability sample (“rarely credible in practice”); 3) Treating the data as a random sample from an imaginary ‘superpopulation’ (“even more difficult to justify than inferences from the as-if strategy”); 4) Making use of a model of how the data was generated (risky, because the model might be wrong).]
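A rough illustration of why the sampling mechanism matters at Level II (a sketch, not taken from Berk's paper). The simulated population, the slope of 2, the sample size of 500 and the "convenience" rule of only observing units with a positive outcome are all assumptions chosen for the example. A simple random sample recovers the population slope reasonably well; the convenience sample, analysed with exactly the same regression, does not.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_of(units):
    """Slope for a list of (x, y) pairs."""
    return ols_slope([u[0] for u in units], [u[1] for u in units])


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: y rises with x, plus a lot of noise.
population = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = 2 * x + rng.gauss(0, 2)
    population.append((x, y))

pop_slope = slope_of(population)

# Probability sample: every unit has the same chance of selection.
srs_slope = slope_of(rng.sample(population, 500))

# Convenience sample: only units with a positive outcome are ever observed.
observable = [u for u in population if u[1] > 0]
conv_slope = slope_of(rng.sample(observable, 500))

print(f"population slope:         {pop_slope:.2f}")
print(f"simple random sample:     {srs_slope:.2f}")
print(f"convenience sample (y>0): {conv_slope:.2f}")
```

Nothing in the convenience sample itself signals the problem. The gap is visible here only because the simulation lets us see the whole population, which is rarely true in practice.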

Level III: causal – estimating causal relationships between variables in the population. “Perhaps too simply put, before the data are analyzed, one must have a causal model that is nearly right” (481). But: “It is very difficult to find empirical research demonstrably based on nearly right models” (482).

Berk concludes that: “With rare exceptions, regression analyses of observational data are best undertaken at Level I. With proper sampling, a Level II analysis can be helpful.” Level III is very difficult to justify. Unfortunately: “The daunting part is getting the analysis past criminology gatekeepers. Reviewers and journal editors typically equate proper statistical practice with Level III.” (486)

April 10, 2012 at 4:37 pm

The linked article summarizes sound practice in any field of scientific inquiry where one must decide which data analysis techniques are best suited to the empirical data in hand. The authors are indicting not the statistical methods of regression and causal analysis, nor even the uses of these methods in criminological research, but the unjustified conceptual inferences derived from interpreting the statistical findings. The researchers themselves should qualify supportable inferences from their data and analyses when they write up the results for review/publication. It’s often the case, though, that readers who aren’t particularly well-versed in research design and stats skip over the boring technical details and caveats outlined in Methods and Results and jump ahead to the Discussion section, where with lyrical bravado the authors grant themselves more leeway in extrapolating beyond their findings. I don’t know much about criminology, but maybe the authors of this piece regard the field as lacking sophistication in both the conduct and evaluation of empirical research.

The authors of this piece make it sound as though using regression to build causal models is rarely justified, though they devote only one short paragraph to Level III. How does one arrive at a “nearly right model,” for example? One way is to use results from prior studies to select variables that have already demonstrated statistical correlation with the outcome of interest. Then you might use multiple regression to see which of these already-supported variables are the stronger predictors, which ones are so intercorrelated with each other that it’s possible to eliminate some of them even if in univariate analysis they look strong, and so on. Causal modeling uses correlation matrices from data collected at a single point in time to infer cause-effect cascades producing the final outcome of interest. This exercise is fraught with peril, since causes presumably precede effects in time. Here again you’d need some a priori justification for presuming that variable A precedes variable B in order to vouch for the statistical model: either prior research has demonstrated the sequencing, or from a logical standpoint A had to precede B. This sort of work is painstaking and cumulative but it can be done, often relying on work conducted by others in the field.

April 10, 2012 at 5:04 pm

Another thought… What constitutes a “nearly right model”? From the perspective of inferential statistics, “nearly right” means that nearly all of the variance in the outcome variable is accounted for by the cumulative effect of the predictor variables. In many fields such a high level of rightness is rarely approached very closely. A statistically significant regression model that accounts for maybe 35% of the variance can be regarded as an extremely strong result, setting the stage for the next round of research that might push toward 40% of the variance. I.e., while “rightness” might be the norm, it’s the hypothetical endpoint of a continuum. In practice statistical rightness is approached incrementally, iteratively, and imperfectly.
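For concreteness, “variance accounted for” is the R² statistic: one minus the ratio of residual variation to total variation around the mean. The sketch below is illustrative only. The data are simulated, and the noise level is an assumption picked so that the result lands near the 35% figure mentioned above.

```python
import random


def r_squared(xs, ys):
    """Share of the variance in ys accounted for by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(1)  # fixed seed so the run is repeatable

# Signal variance is 1; noise variance is set so signal / (signal + noise) = 0.35.
noise_sd = (1 / 0.35 - 1) ** 0.5
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys = [x + rng.gauss(0, noise_sd) for x in xs]

r2 = r_squared(xs, ys)
print(f"share of variance accounted for: {r2:.2f}")
```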

April 10, 2012 at 6:48 pm

Hi ktismatics.

Yes, exactly – this is why I linked to and summarised the article – I think it’s a good piece for this reason. Perhaps the title of the post is ill-chosen: “Problems with some common uses of regression” would have been clearer, since as you say the technique itself is blameless; application and interpretation is where the problems lie. Still I thought it was a good, salutary paper, making clearer than usual some risks and common pitfalls associated with the technique.

Well note that the paper begins with a discussion of Level III approaches – but yes – I take it that the author believes regression should not be used to build causal models themselves. Rather, once one has arrived at a causal model that one has considerable warranted confidence in, regression can be used to fill in the relevant values of the model’s parameters based on empirical data. The author’s complaint, I take it, is with the belief that regression analysis can be used to provide compelling evidence for the model itself, rather than to establish what the model parameters would be were the model sound.

Yes – I take it that the author would agree with and aspire to this, as do I. I think the author’s claim is that Criminology (his field, I presume) cannot be taken to have successfully developed quantitative models that we have warrant enough to regard as “nearly right”, such that we can justifiably nail down their parameters using regression. He’s not arguing that this is in principle impossible – but that it is a much more difficult task than many of his peers take it to be, and that over-hasty and over-interpretive use of purportedly causal regression analysis obscures rather than illuminates.

No, I think I disagree here – because this final clause can be the case even for descriptive use of regression (Berk’s level I). Berk’s point, or one of them, I take it, is that the association of outcome variable variance with predictor variable variance doesn’t in itself do anything other than describe patterns in the data. To make claims about population rather than sample requires something else again (probability sampling or plausible as-if substitute); and to make causal claims, as simple regression analysis is often purported to enable, requires something still further – a strongly and independently warranted causal model. The slippage from “nearly all the variance in z can be accounted for by [that is, in the first instance, is associated with] variance in w, x and y” to “we have a nearly right model” to “we know something about the population (because we have a nearly right model)” to “we know something causal about the population (because we have a nearly right model)” is unwarranted. I take it that Berk’s purpose is to reduce the ease with which this movement can be made, by practitioners in this field.

Berk has a book which I assume expands on the content of this paper, but which I haven’t read:

http://www.powells.com/biblio?PID=27627&cgi=product&isbn=0-7619-2904-5

At Three Toed Sloth, Cosma Shalizi writes “I won’t say this belongs on every practitioner’s shelf, because it’s not the kind of book you’ll come back to again and again. I will say that anyone who is becoming a practitioner of regression needs to learn these lessons thoroughly, for their own sake and for the sake of anyone who might have to rely on their findings” which sounds like a plausible summary, and recommends it to me.

http://cscs.umich.edu/~crshalizi/weblog/algae-2007-11.html#berk-on-regression

Andrew Gelman discusses it here, in his usual piecemeal fashion:

http://andrewgelman.com/2006/05/richard_berks_b/

April 10, 2012 at 7:12 pm

Totally agree with this, btw – and not just for statistical rightness, but for scientific truth-claims in general.

April 10, 2012 at 7:58 pm

We’re probably in agreement even on the part where we disagree. Treading cautiously into metaphysical ground, I’m not sure that empirical findings can ever do anything other than demonstrate patterns in the data, regardless of the techniques used to analyze those data. Data analyses are used to evaluate the extent to which the empirical findings correspond to the hypothesis driving the study. The match between so-called statistical rightness and theoretical rightness is always a matter of degree, probability, confidence interval, and so on, whether it’s a descriptive or a predictive hypothesis you’re trying to evaluate.

There are, of course, researchers who use the statistics simultaneously to generate and to validate a hypothesis: that’s what happens in data mining, which you pillory in your next post. Also, in the Discussion section the researchers often overstate the match between empirical findings and theory, as if not falsifying the hypothesis on probabilistic grounds is the same as verifying it.

I believe that you could teach yourself descriptive and inferential statistics if you got yourself a good book or two. You’ve already demonstrated the aptitude for seeing behind the PR apparatus of research overhype, as demonstrated in our discussions of Tomasello and in your post about religious liberalism.

April 11, 2012 at 1:52 pm

Well but of course the inferential part of inferential statistics involves making claims about a broader population – so, yes, we can’t know for sure whether our inferences are right, but the inference is often the goal, and pattern-demonstration is the means. If we’ve made our inferences well then our claims should have a decent chance of predictive capacity (predicting likely patterns in future data drawn from the same population, assuming appropriate sampling techniques). But if we’re making causal claims then we are still more ambitious – we are aspiring not just to predict but also to explain, and we need something beyond even repeated successful pattern-identification and -prediction to justify our model. It’s worth underlining, as Berk does in this piece, that to claim a causal model is nearly right is to make a different and stronger claim than to say that a dependent variable’s variance is accounted for by [/associated with] the variance in independent variables. So yes, at one level it’s all about demonstrating patterns in the data, but we have sound reasons for taking some patterns to warrant conclusions about phenomena beyond the dataset itself. Too many researchers, though, seem confused about which inferences are warranted and which are not.
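The gap between “variance associated with” and “causal effect of” can be made concrete with a toy simulation (a sketch under stated assumptions, not an example from Berk's paper). In the hypothetical setup below, a background variable z drives both x and y, and x has no effect on y at all. Regressing y on x alone still yields a clearly positive slope. The slope vanishes only once z is adjusted for, and we can only make that adjustment because we already know the structure that generated the data.

```python
import random


def fit_line(xs, ys):
    """Least-squares (slope, intercept) of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """What is left of ys after removing its straight-line fit on xs."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(2)  # fixed seed so the run is repeatable
n = 5000

# Assumed data-generating process: z causes x and z causes y; x does NOT cause y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Level I description: y and x are strongly associated.
naive_slope, _ = fit_line(x, y)

# Adjusting for z by residualizing both variables on it first. The resulting
# slope equals the coefficient on x in a regression of y on x and z together.
adjusted_slope, _ = fit_line(residuals(z, x), residuals(z, y))

print(f"slope of y on x, ignoring z:  {naive_slope:.2f}")
print(f"slope of y on x, adjusting z: {adjusted_slope:.2f}")
```

Both numbers are accurate descriptions of the same data. Which one, if either, deserves a causal reading depends entirely on the model of how the data arose, and the regression output cannot supply that.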

Thanks, yes – I’ve already read a good book or two, of course – I’m looking to extend my knowledge, not start from scratch – but yes, I’m sure I can autodidact my way to where I want to go if necessary. The issue is that, in my experience, people are often unwilling to attribute knowledge or understanding if one cannot produce the appropriate institutional seal of authority. Whether this is enough to make acquiring such a seal worthwhile, I don’t know.