Robert Wiblin writes:
If we have a study on the impact of a social program in a particular place and time, how confident can we be that we’ll get a similar result if we study the same program again somewhere else? Dr Eva Vivalt . . . compiled a huge database of impact evaluations in global development – including 15,024 estimates from 635 papers across 20 types of intervention – to help answer this question. Her finding: not confident at all. The typical study result differs from the average effect found in similar studies so far by almost 100%. That is to say, if all existing studies of an education program find that it improves test scores by 0.5 standard deviations – the next result is as likely to be negative or greater than 1 standard deviation, as it is to be between 0-1 standard deviations. She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.
Wiblin continues:
For researchers hoping to figure out what works and then take those programs global, these failures of generalizability and ‘external validity’ should be disconcerting. Is ‘evidence-based development’ writing a cheque its methodology can’t cash? Should we invest more in collecting evidence to try to get reliable results? Or, as some critics say, is interest in impact evaluation distracting us from more important issues, like national economic reforms that can’t be tested in randomised controlled trials?
Wiblin also points to this article by Mary Ann Bates and Rachel Glennerster who argue that “rigorous impact evaluations tell us a lot about the world, not just the particular contexts in which they are conducted” and write:
If researchers and policy makers continue to view results of impact evaluations as a black box and fail to focus on mechanisms, the movement toward evidence-based policy making will fall far short of its potential for improving people’s lives.
I agree with this quote from Bates and Glennerster, and I think the whole push-a-button, take-a-pill, black-box attitude toward causal inference has been a disastrous mistake. I feel particularly bad about this, given that econometrics and statistics textbooks, including my own, have been pushing this view for decades.
Stepping back a bit, I agree with Vivalt that, if we want to get a sense of what policies to enact, it can be a mistake to make these decisions based on the results of little experiments. There's nothing wrong with trying to learn from demonstration studies (as here), but generally I think realism is more important than randomization. And, when effects are highly variable and measurements are noisy, you can't learn much even from clean experiments.
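To see what I mean by highly variable effects, here's a toy simulation. The numbers (average effect 0.5, between-site spread 0.5, study standard error 0.15) are made up for illustration; they're not taken from Vivalt's database, they're just in the spirit of her finding that the typical result differs from the average of similar studies by almost 100%.

```python
# Toy simulation: true program effects vary across sites, and each study
# estimates its own site's effect with sampling noise. How well does one
# clean study predict the effect at the next site?
import numpy as np

rng = np.random.default_rng(0)
n_sites = 10_000

mu = 0.5      # assumed average true effect across sites (in SD units of the outcome)
tau = 0.5     # assumed between-site SD of true effects (heterogeneity ~100% of the mean)
sigma = 0.15  # assumed standard error of a reasonably large, clean study

true_effects = rng.normal(mu, tau, n_sites)                      # effect at the studied site
study_estimates = true_effects + rng.normal(0, sigma, n_sites)   # what the study reports
next_site_effects = rng.normal(mu, tau, n_sites)                 # effect at a new site

print("RMS error of a study's estimate for its own site:",
      round(float(np.sqrt(np.mean((study_estimates - true_effects) ** 2))), 2))
print("RMS error of that estimate for a new site:       ",
      round(float(np.sqrt(np.mean((study_estimates - next_site_effects) ** 2))), 2))
print("Share of new-site effects outside 0 to 1 SD:     ",
      round(float(np.mean((next_site_effects < 0) | (next_site_effects > 1))), 2))
```

Under these assumptions the study pins down its own site's effect pretty well (error around 0.15), but as a prediction for a new site it's off by about 0.7 on average, and roughly a third of new-site effects fall outside the 0 to 1 range entirely. That's the sense in which a clean experiment, on its own, doesn't tell you much about what will happen elsewhere.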