Someone named Andrew Certain writes:
I’ve been reading your blog since your appearance on Econtalk . . . explaining the ways in which statistics are misused/misinterpreted in low-sample/high-noise studies. . . . I recently came across a meta-analysis on stereotype threat [a reanalysis by Emil Kirkegaard] that identified a clear relationship between smaller sample sizes and higher effect estimates. However, their conclusion [that is, the conclusion of the original paper, The influence of stereotype threat on immigrants: review and meta-analysis, by Markus Appel, Silvana Weber, and Nicole Kronberger] seems to be the following: In order for the effects estimated by the small samples to be spurious, there would have to be a large number of small-sample studies that showed no effect. Since that number is so large, even accounting for the file-drawer effect, and because they can’t find those null-effect studies, the effect size must be large. Am I misinterpreting their argument? Is it as crazy as it sounds to me?
My reply:
I’m not sure. I didn’t see where in the paper they said that the effect size must be large. But I do agree that something seemed odd in their discussion: first they said that there were suspiciously few small-n studies showing small effect size estimates, but then they didn’t really do much with that observation.
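To see why that small-sample pattern is a warning sign in the first place, here is a quick simulation (mine, not anything from the paper or the reanalysis), assuming a true effect of exactly zero and a filter that only lets statistically significant comparisons through:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_published_effect(n_per_group, n_studies=10_000):
    """Average |effect estimate| among studies passing a p < .05 filter.
    True effect is zero throughout; sd = 1, so the raw difference in
    means is on the scale of Cohen's d."""
    kept = []
    for _ in range(n_studies):
        treat = rng.normal(0, 1, n_per_group)
        ctrl = rng.normal(0, 1, n_per_group)
        d = treat.mean() - ctrl.mean()
        se = np.sqrt(2 / n_per_group)
        if abs(d / se) > 1.96:  # crude z-test "significance" filter
            kept.append(abs(d))
    return np.mean(kept)

for n in [10, 25, 50, 200]:
    print(f"n per group = {n:3d}: mean published |d| = {mean_published_effect(n):.2f}")
```

The significance threshold acts like a minimum publishable effect size that shrinks with n (roughly 1.96 times sqrt(2/n) at the cutoff), which is exactly the pattern of small samples pairing with big effect estimates.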
Here’s the relevant bit from the Appel et al. paper:
Taken together, our sampling analysis pointed out a remarkable lack of null effects in small sample studies. If such studies were conducted, they were unavailable to us. A file-drawer analysis showed that the number of studies in support of the null hypothesis that were needed to change the average effect size to small or even to insubstantial is rather large. Thus, we conclude that the average effect size in support of a stereotype threat effect among people with an immigrant background is not severely challenged by potentially existing but unaccounted for studies.
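For what it’s worth, the standard version of that file-drawer calculation is, I assume, something like Orwin’s fail-safe N: how many unpublished studies averaging zero effect you would need to drag the pooled estimate below some “small” threshold. Here is the arithmetic as a sketch, with made-up numbers that are not the values from Appel et al.:

```python
def orwin_failsafe_n(k, d_mean, d_target):
    """Orwin (1983): number of unpublished studies averaging d = 0
    needed to pull the pooled mean effect from d_mean down to d_target.
    N_fs = k * (d_mean - d_target) / d_target."""
    return k * (d_mean - d_target) / d_target

# Illustrative numbers only; not the paper's values.
k, d_mean = 30, 0.5
for d_target in [0.2, 0.1]:  # "small" and "insubstantial" thresholds
    n_fs = orwin_failsafe_n(k, d_mean, d_target)
    print(f"to reach d = {d_target}: {n_fs:.0f} hidden null studies")
```

The numbers get big fast (45 and 120 hidden studies in this toy case), which is why the argument sounds reassuring. But note what the calculation conditions on: whole completed studies sitting unpublished in drawers.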
I’m not quite ready to call this “crazy” because maybe I’m just missing something here.
I will say, though, that I expect that “file drawer” is much less of an issue here than “forking paths.” That is, I don’t think there are zillions of completed studies that yielded non-statistically-significant results and then were put away in a file. Rather, I think that researchers manage to find statistically significant comparisons each time, or most of the time, using the data they have. And in that case the whole “count how many hidden studies would have to be in the file drawer” thing is irrelevant.
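Here’s a toy version of that, assuming the analyst has ten roughly independent comparisons to choose from (treating them as independent overstates things a bit, since in practice they would share data) and the true effect is zero everywhere:

```python
import numpy as np

rng = np.random.default_rng(1)

# Forking-paths sketch: one null dataset per study, but the analyst
# can test any of several outcomes / subgroups / adjustments and
# report whichever comparison comes out "significant."
n_studies, n_forks = 10_000, 10

hits = sum(
    np.any(np.abs(rng.normal(0, 1, n_forks)) > 1.96)
    for _ in range(n_studies)
)
print(f"share of null studies with a 'significant' result: {hits / n_studies:.2f}")
# about 0.40, since 1 - 0.95**10 = 0.40: most studies find something
# to report, and the file drawer stays nearly empty.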
P.S. “Andrew Certain” . . . what a great name! I’d like to be called Andrew Uncertain; that would be a good name for a Bayesian.