You’ve got data on 35 countries, but it’s really just N=3 groups.

Jon Baron points to a recent article, “Societal inequalities amplify gender gaps in math,” by Thomas Breda, Elyès Jouini, and Clotilde Napp (supplementary materials here), and writes:

A particular issue bothers me whenever I read studies like this, which use nations as the unit of analysis and then make some inference from correlations across nations. And I suspect that the answer to my concern is well known (but maybe not to the authors of these studies, and not to me) and you could just say what it is. My concern is this. The results here are based on correlations across 35 nations. But, if you look at the supplement, you will see that these nations fall into distinct groups. A whole bunch of them are “northern European.” These countries are related to each other in many ways, including geography, history, and culture. It seems to me that these kinds of relationships reduce the true number of observations to some number much smaller than 35. The whole result in the paper could come from the fact that this particular culture happens to have two features: low inequality and high emphasis on women’s education. In reality, these two features need not have any relationship. (I suspect that Japan and China would help break that correlation, for example.) To take an extreme example, suppose you had a sample consisting of all the countries of Europe and all the countries of Africa. That would be quite a few countries. And, within that total sample, you could find lots of highly significant correlations. But this would clearly be an N of 2, for all practical purposes. Or suppose we count all the U.S. states as if they were separate “states” (in the sense of nations). Why not? Surely Massachusetts and Alabama are no more similar than Germany and Norway. I can’t think of a purely statistical way to solve this problem. But that doesn’t mean there isn’t one.

I replied with a link to this discussion from a few years ago on that controversial claim that high genetic diversity, or low genetic diversity, is bad for the economy, in which I wrote:

Two economics professors, Quamrul Ashraf and Oded Galor, wrote a paper, “The Out of Africa Hypothesis, Human Genetic Diversity, and Comparative Economic Development,” that is scheduled to appear in the American Economic Review. . . . Ashraf and Galor have, however, been somewhat lucky in their enemies, in that they’ve been attacked by a bunch of anthropologists who have criticized them on political as well as scientific grounds. This gives the pair of economists the scientific and even moral high ground, in that they can feel that, unlike their antagonists, they are the true scholars, the ones pursuing truth wherever it leads them, letting the chips fall where they may. The real issue for me is that the chips aren’t quite falling the way Ashraf and Galor think they are. . . . The way to go is to start with the big pattern they noticed: the most genetically diverse countries (according to their measure) are in east Africa, and they’re poor. The least genetically diverse countries are remote undeveloped places like Bolivia and are pretty poor. Industrialized countries are not so remote (thus they have some diversity) but they’re not filled with east Africans (thus they’re not extremely genetically diverse). From there, you can look at various subsets of the data and perform various side analysis, as the authors indeed do for much of their paper.

And this post from a couple years later, where I wrote:

I continue to think that Ashraf and Galor’s paper is essentially an analysis of three data points (sub-Saharan Africa, remote Andean countries and Eurasia). It offered little more than the already-known stylized fact that sub-Saharan African countries are very poor, Amerindian countries are somewhat poor, and countries with Eurasians and their descendants tend to have middle or high incomes.

I asked if this was helpful.

Baron replied:

Yes and no. The second reference seems to state the problem, but in a way that is specific to this case. Yet I see this sort of thing all the time. I’m not concerned about the direction of causality. That is a separate problem. I think that the correlations across countries are highly deceptive when the countries fall into groups of very similar countries. When correlations are deceptive in this way, they are not useful for inferring any sort of causality, even if we can infer its direction from other considerations. In assessing the reliability of a correlation (by any method) it helps when N is higher. With an N of 35, a correlation can be clearly significant (for example) when the same correlation would be nowhere close to significant if the N is 3. Yet, if 35 countries fall into 3 groups of highly similar countries, the effective N is more like 3 than 35. Even worse, if the countries fall into two groups, you cannot even compute a true correlation at all. It only appears that you can when you use country as the unit of analysis. In the present study, this problem is exacerbated because the sampling of countries used, out of the population of all countries, is not at all random. The same problem occurs, however, when analyzing U.S. states, in studies that look at the entire population of states. Many of these correlations arise because states fall into groups: Confederacy, New England plus West Coast, flyover country. For example, I’m sure that “average humidity” correlates with “percent of evangelical Christians” across states, but that is really the result of the Confederacy alone, hence a historical accident. I guess an analogous problem occurs with time series. If “year” is the unit of analysis, you can get a nice correlation that is really the result of two linear trends over some period of time. (You can even use “day” and have a huge N.) I think this problem has been solved statistically. But I don’t know of any way of solving the problem of what might be called spatial or multi-dimensional similarity.

I don’t know that I’d say that the high percentage of evangelical Christians in the southern U.S. is a result of the Confederacy—maybe we’d still see if those states had never seceded and that war had never happened—but that’s not really the point here.

My response to Baron’s question is that you can deal with this sort of clustering by fitting a multilevel model including indicators for group as well as country. That said, this won’t solve the whole problem. As always, inferences depend on the specification of the model. In particular, including group indicators in your regression won’t necessarily resolve the problem. Ultimately I think you have to go to more careful models. For example, if you are comparing what’s going on in 35 countries, but they’re all in 3 groups, you might want to separately do analyses between and within groups.