Data concerns when interpreting comparisons of gender equality between countries

A journalist pointed me to this research article, “Gender equality and sex differences in personality: evidence from a large, multi-national sample,” by Tim Kaiser (see also news report by Angela Lashbrook here), which states:

A large, multinational (N = 926,383) dataset was used to examine sex differences in Big Five facet scores for 70 countries. Difference scores were aggregated to a multivariate effect size (Mahalanobis’ D). . . . Countries’ difference scores were related to an index of gender equality, revealing a positive weighted correlation of r = .39 . . . Using multivariate effect sizes derived from latent scores with invariance constraints, the study of sex differences in personality becomes more robust and replicable. Sex differences in personality should not be interpreted as results of unequal treatment . . .

The journalist wrote, “This study found that as gender equality increases, so do gender differences. Have you seen evidence of this in research, including your own research?”

I replied as follows:

I have not worked in this area of gender equality myself, so I can’t say I’ve seen this pattern, or not seen it, in my own research. This particular study has a lot of details; the main pattern appears in its figure 2. I have no sense of why this “Mahalanobis D” number is so much higher in Latvia than in Iran, but I could well imagine it’s an artifact of the survey responses. In general, if the survey responses are noisier, you’d expect a measure such as this D to be closer to zero. If you look in the paper, you’ll see that D is based on various personality inventories, and I could well imagine that these responses would have different interpretations in Latvia, Iceland, and Finland than in Iran, Mexico, and Jamaica.

In addition, the paper says that “the assessment procedures have selected for English language subjects with Internet access.” This would seem to destroy all their interpretations of the results when comparing countries. So I would be wary of taking these results too seriously. That said, there’s nothing wrong with speculation, as long as it is clearly labeled as such.

I’m also skeptical about the following claim made in the paper: “The degree to which a society allows individuals to express biological gender differences can vary. If a society ensures that men and women have exactly the same access to all resources that this society has to offer, the biological factors could be expressed more strongly than in more repressive societies. A stronger sexual dimorphism should therefore be seen more as an expression of a successful gender policy.” I don’t see how the results in their paper (even if you ignore any potential data issues and assume the survey responses have the same meanings in each country) lead to this conclusion.

Here’s another line from the paper: “The results presented suggest that greater sexual dimorphism should not be interpreted as an indicator of a society that discriminates against a particular sex, but rather as an indicator of a successful gender equality policy.” I don’t understand this at all. Even taking the data at face value, they have two measures: D (the measure of sex differences in personality) and GGGI (the measure of the lack of sex differences in outcomes in each country). D is weakly correlated with GGGI. Fine. But then a high value of D is a measure of a high value of D; it’s not an indicator of GGGI. For that matter, GGGI is not a measure of “policy” either.

Finally, I am concerned about some of the details in the paper. For example, on page 7, Antarctica, Puerto Rico, and Andorra are listed as countries. Andorra, maybe, although I can’t imagine we’d learn much from such an odd case. Puerto Rico is of course not a country, and Antarctica even less so.

Anyway, my quick reaction is that it’s a good step for these findings to be published, but I think they are being way overinterpreted.
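To make the measurement-noise point concrete, here’s a toy simulation (made-up numbers, nothing from the paper’s data or code): the same underlying difference between two groups, observed through increasingly noisy survey responses, produces a smaller estimated Mahalanobis D.

```python
# Toy simulation (hypothetical numbers, not the paper's data or analysis):
# the same true group difference, measured with more response noise,
# yields a smaller estimated Mahalanobis D.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5                    # respondents per group, number of traits
true_diff = np.full(p, 0.3)       # hypothetical true mean differences, in SD units

def mahalanobis_D(x, y):
    """Multivariate standardized distance between two groups' mean vectors."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled_cov = (np.cov(x, rowvar=False) + np.cov(y, rowvar=False)) / 2
    return float(np.sqrt(diff @ np.linalg.solve(pooled_cov, diff)))

for noise_sd in [0.0, 1.0, 2.0]:  # more noise = less reliable survey responses
    group_m = rng.normal(true_diff, 1.0, (n, p)) + rng.normal(0.0, noise_sd, (n, p))
    group_f = rng.normal(0.0,       1.0, (n, p)) + rng.normal(0.0, noise_sd, (n, p))
    print(f"noise sd = {noise_sd:.1f}: estimated D = {mahalanobis_D(group_m, group_f):.2f}")
```

The exact numbers don’t matter; the point is that the estimated D shrinks toward zero as the measurement noise grows, so cross-country differences in D could partly reflect differences in how reliably the items are answered rather than differences in personality.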

I sent the above comments to Laurie Rudman, a researcher who works in related areas, and she agreed that assessing and analyzing cross-cultural data can be difficult. Rudman pointed me to this paper, “Mind the level: problems with two recent nation-level analyses in psychology,” by Toon Kuppens and Thomas Pollet, which raises some of these issues (although not with the Kaiser paper discussed above).

P.S. The journalist asked for some clarifications, so I added these points:

  1. What is the Mahalanobis D? When comparing two groups (for example, male and female survey respondents) on a single variable (for example, height), you can just take the difference and report, say, that on average men are one standard deviation taller than women, or whatever it is. When comparing two groups on multiple variables (for example, several different personality assessment survey responses), you can construct a “multivariate distance measure.” Mahalanobis D is one such measure (the standard formula is written out after these two points). Bigger numbers correspond to larger average differences between men and women in whatever variables are measured. In the above-linked paper, my issue with the Mahalanobis D was not how it was defined, but rather with the data used to compute it. I’m concerned (a) that the responses to the survey questions will have different meanings in different countries, (b) that there will be nonresponse bias due to the overrepresentation of English-speaking internet users, and (c) that these issues will be correlated with key variables in the study.

  2. Regarding the point about biological differences: Part of my concern is that I don’t really know what is meant by “biological gender differences” in this context. My other concern is that the conclusions of the paper relate to things not measured in the paper. For example, the paper refers to discrimination, but there’s nothing in the data about discrimination. And the paper refers to a gender equality policy, but there’s nothing in the paper about gender equality policies. Just in general, I’m wary about conclusions that don’t directly connect to the data.
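For reference, here is the standard textbook definition of Mahalanobis D for two groups measured on several variables (this is the general formula, not anything specific to the paper’s implementation):

```latex
% Mahalanobis D between two groups with sample mean vectors \bar{x}_M, \bar{x}_F
% (each of length p) and pooled sample covariance matrix S:
D = \sqrt{\left(\bar{x}_M - \bar{x}_F\right)^{\top} S^{-1} \left(\bar{x}_M - \bar{x}_F\right)}
% With a single variable (p = 1) this reduces to the usual standardized mean
% difference |\bar{x}_M - \bar{x}_F| / s, as in "men are one standard deviation taller."
```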