The scandal isn’t what’s retracted, the scandal is what’s not retracted.

Andrew Han at Retraction Watch reports on a paper, “Structural stigma and all-cause mortality in sexual minority populations,” published in 2014 by Mark Hatzenbuehler, Anna Bellatorre, Yeonjin Lee, Brian Finch, Peter Muennig, and Kevin Fiscella, that claimed:

Sexual minorities living in communities with high levels of anti-gay prejudice experienced a higher hazard of mortality than those living in low-prejudice communities (Hazard Ratio [HR] = 3.03, 95% Confidence Interval [CI] = 1.50, 6.13), controlling for individual and community-level covariates. This result translates into a shorter life expectancy of approximately 12 years (95% C.I.: 4–20 years) for sexual minorities living in high-prejudice communities.

Hatzenbuehler et al. attributed some of this to “an 18-year difference in average age of completed suicide between sexual minorities in the high-prejudice (age 37.5) and low-prejudice (age 55.7) communities,” but the whole thing still doesn’t seem to add up. Suicide is an unusual cause of death. To increase the instantaneous probability of dying (that is, the hazard rate) by a factor of 3, you need to drastically increase the rate of major diseases, and it’s hard to see how living in communities with high levels of anti-gay prejudice would do that, after controlling for individual and community-level covariates.
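
To get a sense of just how large a hazard ratio of 3 is, here’s a back-of-the-envelope sketch under a deliberately oversimplified constant-hazard (exponential) model. The 40-year baseline remaining life expectancy is a number I made up for illustration, and this is not the survival model the authors actually fit:

```python
# Back-of-the-envelope: what a hazard ratio of 3 means under a constant-hazard
# (exponential) survival model. This is NOT the model the authors fit (they
# used Cox models with covariates); it's only a way to see how big the claimed
# effect is.

baseline_hazard = 1 / 40      # hypothetical: 40 years of expected remaining life
hazard_ratio = 3.03           # the paper's headline estimate

# Under an exponential model, expected remaining life = 1 / hazard.
years_low_prejudice = 1 / baseline_hazard
years_high_prejudice = 1 / (hazard_ratio * baseline_hazard)

print(f"low-prejudice communities:  {years_low_prejudice:.1f} years remaining")
print(f"high-prejudice communities: {years_high_prejudice:.1f} years remaining")
print(f"gap: {years_low_prejudice - years_high_prejudice:.1f} years")
# Taken literally, a hazard ratio of 3 cuts expected remaining life by a factor
# of 3 -- an enormous effect for an all-cause mortality comparison, which is
# why an estimate this size should set off alarm bells.
```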

Part 1. The original paper was no good.

Certain aspects of the design of the study were reasonable: they used some questions on the General Social Survey (GSS) about attitudes toward sexual minorities, aggregated these by geography to identify areas which, based on these responses, corresponded to positive and negative attitudes, then for each area they computed the mortality rate of a subset of respondents in that area, which they were able to do using “General Social Survey/National Death Index (GSS-NDI) . . . a new, innovative prospective cohort dataset in which participants from 18 waves of the GSS are linked to mortality data by cause of death.” They report that “Of the 914 sexual minorities in our sample, 134 (14.66%) were dead by 2008.” (It’s poor practice to call this 14.66% rather than 15%—it would be kinda like saying that Steph Curry is 6 feet 2.133 inches tall—but this is not important for the paper; it’s only an indirect sign of concern, in that it suggests a level of innumeracy on the authors’ part that they let this slip into the paper.)

The big problem with this paper—what makes it dead on arrival even before any of the data are analyzed—is that you’d expect most of the deaths to come from heart disease, cancer, etc., and the most important factor predicting death rates will be the age of the respondents in the survey. Any correlation between age of respondent and the anti-gay prejudice predictor will drive the results. Yes, they control for age in their analysis, but throwing in such a predictor won’t solve the problem. The next things to worry about are sex and smoking status, but really the game’s already over. To look at aggregate mortality here is to go hunting for a needle in a haystack. And all this is exacerbated by the small sample size. It’s hard enough to understand mortality-rate comparisons with aggregate CDC data. How can you expect to learn anything from a sample of 900 people?
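
To put a rough number on the noise problem, here’s a quick calculation using the standard approximation that the variance of a log hazard ratio is about 1/d1 + 1/d0, where d1 and d0 are the death counts in the two groups. The group sizes below are my own guesses from the paper’s reported totals (914 respondents, 134 deaths, exposure defined as the top quartile), and the approximation ignores the covariate adjustment, which only makes things noisier:

```python
import math

# Rough precision calculation for a two-group hazard-ratio comparison, using
# the standard approximation var(log HR) ~ 1/d1 + 1/d0, where d1 and d0 are
# the numbers of deaths in the exposed and unexposed groups. The split below
# is my own guess from the reported totals.

n_total, deaths_total = 914, 134
n_exposed = round(0.25 * n_total)                        # top-quartile "high prejudice" group
d_exposed = round(deaths_total * n_exposed / n_total)    # deaths split in proportion to group size
d_unexposed = deaths_total - d_exposed

se_log_hr = math.sqrt(1 / d_exposed + 1 / d_unexposed)
ci_factor = math.exp(1.96 * se_log_hr)

print(f"se(log HR) ~ {se_log_hr:.2f}")
print(f"a 95% CI multiplies and divides the point estimate by about {ci_factor:.2f}")
# Even in this best case, only hazard ratios of roughly 1.5 or more can reach
# statistical significance, so if the true effect is modest, any estimate that
# clears that bar is necessarily a large overestimate (a type M error).
```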

Once the analysis comes in, the problem becomes even clearer, as the headline result is a 3-fold increase in the hazard rate, which is about as plausible as that claim we discussed a few years ago that women were 3 times more likely to wear red or pink clothing during certain times of the month, or the claim, discussed here even earlier, that beautiful parents were 36 percent more likely to have girls.

Beyond all that—and in common with these two earlier studies—this new paper has forking paths, most obviously in that the key predictor is not the average anti-gay survey response by area, but rather an indicator for whether that average falls in the top quartile.
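
Here’s a small simulation of my own (not the paper’s data or code) showing how this kind of choice opens up forking paths: even when a continuous stigma score has no relationship to mortality at all, an analyst who tries a few different cutoffs gets several shots at statistical significance:

```python
import numpy as np

# Forking-paths demo on simulated null data (my own simulation, not the
# paper's data): a continuous "stigma" score has NO effect on mortality, but
# trying several dichotomizations of it gives several shots at p < 0.05.

rng = np.random.default_rng(0)
n, n_sims = 914, 2000
cutoffs = [0.50, 0.75, 0.90]            # median, top quartile, top decile
datasets_with_a_hit = 0

for _ in range(n_sims):
    stigma = rng.normal(size=n)                  # null predictor
    dead = rng.random(n) < 0.15                  # ~15% die, independent of stigma
    hit = False
    for q in cutoffs:
        high = stigma > np.quantile(stigma, q)
        p1, p0 = dead[high].mean(), dead[~high].mean()
        pooled = dead.mean()
        se = np.sqrt(pooled * (1 - pooled) * (1 / high.sum() + 1 / (~high).sum()))
        if abs(p1 - p0) / se > 1.96:             # two-sided p < 0.05
            hit = True
    datasets_with_a_hit += hit

print(f"null datasets with at least one 'significant' cutoff: "
      f"{datasets_with_a_hit / n_sims:.2f}")
# Noticeably above the nominal 0.05 -- and the cutoff is only one of many
# forks (imputation choices, covariate sets, etc.) available in the real
# analysis.
```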

To quickly summarize:

1. This study never had a chance of working.
2. Once you look at the estimate, it doesn’t make sense.
3. The data processing and analysis had forking paths.

Put this all together and you get: (a) there’s no reason to believe the claims in this paper, and (b) you can see how the researchers obtained statistical significance, which fooled them and the journal editors into thinking that their analysis told them something generalizable to the real world.

Part 2. An error in the data processing.

So far, so boring. A scientific journal publishes a paper that makes bold, unreasonable claims based on a statistical analysis of data that could never realistically have supported them. Hardly worth noting in Retraction Watch.

No, what made the story noteworthy was the next chapter, when Mark Regnerus published a paper, “Is structural stigma’s effect on the mortality of sexual minorities robust? A failure to replicate the results of a published study,” reporting:

Hatzenbuehler et al.’s (2014) study of structural stigma’s effect on mortality revealed an average of 12 years’ shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice . . . Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study’s key finding on structural stigma.

Regnerus continues:

The effort to replicate the original study was successful in everything except the creation of the PSU-level structural stigma variable.

Regnerus attributes the problem to the authors’ missing-data imputation procedure, which was not clearly described in the original paper. It’s also a big source of forking paths:

Minimally, the findings of Hatzenbuehler et al. (2014) study of the effects of structural stigma seem to be very sensitive to subjective decisions about the imputation of missing data, decisions to which readers are not privy. Moreover, the structural stigma variable itself seems questionable, involving quite different types of measures, the loss of information (in repeated dichotomizing) and an arbitrary cut-off at a top-quartile level. Hence the original study’s claims that such stigma stably accounts for 12 years of diminished life span among sexual minorities seems unfounded, since it is entirely mitigated in multiple attempts to replicate the imputed stigma variable.

Regnerus also points out the sensitivity of any conclusions to confounding in the observational study. Above, I mention age, sex, and smoking status; Regnerus mentions ethnicity as a possible confounding variable. Again, controlling for such variables in a regression model is a start, but only a start; it is a mistake to think that if you throw a possible confounder into a regression, you’ve resolved the problem.
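
To see why, here’s a toy simulation of my own, with made-up numbers: the exposure has no effect, mortality risk depends only on age, and the analyst adjusts for age in a regression, but age is only available as a noisy proxy. The adjustment removes part of the confounding and leaves the rest:

```python
import numpy as np

# Toy simulation (my own, with invented numbers): the exposure has NO effect
# on mortality, risk is driven entirely by age, and the exposure is correlated
# with age. Adjusting for age works only as well as age is measured.

rng = np.random.default_rng(1)
n = 200_000
age = rng.uniform(25, 75, n)
exposure = (age + rng.normal(0, 15, n) > 60).astype(float)   # correlated with age
risk = 0.002 * (age - 20)                                    # rises with age only
dead = (rng.random(n) < risk).astype(float)
age_proxy = age + rng.normal(0, 15, n)                       # mismeasured confounder

def exposure_coef(control):
    """Linear-probability regression of death on exposure plus one control."""
    X = np.column_stack([np.ones(n), exposure, control])
    return np.linalg.lstsq(X, dead, rcond=None)[0][1]

print("adjusting for true age: ", round(exposure_coef(age), 4))        # ~0
print("adjusting for the proxy:", round(exposure_coef(age_proxy), 4))  # clearly > 0
# "Controlling for age" in the second regression does not fix the problem;
# the same issue arises with coarse categories, wrong functional forms, or
# confounders that were never measured at all.
```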

A couple months later, a correction note by Hatzenbuehler et al. appeared in the journal:

Following the publication of Regnerus’s (2017) paper, we hired an independent research group . . . A coding error was discovered. Specifically, the data analyst mis-specified the time variable for the survival models, which incorrectly addressed the censoring for individuals who died. The time variable did not correctly adjust for the time since the interview to death due to a calculation error, which led to improper censoring of the exposure period. Once the error was corrected, there was no longer a significant association between structural stigma and mortality risk among the sample of 914 sexual minorities.
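
To make concrete what the time variable should look like, here’s a minimal sketch of how follow-up time and the censoring indicator are typically constructed in this kind of analysis. The column names and toy rows are my own invention and aren’t meant to match the actual GSS-NDI file layout, and the correction note doesn’t spell out exactly how the original calculation went wrong:

```python
import pandas as pd

# A minimal sketch of how the follow-up time and censoring indicator for a
# survival model like this are usually built. Column names and rows are toy
# examples, not the real GSS-NDI layout.

df = pd.DataFrame({
    "interview_year": [1990, 1996, 2004],
    "death_year":     [2001, None, None],   # None = no death-record match by 2008
})

END_OF_FOLLOWUP = 2008

# Event indicator: 1 if the respondent died during follow-up, 0 if censored.
df["died"] = df["death_year"].notna().astype(int)

# Time at risk must run from each person's OWN interview to their death, or to
# the end of follow-up if they were still alive then (right-censoring).
df["years_at_risk"] = df["death_year"].fillna(END_OF_FOLLOWUP) - df["interview_year"]

print(df)
# Measuring this interval incorrectly -- say, from a fixed calendar year rather
# than from the interview, or with the wrong censoring date -- is the kind of
# coding slip the correction describes, and it was enough to change the
# headline result.
```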

I can’t get too upset about this particular mistake, given that I did something like that myself a few years back!

Part 3. Multiple errors in the same paper.

But here’s something weird: The error reported by Hatzenbuehler et al. in their correction is completely unrelated to the problems reported by Regnerus. The paper had two completely different data problems, one involving imputation and the construction of the structural stigma variable, and another involving the measurement of survival time. And both of these errors are distinct from the confounding problem and the noise problem.

How did this all happen? It’s surprisingly common to see multiple errors in a single published paper. It’s rare for authors to reach the frequency of a Brian Wansink (150 errors in four published papers, and that was just the start of it) or a Richard Tol (approaching the Platonic ideal of more errors in a paper than there are data points), but multiple unrelated errors—that happens all the time.

Why does this happen? Here are a few reasons:

  1. Satisficing. The goal of the analysis is to obtain statistical significance. Once that happens, there’s not much reason to check.

  2. Over-reliance on the peer-review process. This happens in two ways. First, you might be less careful about checking your manuscript, knowing that three reviewers will be going through it anyway. Second—and this is the real problem—you might assume that reviewers will catch all the errors, and then when the paper is accepted by the journal you’d think that no errors remain.

  3. A focus on results. The story of the paper is that prejudice is bad for your health. (See for example this Reuters story by Andrew Seaman, linked to from the Retraction Watch article.) If you buy the conclusion, the details recede.

There’s no reason to think that errors in scientific papers are rare, and there’s no reason to be surprised that a paper with one documented error will have others.

Let’s give credit where due. It was a mistake for the journal Social Science and Medicine to publish that original paper, but, on the plus side, they published Regnerus’s criticism. Much better to be open with criticism than to play defense.

I’m not so happy, though, with Hatzenbuehler et al.’s correction notice. They do cite Regnerus, but they don’t mention that the error he noticed is completely different from the error they report. Their correction notice gives the misleading impression that their paper had just this one error. That ain’t cool.

Part 4. The reaction.

We’ve already discussed the response of Hatzenbuehler et al., which is not ideal but I’ve seen worse. At least they admitted one of their mistakes.

In the Retraction Watch article, Han linked to some media coverage, including this news article by Maggie Gallagher in the National Review with the headline, “A widely reported study on longevity of homosexuals appears to have been faked.” I suppose that’s possible, but my guess is that the errors came from sloppiness, plus not bothering to check the results.

Han also reports:

Nathaniel Frank, a lawyer who runs the What We Know project, a catalog of research related to LGBT issues housed at Columbia Law School, told Retraction Watch:

Mark Regnerus destroyed his scholarly credibility when, as revealed in federal court proceedings, he allowed his ideological beliefs to drive his conclusions and sought to obscure the truth about how he conducted his own work. There’s an enormous body of research showing the harms of minority stress, and Regnerus is simply not a trustworthy critic of this research.

Hey. Don’t shoot the messenger. Hatzenbuehler et al. published a paper that had multiple errors, used flawed statistical techniques, and came up with an implausible conclusion. Frank clearly thinks this area of research is important. If so, he should be thanking Regnerus for pointing out problems with this fatally flawed paper and for motivating Hatzenbuehler et al. to perform the reanalysis that revealed another mistake. Instead of thanking Regnerus, Frank attacks him. What’s that all about? You can dislike a guy but that shouldn’t make you dismiss his scientific contribution. Yes, maybe sometimes it takes someone predisposed to be skeptical of a study to go to the trouble to find its problems. We should respect that.

Part 5. The big picture.

As the saying goes, the scandal isn’t what’s illegal, the scandal is what’s legal. Michael Kinsley said this in the context of political corruption. Here I’m not talking about corruption or even dishonesty, just scientific error—which includes problems in data coding, incorrect statistical analysis, and, more generally, grandiose claims that are not supported by the data.

This particular paper has been corrected, but lots and lots of other papers have not, simply because they have no obvious single mistake or “smoking gun.”

The scandal isn’t what’s retracted, the scandal is what’s not retracted. All sorts of papers whose claims aren’t supported by their data.

The Retraction Watch article reproduces the cover of the journal Social Science and Medicine:

Hmmm . . . what’s that number in the circle on the bottom right? “2.742,” it appears? I couldn’t manage to blow up this particular image but I found something very similar in high resolution on Wikipedia:

“Most Cited Social Science Journal” “Impact Factor 2.814.”

I have no particular problem with the journal, Social Science and Medicine. They just happened to have the good fortune that, after publishing a paper with major problems, they were handed an occasion to correct it. Standard practice is, unfortunately, for errors in the published record to go unnoticed, or, when they are noticed, for the authors of the paper to respond with denials that there was ever a problem.