“Identification of and correction for publication bias,” and another discussion of how forking paths is not the same thing as file drawer

Max Kasy and Isaiah Andrews sent along this paper, which begins:

Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study’s results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.
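Before getting to my comments, here's a rough sense of what this kind of correction looks like. The sketch below uses a deliberately stylized selection model of my own choosing (a z-statistic that's normal with unit variance, significant results always published, insignificant ones published with probability 0.1); the paper's actual estimators are more general, but the mechanics are similar: find the effect size at which the observed result would sit at the median of the selected sampling distribution.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Stylized selection model (my illustration, not the paper's code): a study's
# z-statistic is N(mu, 1); results with |z| > 1.96 are always published,
# insignificant results only with probability BETA.
Z_CRIT, BETA = 1.96, 0.10

def cdf_published(z, mu):
    """CDF of the z-statistic among published studies, given true effect mu."""
    lo, hi = -Z_CRIT, Z_CRIT
    def partial(u):  # integral of phi(t - mu) * p(t) from -infinity to u
        left  = norm.cdf(min(u, lo) - mu)                                       # left tail, p = 1
        mid   = BETA * (norm.cdf(np.clip(u, lo, hi) - mu) - norm.cdf(lo - mu))  # middle, p = BETA
        right = norm.cdf(max(u, hi) - mu) - norm.cdf(hi - mu)                   # right tail, p = 1
        return left + mid + right
    return partial(z) / partial(np.inf)

def median_unbiased(z_obs):
    """Effect size mu at which z_obs sits at the median of the selected distribution."""
    return brentq(lambda mu: cdf_published(z_obs, mu) - 0.5, z_obs - 10, z_obs + 10)

z_obs = 2.10  # a just-significant published estimate
print("naive estimate:    ", z_obs)
print("corrected estimate:", round(median_unbiased(z_obs), 2))  # far below the naive 2.10
```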

I sent them these comments:

  1. This recent discussion might be relevant. My quick impression is that this sort of modeling (whether parametric or nonparametric; I can see the virtues of both approaches) could be useful for demonstrating the problems of selection in a literature, and for setting some lower bound on how bad the selection could be, but less valuable as a way of coming up with some sort of corrected estimate. In that way, it’s similar to our work on type M and type S errors, which are more of a warning than a solution. I think your view may be similar, in that you mostly talk about biases, without making strong claims about the ability to use your method to correct these biases.

  2. In section 2.1 you have your model where “a journal receives a stream of studies” and then decides which one to publish. I’m not sure how important this is for your mathematical model, but it’s my impression that most of the selection occurs within papers: a researcher gets some data and then has flexibility to analyze it in different ways, to come up with statistically significant conclusions or not. Even preregistered studies are subject to a lot of flexibility in interpretation of data analysis; see for example here.

  3. Section 2.1 of this paper may be of particular interest to you as it discusses selection bias and the overestimation of effect sizes in a study that was performed by some economists. It struck me as ironic that economists, who are so aware of selection bias in so many ways, have been naive in taking selected point estimates without recognizing their systematic problems. It seems that, for many economists, identification and unbiased estimation (unconditional on selection) serve as talismans which provide a sort of aura or blessing to an entire project, allowing the researchers to turn off their usual skepticism. Sad, really.

  4. You’ll probably agree with this point too: I think it’s a terrible attitude to say that a study replicates “if both the original study and its replication find a statistically significant effect in the same direction.” Statistical significance is close to meaningless at the best of times, but it’s particularly ridiculous here, in that using such a criterion is just throwing away information and indeed compounding the dichotomization that causes so many problems in the first place.

  5. I notice you cite the paper of Gilbert, King, Pettigrew, and Wilson (2016). I’d be careful about citing that paper, as I don’t think the authors of that paper knew what they were doing. I think that paper was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric, and I don’t recommend citing it. If you do want to refer to it, I really think you should point out that its arguments are what academics call “tendentious” and what Mark Twain called “stretchers,” and you could refer to this post by Brian Nosek and Elizabeth Gilbert explaining what was going on. As I’m sure you’re aware, the scientific literature gets clogged with bad papers, and I think we should avoid citing bad papers uncritically, even if they appear in what are considered top journals.

Kasy replied:

1) We wouldn’t want to take our corrections too literally, either, but we believe there is some value in saying “if this is how selection operates, this is how you would perform corrected frequentist inference (estimation & confidence sets).” Of course our model is not going to capture all the distortions that are going on, and so the resulting numbers should not be taken as literal truth, necessarily.

2) We think of our publication probability function p(.) as a “reduced form” object which is intended to capture selectivity by both researchers and journals. Our discussion is framed for concreteness in terms of journal decisions, but for our purposes journal decisions are indistinguishable from researcher decisions. In terms of assessing the resulting biases, I think it doesn’t really matter who performs the selection?

3) Agreed; there seems to be some distortion in the emphasis put by empirical economists. Though we’d also think that the focus on “internal validity” has led to improvements in empirical practice relative to earlier times?

4) Agree completely. We attempt to clarify this point in Section 3.3 of our paper. The key point to us seems to be that, depending on the underlying distribution of true effects across studies, pretty much any value for “Probability that the replication Z > 1.96, given that the original Z > 1.96” is possible (with a lower bound of 0.025), even in the absence of any selectivity.
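Kasy’s point 4 is easy to check by simulation. Here’s a quick sketch (the three distributions of true effects are invented for illustration): original and replication studies with the same design, no selection anywhere, and the “replication rate” under the significance criterion still lands anywhere from about 0.025 up to nearly 1, depending on the distribution of true effects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 1_000_000  # number of original/replication study pairs

# True effects across studies, in z-score units (illustrative choices only).
scenarios = {
    "all true effects zero": np.zeros(n_sim),
    "mix of small effects":  rng.choice([0.0, 1.0, 2.0], size=n_sim),
    "all effects large":     np.full(n_sim, 4.0),
}

for name, mu in scenarios.items():
    z_orig = mu + rng.standard_normal(n_sim)  # original study
    z_rep  = mu + rng.standard_normal(n_sim)  # exact replication, no selection
    sig = z_orig > 1.96
    rate = np.mean(z_rep[sig] > 1.96)
    print(f"{name:22s}  P(Z_rep > 1.96 | Z_orig > 1.96) = {rate:.3f}")
```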

Regarding Kasy’s response to point 2, “it doesn’t really matter who performs the selection?”, I’m not so sure.

Here’s my concern: in the “file drawer” paradigm, there’s some fixed number of results that are streaming through, and some get selected for publication. In the “forking paths” paradigm, there’s a virtually unlimited number of possible findings, so it’s not clear that it makes sense to speak of a latent distribution of results.

One other thing: in the “file drawer” paradigm, the different studies are independent. In the “forking paths” paradigm, the different possible results all come from the same data, so they don’t add much independent information.
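To make that concrete, here’s a toy simulation (all numbers invented): twenty “analyses” of a single dataset, stood in for by overlapping random subsets, versus twenty independent studies, all with true effect zero. The forking-paths z-statistics are substantially correlated with one another, whereas the file-drawer z-statistics are essentially independent, so the same selection rule plays out quite differently in the two settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, n_sim = 100, 20, 5000   # observations per study, analyses/studies, simulations
Z_CRIT = 1.96

def zstat(x):
    """z-statistic for testing a zero mean."""
    return np.sqrt(len(x)) * x.mean() / x.std(ddof=1)

z_file = np.empty((n_sim, K))  # "file drawer": K independent studies
z_fork = np.empty((n_sim, K))  # "forking paths": K analyses of one dataset
for s in range(n_sim):
    z_file[s] = [zstat(rng.normal(0.0, 1.0, n)) for _ in range(K)]
    data = rng.normal(0.0, 1.0, n)  # one shared dataset, true effect zero
    # stand-in for analyst degrees of freedom: overlapping random subsets
    z_fork[s] = [zstat(rng.choice(data, n // 2, replace=False)) for _ in range(K)]

for name, z in [("file drawer", z_file), ("forking paths", z_fork)]:
    pair_corr = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
    any_sig = np.mean(np.any(np.abs(z) > Z_CRIT, axis=1))
    print(f"{name:13s}  corr(analysis 1, analysis 2) = {pair_corr:.2f}, "
          f"P(at least one |Z| > 1.96) = {any_sig:.2f}")
```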