Response to Rafa: Why I don’t think ROC [receiver operating characteristic] works as a model for science

Someone pointed me to this post from a few years ago where Rafael Irizarry argues that scientific “pessimists” such as myself are, at least in some fields, “missing a critical point: that in practice, there is an inverse relationship between increasing rates of true discoveries and decreasing rates of false discoveries and that true discoveries from fields such as the biomedical sciences provide an enormous benefit to society.” So far so good—within the framework in which the goal of p-value-style science is to make “discoveries” and in which these discoveries can be characterized as “true” or “false.”

But I don’t see this framework as being such a useful description of science, or at least the sort of science for which statistical hypothesis tests, confidence intervals, etc., are used. Why do I say this? Because I see the following sorts of statistical analysis:

– Parameter estimation and inference, for example estimation of a treatment effect. The goal of a focused randomized clinical trial is not to make a discovery—any “discovery” to be had was made before the study began, in the construction of the new treatment. Rather, the goal is to estimate the treatment effect (or perhaps to demonstrate that the treatment effect is nonzero, which is a byproduct of estimation).
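
To make the estimation framing concrete, here's a toy sketch in Python (simulated data, made-up numbers, nothing from any real trial): the end product is an estimate of the treatment effect with an uncertainty interval, not a binary discovery call.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated outcomes from a two-arm randomized trial (hypothetical data).
control = rng.normal(loc=0.0, scale=1.0, size=200)
treated = rng.normal(loc=0.3, scale=1.0, size=200)

# Estimate the treatment effect as the difference in means,
# with a standard error and a 95% uncertainty interval.
effect = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
print(f"estimated effect: {effect:.2f} +/- {se:.2f}")
print(f"95% interval: ({effect - 1.96 * se:.2f}, {effect + 1.96 * se:.2f})")
```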

– Exploration. That describes much of social science. Here one could say that discoveries are possible, and even that the goal is discovery, but we’re not discovering statements that are true or false. For example, in our red-state blue-state analysis we discovered an interesting and previously unknown pattern in voting—but I don’t see the ROC framework being applicable here. It’s not like it would make sense to say that if our coefficient estimate or z-score or whatever is higher than some threshold, we declare the pattern to be real, and otherwise not. Rather, we see a pattern and use statistical analysis (multilevel modeling, partial pooling, etc.) to give our best estimate of the underlying voting patterns and of our uncertainties. I don’t see the point of dichotomizing: we found an interesting pattern, we did auxiliary analyses to understand it, it can be further studied using new data on new elections, etc.
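
As a toy illustration of what partial pooling does (invented numbers, not the actual red-state blue-state data or model): noisy group-level estimates get shrunk toward the overall mean, more strongly when they are noisier, rather than being declared real or not real.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: noisy estimates y of some pattern in J groups (think: states),
# each with its own standard error. All numbers are invented.
J = 10
true_effects = rng.normal(0.0, 0.5, size=J)
se = rng.uniform(0.2, 1.0, size=J)
y = true_effects + rng.normal(0.0, se)

# Partial pooling: shrink each noisy estimate toward the overall mean,
# more strongly when its standard error is large relative to the
# between-group standard deviation tau (fixed here for simplicity;
# a real multilevel model would estimate it).
tau = 0.5
weight = tau**2 / (tau**2 + se**2)          # how much to trust each raw estimate
pooled_mean = np.average(y, weights=1 / se**2)
partially_pooled = weight * y + (1 - weight) * pooled_mean

for raw, shrunk in zip(y, partially_pooled):
    print(f"raw: {raw:+.2f}  partially pooled: {shrunk:+.2f}")
```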

– OK, you might say that this is fine in social science, but if you’re the FDA you have to approve or not approve a new drug, and if you’re a drug company you have to decide to proceed with a candidate drug or give it up. Decisions need to be made. Sure, but here I’d prefer to use formal decision analysis with costs and benefits. If this points us toward taking more risks—for example, approving drugs whose net benefit remains very uncertain—so be it. This fits Rafael’s ROC story, but not based on any fixed p-value or posterior probability; see my paper with McShane et al.

Also, again, the discussion of “false positive rate” and “true positive rate” seems to miss the point. If you’re talking about drugs or medical treatments: well, lots of them have effects, but the effects are variable, positive for some people and negative for others.
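
Here's a rough sketch of the kind of decision analysis I have in mind, with made-up costs and benefits standing in for real ones: the approval decision depends on expected net benefit under the posterior for the treatment effect, not on whether some test statistic crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(7)

# Posterior draws for the average treatment effect (invented numbers standing
# in for the output of a fitted model). The effect is uncertain and could
# well be near zero or negative.
effect_draws = rng.normal(loc=0.10, scale=0.15, size=10_000)

# Made-up costs and benefits, in some common unit (say, dollars per patient):
benefit_per_unit_effect = 1000.0   # value of a one-unit improvement
cost_of_treatment = 50.0           # price, side effects, etc.

net_benefit_draws = benefit_per_unit_effect * effect_draws - cost_of_treatment

print(f"P(effect > 0)         = {np.mean(effect_draws > 0):.2f}")
print(f"expected net benefit  = {net_benefit_draws.mean():.1f}")
print(f"P(net benefit > 0)    = {np.mean(net_benefit_draws > 0):.2f}")
# Decision rule: approve (or keep developing) if expected net benefit is
# positive, even when the effect itself is far from "statistically significant."
print("approve" if net_benefit_draws.mean() > 0 else "don't approve")
```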

– Finally, consider the “shotgun” sort of study in which a large number of drugs, or genes, or interactions, or whatever, are tested, and the goal is to discover which ones matter. Again, I’d prefer a decision-theoretic framework, moving away from the idea of statistical “discovery” toward mere “inference.”
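
A toy version of that, with invented numbers: shrink the noisy estimates using a prior, then decide which candidates to follow up given a fixed budget, rather than declaring “discoveries” at a significance cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "shotgun" screen: noisy estimates for many candidates (genes, drugs, ...),
# most of which have small true effects. All numbers are invented.
n = 1000
true_effects = rng.normal(0.0, 0.2, size=n)
se = rng.uniform(0.1, 0.5, size=n)
estimates = true_effects + rng.normal(0.0, se)

# Instead of declaring "discoveries" at a threshold, shrink each estimate
# toward zero (normal prior with sd tau), then follow up the candidates with
# the largest shrunken effects, as many as the budget allows.
tau = 0.2
shrunk = estimates * tau**2 / (tau**2 + se**2)
budget = 20
follow_up = np.argsort(-np.abs(shrunk))[:budget]

print("candidates to study further:", follow_up)
```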

What are the practical implications of all this? It’s good for researchers to present their raw data, along with clean summary analyses. Report what your data show, and publish everything! But when it comes to decision making, including the decision of what lines of research to pursue further, I’d go Bayesian, incorporating prior information and making the sources and reasoning underlying that prior information clear, and laying out costs and benefits. Of course that’s all a lot of work, and I don’t usually do it myself. Look at my applied papers and you’ll see tons of point estimates and uncertainty intervals, and only a few formal decision analyses. Still, I think it makes sense to think of Bayesian decision analysis as the ideal form and to interpret inferential summaries in light of these goals. Or, even more short-term than that, if people are using statistical significance to make publication decisions, we can do our best to correct for the resulting biases, as in section 2.1 of this paper.
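
I won’t try to reproduce the method from that paper here, but here is a small simulation (made-up numbers) of the bias that such a correction has to deal with: conditioning on statistical significance exaggerates effect-size estimates, so published-and-significant estimates should be shrunk, not taken at face value.

```python
import numpy as np

rng = np.random.default_rng(11)

# Suppose a study's true effect is small relative to its standard error, and
# results are published only when |z| > 1.96. How biased are the published
# estimates? (Invented numbers; a generic illustration, not the specific
# correction in the paper.)
true_effect = 0.1
se = 0.2
estimates = rng.normal(true_effect, se, size=1_000_000)
published = estimates[np.abs(estimates / se) > 1.96]

print(f"true effect:                 {true_effect:.2f}")
print(f"mean published |estimate|:   {np.abs(published).mean():.2f}")
print(f"exaggeration ratio (type M): {np.abs(published).mean() / true_effect:.1f}")
```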