Using Stacking to Average Bayesian Predictive Distributions (with Discussion)

I’ve posted on this paper (by Yuling Yao, Aki Vehtari, Daniel Simpson, and myself) before, but now the final version has been published, along with a bunch of interesting discussions and our rejoinder.

This has been an important project for me, as it answers a question that’s been bugging me for over 20 years (since the writing of the first edition of BDA): how to average inferences from several fitted Bayesian models. Traditional Bayesian model averaging doesn’t work; see chapter 7 of BDA3 (chapter 6 of the first two editions) for some discussion of why. As McAlinn, Aastveit, and West put it in their discussion of our new paper, “This is not a failure of Bayesian thinking, of course; the message remains follow-the-rules, but understand when assumptions fail to justify blind application of the rules.” Still, it’s been frustrating to have no real alternative. Yes, I always say that continuous model expansion is the better approach, but sometimes you have a few models you can fit, and you want to do something.

Here, for example, is something I wrote on Bayesian model averaging a few years ago, back when I had no idea what to do.

I’m happy with our current solution, which is to consider the combination of inferences from several fitted models as a prediction problem, and use cross-validation to find optimal weights. This solution is not perfect—cross-validation can be noisy, and in any case we should be able to combine models more adaptively rather than estimating fixed weights—but I feel like it’s on the right track, and it gives us a practical method to get reasonable answers.
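To make the idea concrete, here is a minimal sketch in Python of the core optimization, not the paper’s implementation (the paper computes the leave-one-out densities efficiently with Pareto smoothed importance sampling, and our actual code lives in the loo R package). Here I just assume you already have a matrix `lpd` of leave-one-out log predictive densities, one row per data point and one column per model, and `stacking_weights` is a name I made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def stacking_weights(lpd):
    """Stacking of predictive distributions.

    Finds simplex weights w maximizing the average leave-one-out log score
        (1/n) * sum_i log( sum_k w_k * p(y_i | y_{-i}, M_k) ),
    where lpd[i, k] = log p(y_i | y_{-i}, M_k).
    """
    n, K = lpd.shape

    def neg_log_score(z):
        # Softmax parameterization keeps the weights positive and summing
        # to one, so the optimizer can run unconstrained over z.
        w = np.exp(z - logsumexp(z))
        # Log of the weighted mixture of pointwise predictive densities,
        # averaged over the n held-out points (negated for minimization).
        return -np.mean(logsumexp(lpd + np.log(w), axis=1))

    res = minimize(neg_log_score, np.zeros(K), method="BFGS")
    return np.exp(res.x - logsumexp(res.x))

# Toy check: two models, the first slightly better on held-out log density,
# so the weights should lean toward column 0.
rng = np.random.default_rng(0)
lpd = rng.normal(loc=[[-1.0, -1.3]], scale=0.3, size=(100, 2))
print(stacking_weights(lpd))
```

The softmax trick is just one convenient way to handle the simplex constraint; any constrained optimizer over the weights directly would do the same job.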

Here’s the abstract of our paper:

Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions. We extend the utility function to any proper scoring rule and use Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with bootstrapped-Pseudo-BMA as an approximate alternative when computation cost is an issue.

And the journal, Bayesian Analysis, published it with discussions by:

- Bertrand Clarke
- Meng Li
- Peter Grünwald and Rianne de Heide
- A. Philip Dawid
- William Weimin Yoo
- Robert L. Winkler, Victor Richmond R. Jose, Kenneth C. Lichtendahl Jr., and Yael Grushka-Cockayne
- Kenichiro McAlinn, Knut Are Aastveit, and Mike West
- Minsuk Shin
- Tianjian Zhou
- Lennart Hoogerheide and Herman K. van Dijk
- Haakon C. Bakka, Daniela Castro-Camilo, Maria Franco-Villoria, Anna Freni-Sterrantino, Raphaël Huser, Thomas Opitz, and Håvard Rue
- Marco A. R. Ferreira
- Luis Pericchi
- Christopher T. Franck
- Eduard Belitser and Nurzhan Nurushev
- Matteo Iacopini and Stefano Tonellato

and a rejoinder by us.