Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” – The Mountain Goats paraphrasing Romans 10:9

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me (because I’d not heard it phrased that well before; as a side note, this is my memory of the gist of what he was saying, so do not hold him to this opinion!) was that the problem with p-values and null-hypothesis significance testing (NHST) wasn’t so much that the procedure itself was bad. The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion of Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’” and it expands on points that Aki and I have made at various times on this blog.

To summarize our key points:

  1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness: LOO estimates are computed from the observed data and therefore come with uncertainty (see the first sketch after this list). In their very specialized examples this turns out not to matter, because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.

  2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights (see the second sketch after this list). Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is independent of the number of data points.

  3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable).  We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.

  4. Because LOO is formulated under much weaker assumptions than those used in this paper (in particular, LOO does not assume that the data is generated by one of the models under consideration, the so-called “M-Closed assumption”), it is a little odd that GW only assess its performance under this assumption. The M-Closed assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!

  5. GW’s assertion that when two models can support identical predictions (such as in the case of nested models) the simpler model should be preferred is not a universal truth, but rather a specific choice that is being made. This preference can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority; it should instead be critically assessed in the context of the task being performed.
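
To make the first point a bit more concrete, here is a minimal numpy sketch of the quantity that gets ignored when LOO comparisons are treated as exact: the standard error of the estimated elpd difference between two models. The pointwise values are made up for illustration; in a real analysis they would come from exact LOO refits or PSIS-LOO, and this is not the code from our discussion.

```python
import numpy as np

# Hypothetical pointwise LOO log predictive densities for two models,
# one value per held-out observation. The numbers here are invented;
# in practice they come from exact LOO refits or PSIS-LOO.
rng = np.random.default_rng(1)
n = 100
elpd_i_m1 = rng.normal(-1.0, 0.3, size=n)  # model 1
elpd_i_m2 = rng.normal(-1.1, 0.3, size=n)  # model 2

# Point estimates of elpd_loo for each model.
elpd_m1, elpd_m2 = elpd_i_m1.sum(), elpd_i_m2.sum()

# The pointwise differences carry the uncertainty of the comparison,
# not just its point value.
diff_i = elpd_i_m1 - elpd_i_m2
elpd_diff = diff_i.sum()
se_diff = np.sqrt(n * diff_i.var(ddof=1))

print(f"elpd_loo model 1: {elpd_m1:.1f}")
print(f"elpd_loo model 2: {elpd_m2:.1f}")
print(f"difference: {elpd_diff:.1f} (SE {se_diff:.1f})")
# A difference that is small relative to its standard error is weak
# grounds for selecting one model over the other.
```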
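
And here is an equally rough sketch of the stacking idea from the second point: choose model weights on the simplex to maximise the LOO log score of the combined predictive distribution. This is only an illustration under simplified assumptions (the pointwise densities are invented and the optimiser is a generic scipy routine); it is not the implementation in the loo package or in our paper.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(lpd_points: np.ndarray) -> np.ndarray:
    """Stacking weights from an (n_obs, n_models) array of pointwise LOO
    log predictive densities: maximise the LOO log score of the weighted
    predictive mixture over the simplex of model weights."""
    n_obs, n_models = lpd_points.shape
    dens = np.exp(lpd_points)  # pointwise predictive densities

    def neg_log_score(w):
        return -np.sum(np.log(dens @ w + 1e-300))

    res = minimize(
        neg_log_score,
        x0=np.full(n_models, 1.0 / n_models),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# Made-up example: model 2 predicts the held-out points better on average,
# so it receives most of the weight.
rng = np.random.default_rng(2)
lpd = np.column_stack([rng.normal(-1.2, 0.2, 200), rng.normal(-0.9, 0.2, 200)])
print(stacking_weights(lpd))
```

The contrast with pseudo-BMA weights, which are essentially a softmax of the total elpd estimates, is that stacking looks at how the models’ predictions combine pointwise rather than at their overall scores alone, which is why we describe it as closer to the LOO-predictive concept.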

All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post, which you should all read, outlining her experiences with various existing model selection methods.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably generate the data. In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they cannot determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct. We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.