Parsimonious principle vs integration over all uncertainties

tl;dr If you have bad models, bad priors or bad inference choose the simplest possible model. If you have good models, good priors, good inference, use the most elaborate model for predictions. To make interpretation easier you may use a smaller model with similar predictive performance as the most elaborate model.

Merijn Mestdagh emailed me (Aki) and asked

In your blogpost “Comments on Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection” you write that “My recommendation is that if LOO comparison taking into account the uncertainties says that there is no clear winner”…“In case of nested models I would choose the more complex model to be certain that uncertainties are not underestimated”. Could you provide some references or additional information on this claim?

I have to clarify additional conditions for my recommendation for using the encompassing model in the case of nested model

models are used to make predictions the encompassing model has passed model checking the inference has passed diagnostic checks

The Bayesian theory says that we should integrate over the uncertainties. The encompassing model includes the submodels, and if the encompassing model has passed model checking, then the correct thing is to include all the models and integrate over the uncertainties (and I assume that inference is correct). To pass the model checking, it may require good priors on the model parameters, which maybe something many ignore and then they may get bad performance with more complex models. If the models have similar loo performance, the encompassing model is likely to have thicker tails of the predictive distribution, meaning it is more cautious about rare events. I think this is good.

The main reasons why it is so common to favor more parsimonious models

The maximum likelihood inference is common and it doesn’t work well with more complex models. Favoring the simpler models is a kind of regularization. Bad model misspecification. Even Bayesian inference doesn’t work well if the model is bad, and with complex models there are more possibilities for misspecifing the model and the misspecification even in one part can have strange effects in other parts. Favoring the simpler models is a kind of regularization. Bad priors. Actually priors and model are inseparable, so this is kind of same as the previous one. It is more difficult to choose good priors for more complex models, because it’s difficult for humans to think about high dimensional parameters and how they affect the predictions. Favoring the simpler models can avoid the need to think harder about priors. See also Dan’s blog post and Mike’s case study.

But when I wrote my comments, I assumed that we are considering sensible models, priors, and inference, so there is no need for parsimony. See also paper Occam’s razor illustrating the effect of good priors and increasing model complexity.

In “VAR(1) based models do not always outpredict AR(1) models in typical psychological applications.” https://www.ncbi.nlm.nih.gov/pubmed/29745683 we compare AR models with VAR models (AR models are nested in VAR models). When both models perform equally we prefer, in contrast to your blogpost, the less complex AR models because – The one standard error rule (“ Hastie et al. (2009) propose to select the most parsimonious model within the range of one standard error above the prediction error of the best model (this is the one standard error rule)”). Hastie is not using priors or Bayesian inference, and thus he needs the parsimony rule. He may also have implicit cost function… – A big problem (in my opinion) in psychology is the over interpretation of small effects which exacerbates by using complex models. I fear that some researchers will feel the need to interpret the estimated value of every parameter in the complex model.

Yes, it is dangerous to over-interpret especially if the estimates don’t include good uncertainty estimates, and even then it’s difficult to make interpretations in case of many collinear covariates.

My assumption above was that the model is used for predictions and I care about best possible predictive distribution.

The situation is different if we add cost of interpretation or cost of measurements in the future. I don’t know literature analysing what is a cost of interpretation for AR vs VAR model, or cost of of interpretation of additional covariate in a model, but when I’m considering interpretability I favor smaller models with similar predictive performance as the encompassing model. But if we have the encompassing model, then I also favor projection predictive approach where the full model is used as the reference so that the selection process is not overfitting to the data and the inference given the smaller model is conditional the information from the full model (Piironen & Vehtari, 2017). In case of a small number of models, LOO comparison or Bayesian stacking weights can also be sufficient (some examples here and here).