Do Bayesians Overfit?

TLDR: Yes, and there are precise results, although they are not as well known as they perhaps should be.

Over the last few years I had many conversations in which the statement was made that Bayesians methods are generally immune to overfitting, or at least, robust against overfitting, or—everybody would have to agree, right?—it clearly is better than maximum aposteriori estimation.

Various loose arguments in support include the built-in Bayesian version of Occam’s razor, and the principled treatment of any uncertainty throughout the estimation. However, over the years it has always bothered me that this argument is only made casually and for many years I was not aware of a formal proof or discussion except for the well-known result that in case the model is well-specified the Bayes posterior predictive is risk-optimal.

Until recently! A colleague pointed me to a book written by Sumio Watanabe (reference and thanks below) and this blog post is the result of the findings in this nice book.

In machine learning, the concept of overfitting is very important in practice. In fact, it is perhaps the most important concept to understand when learning from data. Many practices and methods aim squarely at measuring and preventing overfitting. The following are just a few examples:

Regularization limits the capacity of a machine learning model in order to avoid overfitting;

Separating data into a training, validation, and test set, is best practice to assess generalization performance and to avoid overfitting; Dropout, a regularization scheme for deep neural networks, is popularly used to mitigate overfitting.

But what is overfitting? Can we formally define it?

Defining Overfitting

The most widely used loose definition is the following.

Overfitting is the gap between the performance on the training set and the performance on the test set.

This definition makes a number of assumptions:

The data is independent and identically distributed and comes separated in a training set and a test set.
There is a clearly defined performance measure.
The test set is of sufficient size so that the performance estimation error is negligible.

For example, in a classification task the performance measure may be the classification error or the softmax-cross-entropy loss (log-loss).

However, in practice this definition of overfitting can be too strict: in many cases we care about minimizing generalization error, not about the difference between generalization error and training error. For deep learning in particular, the training error is often zero for the model that is selected as the one minimizing validation error. The recent paper (Belkin, Ma, Mandal, “To Understand Deep Learning We Need to Understand Kernel Learning”, ICML 2018) is studying this phenomenon.

Is overfitting relevant for Bayesians as well?

The Bayesian Case

(This paragraph summarizes Bayesian prediction and contains nothing new or controversial.)

Since de Finetti, a subjective Bayesian measures the performance of any model by the predicted likelihood of future observables. Given a sample (D_n=(x_1, \dots, x_n)), generated from some true data-generating distribution (x_i \sim Q), a Bayesian proceeds by setting up a model (P(x|\theta)), where (\theta) are unknown parameters of the model, with prior (P(\theta)). The data reveals information about (\theta) in the form of a posterior distribution (P(\theta|D_n)). The posterior distribution over parameters is then useful in constructing our best guess of what we will see next, in the form of the posterior predictive distribution,

$P(x | D_n) = \int P(x | \theta) \, P(\theta | D_n) \,\textrm{d}\theta.$

Note that in particular the only degrees of freedom are in the choice of model (P(x

\theta)) and in the prior (P(\theta)).

How good is (P(x|D_n))? A Bayesian cares about the predicted likelihood of future observables, which corresponds to the cross-entropy loss, and is also called the Bayesian generalization loss,

$B_g = -\mathbb{E}_{x \sim Q}[\log P(x|D_n)].$

Likewise, given our training sample (D_n), we can define the Bayesian training loss,

$B_t = - \frac{1}{n} \sum_{i=1}^n \log P(x_i | D_n).$

However, the concept of a “Bayesian training loss’’ is unnatural to a Bayesian because it uses the data twice: first, to construct the posterior predictive (P(x|D_n)), and then a second time, to evaluate the likelihood on (D_n). Nevertheless, we will see below that the concept, combined with the so called Gibbs training loss, is a very useful one.

The question of whether Bayesians overfit is then clearly stated as:

$B_t \ll B_g\,?$

We consider an elementary experiment of sampling data from a Normal distribution with unknown mean.

\begin{eqnarray} \mu & \sim & \mathcal{N}(\mu_0, \sigma^2_0),
x_i & \sim & \mathcal{N}(\mu, \sigma^2), \qquad i=1,\dots,n. \end{eqnarray}

In this case, exact Bayesian inference is feasible because the posterior and posterior-predictive distributions have a simple closed-form solution, each of which is a Normal distributions.

For varying sample size (n) we perform 2,000 replicates of generating data according to the above sampling procedure and evaluate the Bayesian generalization loss and the Bayesian training loss. The following plot shows the average errors over all replicates.

Clearly (B_t < B_g), and there is overfitting.

What about non-Bayesian estimators, such as MAP estimation and maximum likelihood estimation?

Maximum Aposteriori (MAP) and Maximum Likelihood (MLE)

Two popular point estimators are the maximum aposteriori estimator (MAP), defined as

$\hat{\theta}_{\textrm{map}} = \textrm{argmax}_{\theta} P(\theta | D_n),$

and the maximum likelihood estimator (MLE), defined as

$\hat{\theta}_{\textrm{mle}} = \textrm{argmax}_{\theta} \sum_{i=1}^n \log P(x_i|\theta).$

Each of these estimators also has a generalization loss and a training loss. In our experiment the MLE estimator is dominated by the MAP estimator, which is in turn dominated by the Bayesian estimate, which is optimal in terms of generalization loss.

The gap between the MLE generalization error (top line, dotted) and the MAP generalization error (black dashed line) is due to the use of the informative prior about (\mu). The gap between the Bayesian generalization error (black solid line) and the MAP generalization error (black dashed line) is due to the Bayesian handling of estimation uncertainty. In this simple example the information contained in the prior is more important than the Bayesian treatment of estimation uncertainty.

Can we estimate (B_g) except via prediction on hold-out data?

It turns out that we can estimate (B_g) to order (O(n^{-2})) from just our training set. This is useful because it provides us an estimate of our generalization performance, and hence can be used for model selection and hyperparameter optimization.

The Widely Applicable Information Criterion (WAIC), invented by Sumio Watanabe, estimates the Bayesian generalization error,

$\textrm{WAIC} = B_t + 2(G_t - B_t),$

where (G_t) is the Gibbs training loss, defined as the average loss of individual models from the posterior,

$G_t = -\mathbb{E}_{\theta \sim P(\theta|D_n)}\left[\frac{1}{n} \sum_{i=1}^n \log P(X_{n+1} = x_i|\theta)\right].$

Due to Jensen’s inequality we always have (G_t > B_t) so the right hand summand in (\textrm{WAIC}) is always positive. Importantly, given a training set we can actually evaluate (\textrm{WAIC}), but we cannot evaluate (B_g).

Watanabe showed that

$\mathbb{E}[B_g] = \mathbb{E}[\textrm{WAIC}] + O(n^{-2}).$

Evaluating the previous experiment we can see that (\textrm{WAIC}) indeed accurately estimates (B_g).

Even better, Watanabe also showed that (\textrm{WAIC}) continues to estimate the Bayesian generalization error accurately in singular models and in case the model is misspecified. Here, singular means that there is not a bijective map between model parameters and distributions. Misspecified means that no parameter exists which matches the true data-generating distribution.

WAIC with Approximate Posteriors

The above definition of (\textrm{WAIC}) assumes an exact Bayesian posterior. In practice we may not have the luxury of being able to compute the exact posterior, and instead use approximate inference methods such as Markov chain Monte Carlo (MCMC) to get sample based approximations to the posterior, or variational Bayes (VB) to get approximations within a parametric family of distributions.

For such approximations WAIC can degenerate considerably. For example, consider a posterior family

$\mathcal{U}_v := \{ \mathcal{N}(\mu, C) \, | \, \mu \in \mathbb{R}^d, \, 0 \prec C \prec vI \},$

where a scalar (v > 0) bounds the eigenvalues of (C) from above. Doing variational Bayes with (\mathcal{U}_{\epsilon}) then corresponds to MAP estimation and the difference (G_t - B_t) will be close to zero, yet (B_t) can be very small. In this case, applying the (\textrm{WAIC}) equation would suggest that (B_g \approx B_t), so we severely underestimate the Bayesian generalization loss.

The same holds true for MCMC: even if (\theta^{(1)}, \dots, \theta^{(K)}) would be exact samples from (P(\theta|D_n)) and we approximate the exact posterior by the set of these parameters, the estimate of (B_t) is now too large so (G_t - B_t) is underestimated.

Clearly, Bayesians do overfit, just like any other procedure does.

The following is a list of relevant literature with some comments.

Acknowledgements. I thank Ryota Tomioka for exciting discussions and for pointing me to Watanabe’s book. Thanks also to Ferenc Huszár and Richard Turner for feedback on a draft of the article and to Vitaly Kurin and Artem for a correction.