Divisibility in statistics: Where is it needed?

The basic formula of Bayesian inference is p(parameters | data) proportional to p(parameters) * p(data | parameters). And, for predictions, p(predictions | data) = integral over parameters of p(predictions | parameters, data) * p(parameters | data).
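
To make these formulas concrete, here is a minimal sketch in Python of a toy binomial model computed on a grid (a hypothetical example, not one discussed in the post), in which the posterior and the posterior predictive distribution are obtained without ever partitioning the data:

```python
import numpy as np

# Toy binomial example: "parameters" is a single theta, "data" is the pair (y, n)
# treated as one object; no partition into individual data points is needed.
y, n = 7, 10                            # observed successes out of n trials
theta = np.linspace(0.001, 0.999, 999)  # grid over the parameter

prior = np.ones_like(theta)             # flat prior p(theta)
lik = theta**y * (1 - theta)**(n - y)   # p(data | theta)
post = prior * lik
post /= post.sum()                      # p(theta | data) on the grid

# p(prediction | data) = sum over theta of p(prediction | theta) * p(theta | data):
# here, the probability that one new trial is a success
p_pred = np.sum(theta * post)
print(p_pred)   # about (y + 1) / (n + 2) = 0.667 under the flat prior
```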

In these expressions (and the corresponding simpler versions for maximum likelihood), “parameters” and “data” are unitary objects.

Yes, it can be helpful to think of the parameter object as a list or vector of individual parameters; and yes, it can be helpful to think of the data object as a list or vector of individual data points (in the simplest case, an “iid model”), but this is not necessary to do Bayesian inference. Similarly, we can check our models using posterior predictive comparisons of “predictions” with “data” without needing to think formally about partitions of either of these into individual data points.

There are, however, some settings where partitioning is required: statistical concepts that make no sense without a principle of divisibility of the data.

  1. The first such setting is cross-validation. As Aki and I discuss in our papers here and here, concepts such as leave-one-out cross-validation and AIC are only defined in the context of divisions of the data. The “p(parameters | data)” framework is not rich enough to encompass cross-validation. We don’t need iid data or even independent data, but we do need a structure in which “the data” are divided into N “data points.”

Once some division of the data is required, we realize that the particular division we use is a choice: not a choice of the probability model, but a choice of how the model will be interpreted and evaluated. In a hierarchical model, we can choose to cross-validate on individual data points or on groups, and these two options have different implications.
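
To make that choice concrete, here is a toy sketch in Python (hypothetical grouped data and a deliberately crude stand-in for model fitting, not anything from the papers mentioned above) showing that leave-one-point-out and leave-one-group-out cross-validation score the same model against different prediction tasks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grouped data: 5 groups, 20 observations each, with group-level shifts
n_groups, n_per = 5, 20
group_means = rng.normal(0, 1, n_groups)
groups = np.repeat(np.arange(n_groups), n_per)
y = group_means[groups] + rng.normal(0, 1, n_groups * n_per)

def predict(train_y):
    """Stand-in for a fitted model: predict with the training mean, known sd = 1."""
    return train_y.mean()

def log_score(y_held, mu, sd=1.0):
    """Gaussian log predictive density of the held-out observations."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sd**2)
                        - (y_held - mu) ** 2 / (2 * sd**2)))

# Leave-one-point-out cross-validation: the "data points" are individual observations
loo = sum(log_score(y[i:i+1], predict(np.delete(y, i))) for i in range(len(y)))

# Leave-one-group-out cross-validation: the "data points" are whole groups
logo = sum(log_score(y[groups == g], predict(y[groups != g])) for g in range(n_groups))

print(loo, logo)   # different divisions of the data, different answers
```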

  2. Another setting is exchangeability. A model can be spoken of as exchangeable only with respect to some list or sequence of variables. It does not make sense to say p(theta) is exchangeable; that property applies to p(theta_1,…,theta_J).

  3. Divisibility is also necessary for asymptotics. For example, consider the statement that it’s hard to compute the Ising model. It’s “hard to compute” as N increases; the Ising model with N=10 is easy to compute! To make statements about scalability, we need some increasing N: some divisibility of the data, or of the model, or both.
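
To see the role of N here, a toy sketch of exact computation of an Ising partition function by brute-force enumeration over all 2^N spin configurations (a one-dimensional chain, chosen only for simplicity): trivial at N=10, infeasible as N grows.

```python
import itertools
import math

def ising_partition_function(N, J=1.0, h=0.0, beta=1.0):
    """Brute-force partition function for a 1-D Ising chain of N spins.
    The sum has 2^N terms, which is why exact enumeration is instant at
    N = 10 but hopeless as N increases."""
    Z = 0.0
    for spins in itertools.product([-1, 1], repeat=N):
        energy = -J * sum(spins[i] * spins[i + 1] for i in range(N - 1))
        energy -= h * sum(spins)
        Z += math.exp(-beta * energy)
    return Z

print(ising_partition_function(10))     # 1024 terms: instant
# ising_partition_function(100)         # 2^100 terms: not going to happen
```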

So what?

The point of this is that a Bayesian model is not just its posterior distribution. It’s also the sampling distribution (which is required for predictive checking, as discussed in chapter 6 of BDA) and the division or partition of the data.
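
For example, a posterior predictive check in the spirit of BDA chapter 6 needs draws from the sampling distribution p(data | parameters), not just the posterior; here is a toy sketch, continuing the binomial example above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy posterior predictive check for the binomial example above:
# draws from p(theta | data), then replicated data from the sampling
# distribution p(data | theta), compared to the observed data.
y, n = 7, 10
theta_draws = rng.beta(y + 1, n - y + 1, size=4000)   # posterior under a flat prior
y_rep = rng.binomial(n, theta_draws)                  # replicated datasets

# Test statistic: the number of successes itself
p_value = np.mean(y_rep >= y)
print(p_value)   # posterior predictive p-value for T(y) = y
```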