Joël Gombin writes:

I’m wondering what your take would be on the following problem. I’d like to model a proportion (e.g., the share of the vote for a given party at some territorial level) in function of some compositional data (e.g., the sociodemographic makeup of the voting population), and this, in a multilevel fashion (allowing the parameters to vary across space – or time, for that matter). Now, I know I should use a beta distribution to model the response. What I’m less certain of, is how I should deal with the compositional data that makes the predictors. Indeed, if I try to use them in a naive model, they don’t satisfy the independence hypothesis, create multicollinearity and are difficult to interpret. The marginal effects don’t make sense since whenever one variable goes up, other should go down (and at any rate can’t be considered as constant). Until now my strategy has been to remove one of the predictors or to remove the constant. But obviously this is not satisfactory, and one can do better. Another strategy I’ve used, and I think it was inspired by you, was to center-scale the predictors. I wonder what you think about that. However I’m sure one can do better than that. I’ve read about different strategies, for example using principal components or log ratios, but what bothers me with this kind of solution is that I find it very difficult to interpret the parameters when they are transformed this way. So I wondered what startegy you would use in such a case.

My reply:

First off, no, there’s no reason you “should use a beta distribution to model the response.” I know that we often operate this way in practice—using the range of the data space to determine a probability model—but that’s just a convenience thing, certainly not a “should”! For modeling vote shares, you can sometimes get away with a simple additive model, if the proportions are never or rarely close to 0 or 1; another option is a linear model of logistic-transformed vote shares. In any case, if you have zeroes (candidates getting no votes or essentially no votes, or parties not running a candidate in some races), you’ll want to model that separately.

Now for the predictors. The best way to set up your regression model depends on how you’d like to model y given x.

It’s not necessarily wrong to simply fit a linear regression of y on the components of x. For example, suppose x is x1, x2, x3, x4, with the constraint x1 + x2 + x3 + x4 = 1. It could be that the model y = b1*x1 + b2*x2 + b3*x3 + b4*x4 + error will be reasonable. Indeed, there is no “independence hypothesis” that regression coefficients are expected to satisfy. You can just put in the 4 predictors and remove the constant term (unnecessary since they sum to 1) and go from there.

That said, your model might be more interpretable if you reparameterize. How best to do this will depend on context.