Multilevel models for multiple comparisons! Varying treatment effects!

Mark White writes:

I have a question regarding using multilevel models for multiple comparisons, per your 2012 paper and many blog posts. I am in a situation where I do randomized experiments, and I have a lot of additional demographic information about people, as well. For the moment, let us just assume that all of these are categorical demographic variables. I want to not only know if there is an effect of the treatment over the control—but for what groups there is an effect (positive or negative) for. I never get too granular, but I do look at an intersection between two variables (e.g., Black men, younger married people, Republican women) as well as just within one variable (e.g., women, Republicans, married people). The issue I’m running into is that I want to look at the effects for all of these groups, but I don’t want to get mired down by Type I error and go chasing noise. (I know you reject the Type I error paradigm because a null of precisely zero is a straw-man argument, but clients and other stakeholders still want to be sure we aren’t reading too much into something that is not there.) In the machine learning literature, there is a growing interest in causal inference and now a whole topic called “heterogeneous treatment effects.” In the general linear model world in which I was taught as a psychologist, this could also just be called “looking for interactions.” Many of these methods are promising, but I’m finding them difficult to implement in my scenario (I wrote a question here https://stats.stackexchange.com/questions/341402/a-few-questions-regarding-the-practice-of-heterogeneous-treatment-effect-analysi and posed a tailored question about one package to package creators directly here https://github.com/swager/grf/issues/238). Turning back to multilevel models, it seems like I could do this in that framework. Basically, I just create a non-nested/crossed/whatever you’d like to call it model where people are nested in k groups, where k refers to how many demographic variables I have. I simulated data and fit a model here: https://gist.github.com/markhwhiteii/592d40f93b052663f240125fc9b8db99 The questions I have for you are the questions I pose at the bottom of that R script at the GitHub code snippet:

Is this a reasonable approach to examine “heterogenous treatment effects” without getting bogged down by Type I error and multiple comparison problems?

If it is, how can I get confidence intervals from the fitted model object using glmer? You all do so in the 2012 paper, I believe

More importantly, how can I look at the intersection between two groups? The code I sent in that GitHub snippet looks at effects for men, women, Blacks, Whites, millennials, etc. But I coded in an effect for Black men specifically. How could I use that fitted model object to examine the effect for Black men, White women, millennials with kids, etc.? And how would I calculate standard errors for these?

Would all of these things be easier to do in Stan? What would that Stan model look like? Since then I wouldn’t have to figure out how to calculate standard errors for everything, but just sample from the posterior.

My reply:

We’ve been talking about varying treatment effects for a long time. (“Heterogeneous” is jargon for “varying,” I think.)

From 2004: Treatment effects in before-after data.

From 2008: Estimating incumbency advantage and its variation, as an example of a before/after study.

From 2015: The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

From 2015: Hierarchical models for causal effects.

From 2015: The connection between varying treatment effects and the well-known optimism of published research findings.

From 2017: Let’s accept the idea that treatment effects vary—not as something special but just as a matter of course.

I definitely think hierarchical modeling is the way to go here. Think of it as a regression model, in which you’re modeling (predicting) treatment effects given pre-treatment predictors, so the treatment could be more effective for men than for women, or for young people than for old people, etc. You’ll end up with lots of predictors in this regression, and multilevel modeling is a way to control or regularize their coefficients.

In short, the key virtue of multilevel modeling (or some other regularization approach) here is that it allows you to include more predictors in your regression. Without regularization, your estimates would become too noisy, then you’d have to fit a cruder model, not allowing you to study the variation that you care about.

The other thing is, yeah, forget type 1 error rates and all the rest. Abandon the idea that the goal of the statistical analysis is to get some sort of certainty. Instead, accept posterior ambiguity: don’t try to learn more from the data than you really can.

I’ll start with some models in lme4 (or rstanarm) notation. Suppose you have a treatment z and pre-treatment predictors x1 and x2. Then here are some models:

If you have predictors x3 and x4 with multiple levels:

One thing we’re still struggling with, is that there are all these possible models. Really we’d like to start and end with the full model, something like this, with all the interactions:

But these models can be hard to handle. I think we need stronger priors, stronger than the current defaults in rstanarm. So for now I’d build up from the simple model, including interactions as appropriate.

In any case, you can get posterior uncertainties for whatever you want from stan_glmer() in rstanarm; simulations of all the parameters are directly accessible from the fitted object.

You can also aggregate however you want. It’s mathematically the same as Mister P; you’re just working with treatment effects rather than averages.