Alex Konkel writes on a topic that never goes out of style:
I’m working on a data analysis plan and am hoping you might help clarify something you wrote regarding missing data. I’m somewhat familiar with multiple imputation and some of the available methods, and I’m also becoming more familiar with Bayesian modeling like in Stan. In my plan, I started writing that we could use multiple imputation or Bayesian modeling, but then I realized that they might be the same. I checked your book with Jennifer Hill, and the multiple imputation chapter says that if outcome data are missing (which is the only thing we’re worried about) and you use a Bayesian model, “it is trivial to use this model to, in effect, impute missing values at each iteration”. Should I read anything into that “in effect”? Or is the Bayesian model identical to imputation? A second, trickier question: I expect that most of our missing data, if we have any, can be considered missing at random (e.g., the data collection software randomly fails). One of our measures, though, is the participant rating their workload at certain times during task performance. If the participant misses the prompt and fails to respond, it could be that they just missed the prompt and nothing unusual was happening, and so I would consider that data to be missing at random. But participants could also fail to respond because the task is keeping them very busy, in which case the data are not missing at random but in fact reflect a high workload (and likely higher than other times when they do respond). Do you have any suggestions on how to handle this situation?
My reply:
to paragraph 1: Yes, what we’re saying is that if you just exclude the missing cases and fit your model, this is equivalent to the usual statistical inference, assuming missing at random. And then if you want inference for those missing values, you can impute them conditional on the fitted model. If the outcome data are missing not at random, though, then more modeling must be done.
to paragraph 2: Here you’d want to model this. You’d have a model, something like logistic regression, where probability of missingness depends on workload. This could be fit in a Stan model. It could be that strong priors would be needed; you can run into trouble trying to fit such models from data alone, as such fits can depend uncomfortably on distributional assumptions.
Konkel continues:
For the Bayesian/imputation part: Is there a difference between what you describe, which I think would be what’s described in 11.1 of the Stan manual, and trying to combine the missing and non-missing data into a single vector, more like 11.3? We wouldn’t be interested in the value of the missing data, per se, so much as maximizing our number of observations in order to examine differences across experimental conditions. For the workload part: I understand what you’re saying in principle, but I’m having a little trouble imagining the implementation. I would model the probability of missingness depending on workload, but the data that’s missing is the workload! We’re running an experiment, so we don’t have a lot of other predictors with which to model/stand in for workload. Essentially we just have the experimental condition, although maybe I could jury-rig more predictors out of the details of the experiment in the moment when the workload response is supposed to occur…
My reply:
Those sections of the Stan model are all about how to fold in missing data into larger data structures. We’re working on allowing missing data types in Stan so that this hashing is done automatically—but until that’s done, yes, you’ll need to do this sort of trick to model missing data.
But in my point 1 above, if your missingness is only in the outcome variable in your regression, and you assume missing at random, then you can simply exclude the missing cases entirely from your analysis, and then you can impute them later in the generated quantities block. Then it’s there in the generated quantities block that you’ll combine the observed and imputed data.
For the workload model: Yes, you’re modeling the probability of missingness given something that’s missing in some cases. That’s why you’d need a model with informative priors. From a Bayesian standpoint, the model is fit all at once, so it can all be done.
To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.