Should we be concerned about MRP estimates being used in later analyses? Maybe. I recommend checking using fake-data simulation.

Someone sent in a question (see below). I asked if I could post the question and my reply on blog, and the person responded:

Absolutely, but please withhold my name because this is becoming a touchy issue within my department.

The boldface was in the original.

I get this a lot. There seems to be a lot of fear out there when it comes to questioning established procedures.

Anyway, here’s the question that the person sent in:

CDC has recently been using your multilevel estimation with post-stratification method to produce county, city, and census tract-level disease prevalence estimates (see https://www.cdc.gov/500cities/). The data source is the annual phone-based Behavioral Risk Factor Surveillance System (n=450k). CDC is not transparent about covariates included in the models used to construct the estimates, but as I understand it they are mostly driven by national individual-level associations between sociodemographic factors and disease prevalence. Presumably, the random effects would not influence a unit’s estimated prevalence much if the sample size from that unit is small (as is true for most cities/counties, and for many census tracts the sample size is zero). I am wondering if you are as troubled as I am by how these estimates are being used. First, websites like County Health Rankings and City Health Dashboard are providing these estimates to the public without any disclaimer that these are not actually random samples of cities/counties/tracts and may not reflect reality. Second, and more problematically, researchers are starting to conduct ecologic studies that analyze the association, for example, between census tract socioeconomic composition and obesity prevalence (It seems quite likely that the study is actually just identifying the individual-level association between income and obesity used to produce the estimates). I’ve now become involved in a couple of projects that are trying to analyze these estimates so it seems as though their use will increase over time. The only disclaimer that CDC provides is that the estimates shouldn’t be used to evaluate policy. Are you more confident about the use of these estimates than I am? I am also wondering if CDC should be more explicit in disclosing their limitations to prevent misuse.

My reply:

Wow, N = 450K. That’s quite a survey. (I know my correspondent called it “n,” but when it’s this big, I think the capital letter is warranted.) And here’s the page where they mention Mister P! And they have a web interface.

I’m not quite sure why you say the website provides the estimate “without any disclaimer.” Here’s one of the displays:

It’s not the prettiest graph in the world—I’ll grant you that—but it’s clearly labeled “Model-based estimates” right at the top.

I agree with you, though, in your concern that if these model-based estimates are being used in later analyses, there’s a risk of reification, in which county or city-level predictors that are used in the model can look automatically like good predictors of the outcomes. I’d guess this would be more of a concern with rare conditions than with something like coronary heart disease where the sample size will be (unfortunately) so large.

The right thing to do next, I think, is some fake-data simulation to see how much this should be a concern. CDC has already done some checking (from their methodology page, “CDC’s internal and external validation studies confirm the strong consistency between MRP model-based SAEs and direct BRFSS survey estimates at both state and county levels.”) and I guess you could do more.

Overall, I’m positively inclined toward these MRP estimates because I’d guess it’s much better than the alternatives such as raw or weighted local averages or some sort of postprocessed analysis of weighted averages. I think those approaches would have lots more problems.

In any case, it’s cool to see my method being used by people who’ve never met me! Mister P is all grown up.

P.S. My correspondent provides further background:

The CDC generates prevalence estimates for various diseases at the county level (or smaller) by applying MRP to the national Behavioral Risk Factor Surveillance System. Unlike for other diseases, they’ve documented their methods for diabetes. Their model defines 12 population strata per county (2 races x 2 genders x 3 age groups) and incorporates random effects for stratum, county, and state. There are no other variables at any level in the model. A number of papers use the MRP-derived data to estimate associations between, for example, PM2.5 and diabetes prevalence. Do you think this is a valid approach? Would it be valid if all of the MRP covariates are included in the model?

My response:

  1. Regarding the MRP model, it is what it is. Including more demographic factors is better, but adjusting for these 12 cells per county is better than not adjusting, I’d think. One thing I do recommend is to use group-level predictors. In this case, the group is county, and lots of county-level predictors will be available that will be relevant for predicting health outcomes.

  2. Regarding the postprocessing using the MRP estimates: Sure, it should be better to fold the two models together, but the two-stage approach (first use MRP to estimate prevalences, then fit another model) could work ok too, with some loss of efficiency. Again, I’d recommend using fake-data simulation to estimate the statistical properties of this approach for the problem at hand.