When anyone claims 80% power, I’m skeptical.

A policy analyst writes:

I saw you speak at ** on Bayesian methods. . . . I had been asked to consult on a large national evaluation of . . . [details removed to preserve anonymity] . . . and had suggested treading carefully around the use of Bayesian statistics in this study (basing it on all my expertise taken from a single graduate course taught ten years ago by a professor who didn’t even like Bayesian statistics). I left that meeting determined to learn more myself as well as to encourage Bayesian experts to help us bring this approach to my field of research. . . .

I’ve been asked to review an evaluation proposal and provide feedback on what could help improve the evaluation. The study team is evaluating the effectiveness of a training . . . The study team suggests up front that a great deal of variance exists among these case workers (who come from a number of different agencies, so are clustered under [two factors]). The problem with all of this is that the number is still small [less than 100 people] . . . To recap, we have a believed heterogeneous population . . . whose variation is expected to cluster under a number of factors . . . and all will get this training at the same time . . .

The study team has proposed a GEE model, and their power analysis was based on multiple regression. Their claim on power is that [less than 100] individuals is sufficient to achieve 80% power using multiple regression with no covariates. That is as much information as they give me, so their assumptions about the sample are unclear to me. I want to suggest they go back and actually run the power analysis on the study they are doing, based on the analysis they intend to run. I’m also suggesting they consider GLMM instead of GEE, since they seem to feel that the variance that can be explained by these clusters is meaningful and could inform future trainings.

I recall learning that in this scenario, where you have clustered variance, actual power is less than if you assume homogeneity across your population. In other words, if you estimate power based on a multiple regression and a standard proxy for the expected variance of your sample, you will far overestimate how much power you have. The problem is I can’t remember if this is true, why this is true, what words to google to remind myself, or if I totally made this up and pulled it out of my butt. What I recall was that in my MLM courses they had suggested that MLM would actually achieve 80% power with a smaller sample when the cluster variable explained significant variance.

I modeled all of this for another study where the baseline variance is already known and where a number of related past studies exist. What I found there was that if you just plug your N into GPower and use its preset values for variance, the sample it suggests is needed to achieve 80% power is far lower than what you get if you put in accurate values for variance. And if you move from GPower to an R power-analysis script where you can explicitly model power for GLMM and put in good estimates for all of this, you get a number somewhere in between the two. This is what I think they should do, but I don’t want to send them on a wild goose chase.

Here’s my reply:

  1. When anyone claims 80% power, I’m skeptical. (See also here.) Power estimates are just about always set up in an optimistic way, with the primary goal of getting the study approved. I’d throw out any claim of 80% power. Instead, start with the inputs: the assumptions about effect size and variation that were used to get that 80% power claim. I recommend you interrogate those assumptions carefully. In particular, guessed effect sizes are typically drawn from an existing literature that overestimates effect sizes (Type M errors arising from selection on statistical significance), and these overestimates can be huge. That’s perhaps no surprise, given that, in designing their studies, researchers have every incentive to use high estimates for the effectiveness of their interventions. (The quick calculation after point 3 below shows how much the assumed effect size matters.)

  2. More generally, I’m not so interested in “power,” as the concept is tied to the view that the goal of a study is to get statistical significance and then make claims with certainty. I think it’s better to accept ahead of time that, even after the study is done, residual uncertainty will remain. If you don’t make statistical significance the goal, there’s less pressure to cheat to get statistical significance, less overconfidence if the study happens to result in statistical significance, and less disappointment if the results end up not statistically significant.

  3. To continue with the theme: I don’t like talking about “power,” but I agree with the general point that it’s a bad idea to do “low-power studies” (or, as I’d say, studies where the standard error of the parameter estimate is as high as, or higher than, the true underlying effect size). The trouble is that a low-power study gives a noisy estimate, which, if it does happen to be statistically significant, will overestimate the treatment effect.

Sometimes people think a low-power study isn’t so bad, in a no-harm, no-foul sort of way: if a study has low power, it’ll probably fail anyway. But in some sense the big risk with a low-power study is that it apparently succeeds, leading people into a false sense of confidence about the effect size. That’s why we sometimes talk about statistical significance as a “winner’s curse” or a “deal with the devil.”
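
To make points 1 and 3 concrete, here is a toy calculation in R. The numbers are made up rather than taken from the proposal in question; the point is just how sensitive an “80% power” claim is to the assumed effect size:

```
# Two-group comparison, 50 people per group, outcomes scaled to sd = 1.
# The only thing that changes between the two lines is the assumed effect size.
power.t.test(n = 50, delta = 0.6, sd = 1)$power   # roughly 0.84: claim 80% power
power.t.test(n = 50, delta = 0.2, sd = 1)$power   # roughly 0.17: hopeless
```

If that 0.6 came from a published literature subject to Type M errors, the second line may be closer to reality, and the claimed power evaporates.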

  4. To return to your technical question: Yes, in general a multilevel analysis (or, for that matter, a correctly done GEE) will give a lower power estimate than a corresponding observation-level analysis that does not account for clustering; the design-effect calculation after this list gives a sense of how big the difference can be. There are some examples of power analysis for multilevel data structures in chapter 20 of my book with Jennifer Hill.

  5. When in doubt, I recommend fake-data simulation; a sketch is below. This can take some work—you need to simulate predictors as well as outcomes, and you need to make assumptions about variance components, correlations, etc.—but in my experience the effort to make those assumptions is well worth it in clarifying one’s thinking.
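
To put a rough number on point 4: a standard back-of-the-envelope adjustment is the design effect, which says how much the clustering inflates the variance of your estimate relative to a simple random sample. The cluster size and intraclass correlation below are assumptions for illustration, not numbers from the proposal:

```
# Design effect for a clustered sample: 1 + (m - 1) * ICC,
# where m is the average cluster size and ICC is the intraclass correlation.
n   <- 100     # total individuals
m   <- 10      # assumed average cluster size
icc <- 0.15    # assumed intraclass correlation
deff  <- 1 + (m - 1) * icc   # 2.35
n_eff <- n / deff            # about 43 effective independent observations
```

Under these made-up inputs, the 100 individuals carry roughly the information of 43 independent observations, which is why a power analysis that ignores the clustering can be wildly optimistic.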

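Here is a minimal sketch of the fake-data simulation in point 5, written for a simple two-arm design with treatment assigned by cluster. The study described above, where everyone gets the training at once, would need a different data-generating model (pre/post measurements, say), but the workflow is the same: simulate data under assumed inputs, fit the model you intend to fit, and see how often you would detect the effect. All the inputs below (effect size, variance components, cluster counts) are placeholders to be replaced with values that make sense for the actual study.

```
library(lme4)

# Simulate a clustered two-arm study many times and estimate power as the
# proportion of simulations in which the treatment effect is clearly detected.
sim_power <- function(n_clusters = 10, n_per_cluster = 10, effect = 0.3,
                      sd_cluster = 0.5, sd_indiv = 1, n_sims = 200) {
  hits <- replicate(n_sims, {
    cluster <- rep(1:n_clusters, each = n_per_cluster)
    z_cluster <- sample(rep(0:1, length.out = n_clusters))  # half the clusters treated
    z <- z_cluster[cluster]
    a <- rnorm(n_clusters, 0, sd_cluster)[cluster]          # cluster-level noise
    y <- effect * z + a + rnorm(length(z), 0, sd_indiv)     # fake outcomes
    fit <- lmer(y ~ z + (1 | cluster))
    est <- fixef(fit)["z"]
    se  <- sqrt(vcov(fit)["z", "z"])
    abs(est / se) > 2        # crude "statistically significant" cutoff
  })
  mean(hits)
}

sim_power()   # estimated power under these made-up inputs
```

Re-running this with different assumed effect sizes and variance components makes clear which assumptions are driving the answer, which is most of the value of the exercise.
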
Which elicited the following response:

You raise concerns about the broader funding system itself. The grants require that awardees demonstrate in their proposal, through a power analysis, that their analytic method and sample size are sufficient to detect the expected effects of their intervention. Unfortunately, it is very common for researchers to present something along these lines:

– We are looking to improve parenting practices through intervention X.
– Past interventions have seen effect sizes with estimates that range from 0.1 to 0.6 (and in fact, the literature would show effect sizes ranging from -0.6 to 0.6, but those negative effect sizes are left out of the proposal).
– The researcher will conclude that their sample is sufficient to detect an effect size of 0.6 with 80% power and that this is reasonable.
– The study is good to go.

I would look at that and generally conclude the study is underpowered and whatever effects they do or don’t find are going to be questionable. The uncertainty would be too great.

I like your idea of doing simulations. While I can’t recommend that awardees do that (it would be considered too onerous), I can suggest that evaluators consider the option. I’ve done these for education studies. My experience is that if it is a well-studied field and the assumptions can be more readily inferred, they are not that time-consuming to do.

I assume a problem with making these assumptions about effect size and variance would be the file-drawer effect: the lack of published null or negative studies, and the inadvertent or intentional ignoring of those that are published. [Actually, I think that forking paths—the ability of researchers to extract statistical significance from data that came from any individual study—is a much bigger deal than the file drawer. — ed.] I do think that some researchers take an erroneous view of these statistics: if no effect or a negative effect is found, it’s not meaningful; if a positive effect is found, it is meaningful and interpretable.

P.S. I have been highly privileged, first in receiving a world-class general education in college, then in receiving a world-class education in statistics in graduate school, then in having jobs for the past three decades in which I’ve been paid to explore the truth however I’ve seen fit, and during this period to have found hundreds of wonderful collaborators in both theory and application. As a result, I take very seriously the “service” aspect of my job, and I’m happy to give back to the community by sharing tips in this way.