Two things about power

Statistical power is a central concept in inference and experimental design. Nevertheless, some basic connections between the statistical power of experiments and the robustness of inference are often overlooked in practice. When running experiments with low power you risk much more than missing the detection of true effects. In this post I share two brief lessons about power:

running underpowered experiments can lead to your findings having an unexpectedly high false discovery rate (FDR)

if you have enough power you can (almost) ignore the differences between frequentist and Bayesian approaches to analyzing simple A/B tests

In short, power is a key requirement for robust and reliable inference.

Suppose that you’re a data scientist and that as part of your job you sometimes run and analyze A/B tests. Wary of being tricked by noise, you are always careful to calculate p-values and summarize your findings with confidence intervals. You’re confident that most of your statistically significant results will turn out to be real. Like most practitioners you run significance tests with size $\alpha = 0.05$. If you look back over your past statistically significant results you expect 95% of them to be real findings, so that only an acceptably low 5% of them will be false discoveries. In other words, when you say you’ve found something you expect that you’re right about 95% of the time.

But then a colleague points out that, in retrospect, less than 50% of your past findings have actually stood up to further testing. Their appraisal of your reliability is much closer to a coin toss than to 95%. You’re surprised by this claim. It appears that the two claims are in conflict: after all, you were a good statistical citizen and carefully controlled the type I error rate in your experiments. But it turns out to be very possible for both to be true, and the issue stems from confusing the type I error rate with the false discovery rate. More on that in the next section.

In fact, not only is it possible to be very unreliable in identifying statistically significant results while controlling the type I error rate, it is even likely if you overlook the power of your experiments. Far from an imaginary scenario, this is actually a very serious problem in many scientific disciplines. For the interested reader, there are a number of approaches for directly controlling the FDR[1].

Bayes’ rule and a toy model

So what’s going on? A toy model (described here and elsewhere) is helpful for resolving the apparent contradiction. In addition to the significance level of your testing procedure, the rate at which you make false discoveries will depend critically on both:

the power: the probability that you will detect an effect given that there is one

the prior odds: of the hypotheses that you consider, what proportion will correspond to actual non-null effects?

This is nicely illustrated by a simple calculation. But first, let’s review the basics of hypothesis testing.

Hypothesis testing

Suppose you are testing a hypothesis $H_1$ against a null $H_0$, and will reject $H_0$ if a test statistic $T$ exceeds some critical threshold $c$. A typical example of $T$ would be the (standardized) difference in an outcome variable between two cells of an experiment. For example, if the difference is big enough when comparing two algorithms, we will claim that Algorithm A is better than Algorithm B. The vast majority of one-sided frequentist testing fits this general pattern. By carefully choosing the threshold $c$ you can bound the type I error rate $\alpha$ you are willing to tolerate:

$$\Pr(T > c \mid H_0) \le \alpha.$$
The goal of the testing procedure, and the p-value, is essentially to answer the question: “Is our data inconsistent with the null hypothesis?” If the data is too inconsistent we take it as a proof by contradiction and reject our null hypothesis $H_0$. Setting $\alpha$ allows us to control the chance that we will mistakenly reject $H_0$ when it is true. Of course, being inconsistent with the null hypothesis does not necessarily imply being more consistent with the alternative hypothesis $H_1$. Indeed, the fact that the p-value doesn’t depend on $H_1$ in any way (except through defining the rejection region, e.g., whether the test is one- or two-sided) but only on $H_0$ is a common criticism of the frequentist approach to testing. Nonetheless, this formulation allows us to control the type I error rate of the testing procedure.

To simplify the exposition for the remainder of the post we will assume that $H_1$ is a point hypothesis that specifies a single value for the unknown parameter. In practice it is much more common to consider composite hypotheses that consist of many possible values (e.g., the lift is positive), but we restrict ourselves to the simpler case of a point hypothesis without changing the lessons of this post.

In the case where $H_1$ is a point hypothesis, the power of our experiment is the probability that we successfully reject the null hypothesis when it is not true:

$$\text{power} = \Pr(T > c \mid H_1).$$
For example, if Algorithm A is, in reality, much better than Algorithm B, we would expect that the difference between the two cells is much more likely to be large than if there were no difference between the two algorithms (the null case). The difference in the probability of $T > c$ between $H_0$ and $H_1$ is what allows us to differentiate between them.

Bayes’ rule

Suppose now that we can assign a prior probability to whether the null hypothesis is true:

$$\Pr(H_0) = \pi_0 \qquad \text{and} \qquad \Pr(H_1) = 1 - \pi_0.$$
In addition to thinking about this as the prior probability for this particular experiment, we could view it as reflecting the general proportion of false null hypotheses among the hypotheses that we end up testing. This of course depends on your domain and the kinds of questions that you are asking. (For example: how many tested changes to your website actually impact conversion? How many new algorithms make clients happier?)

Given the significance level, the power, and the prior odds, we can now use Bayes’ rule to compute the probability that a finding labeled as significant is a false discovery as:

$$\Pr(H_0 \mid T > c) = \frac{\Pr(T > c \mid H_0)\,\Pr(H_0)}{\Pr(T > c \mid H_0)\,\Pr(H_0) + \Pr(T > c \mid H_1)\,\Pr(H_1)} = \frac{\alpha\,\pi_0}{\alpha\,\pi_0 + \text{power} \times (1 - \pi_0)},$$
which can be compactly rewritten in terms of odds ratios:

$$\frac{\Pr(H_0 \mid T > c)}{\Pr(H_1 \mid T > c)} = \frac{\alpha}{\text{power}} \times \frac{\pi_0}{1 - \pi_0}.$$
For the probability of making a false discovery to be low we need the posterior odds of $H_0$ to be low. This depends on both the ratio of size to power, $\alpha/\text{power}$, and the prior odds, $\pi_0/(1-\pi_0)$. For a given experiment, the lever most under our control is usually the power, which is a function of the sample size (power increasing with sample size).

To illustrate it is helpful to consider an extreme. Suppose that you ran an experiment with even prior odds ($\pi_0 = 0.5$) and low power, with $\text{power} = \alpha = 0.05$. While you would successfully constrain the type I error rate to $\alpha = 0.05$, the probability of making a false discovery would be:

$$\Pr(H_0 \mid T > c) = \frac{0.05 \times 0.5}{0.05 \times 0.5 + 0.05 \times 0.5} = 0.5.$$
Fundamentally, this is because our test statistic has the same probability of exceeding $c$ under both $H_0$ and $H_1$. To limit the probability of a false discovery to, say, 0.10 would require a power of at least:

$$\text{power} \ge \frac{1 - 0.10}{0.10} \times \alpha \times \frac{\pi_0}{1 - \pi_0} = 9 \times 0.05 \times 1 = 0.45.$$
The expression for the posterior odds also means that the type of questions you ask can work against you. For example, if you only test “moonshot” alternative hypotheses that are unlikely to be true, this will increase the posterior odds of a false discovery via the prior odds ratio $\pi_0/(1-\pi_0)$.
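To make the toy model concrete, here is a minimal Python sketch of the Bayes’ rule calculation above; the function name and the example numbers are illustrative, not from the original derivation.

```python
def false_discovery_prob(alpha, power, prior_null):
    """P(H0 | significant result) in the toy model, via Bayes' rule."""
    p_sig_given_null = alpha          # P(T > c | H0): the significance level
    p_sig_given_alt = power           # P(T > c | H1): the power
    numerator = p_sig_given_null * prior_null
    denominator = numerator + p_sig_given_alt * (1 - prior_null)
    return numerator / denominator

# Even prior odds, power no better than the significance level:
print(false_discovery_prob(alpha=0.05, power=0.05, prior_null=0.5))  # 0.5
# Even prior odds, a conventionally powered experiment:
print(false_discovery_prob(alpha=0.05, power=0.80, prior_null=0.5))  # ~0.06
# "Moonshot" hypotheses that are rarely true:
print(false_discovery_prob(alpha=0.05, power=0.80, prior_null=0.9))  # ~0.36
```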

The lesson

Controlling your type I error rate is not enough to guarantee a reasonable FDR. Power and the plausibility of questions that you ask play an equally important role.

How much power is enough?

In the previous example we saw the importance of running adequately powered experiments. The conventional, if arbitrary, goal is to aim for a power of at least 0.8:

$$\Pr(T > c \mid H_1) \ge 0.8.$$
But power is typically a function of effect size. Often the entire point of an experiment is to estimate the size of an effect—how could we know it in advance? Fundamentally we often can’t. But we can leverage prior information and domain knowledge. For example, if we’ve run similar experiments in the past we might have an approximate idea of the size of the effect we are trying to detect. We can use that as a benchmark when calculating power.

“Prior information,” “domain knowledge”: this is all starting to sound a bit Bayesian. Indeed, selecting effect sizes for a power analysis can have a very Bayesian character. The connection runs deeper, as we’ll see in the next section.

A good frequentist would never interpret a p-value as the probability that the null hypothesis is true. But it can be enormously tempting. And despite all your efforts to the contrary it is likely that many of your colleagues don’t appreciate the distinction.

So, really, how wrong is it to treat a p-value as the posterior probability that the null hypothesis is true (or, equivalently, one minus the p-value as the probability that there is a real effect)? In general, it’s bad. But in some cases a p-value is a very good approximation to a posterior probability. Here we examine that approximation in a common testing scenario.

The two-sample problem

Suppose we want to compare the average of an outcome variable across two groups, as in an A/B test. Assume that we observe iid data from two samples as:

$$X_1, \ldots, X_{n_X} \overset{\text{iid}}{\sim} F_X \qquad \text{and} \qquad Y_1, \ldots, Y_{n_Y} \overset{\text{iid}}{\sim} F_Y$$

for some distributions $F_X$ and $F_Y$ with expectations:

$$\mathbb{E}[X_i] = \mu_X \qquad \text{and} \qquad \mathbb{E}[Y_i] = \mu_Y$$

and finite variances, which for simplicity we will assume that we already know:

$$\operatorname{Var}(X_i) = \sigma_X^2 \qquad \text{and} \qquad \operatorname{Var}(Y_i) = \sigma_Y^2.$$

Note that these are very weak assumptions. We require no assumptions of normality from $F_X$ or $F_Y$ (which is good, because many business outcome variables are not well approximated by a normal distribution). Given the data we define the usual sample means and their difference:

$$\bar{X} = \frac{1}{n_X}\sum_{i=1}^{n_X} X_i, \qquad \bar{Y} = \frac{1}{n_Y}\sum_{i=1}^{n_Y} Y_i, \qquad \hat{\Delta} = \bar{Y} - \bar{X}.$$

Given enough samples, the distribution of $\hat{\Delta}$ is easily shown to be approximately:

$$\hat{\Delta} \sim \mathcal{N}\!\left(\Delta, \sigma^2_{\hat\Delta}\right)$$

with

$$\Delta = \mu_Y - \mu_X \qquad \text{and} \qquad \sigma^2_{\hat\Delta} = \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y},$$

where the normality follows from the Central Limit Theorem.
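As a quick sanity check of the CLT approximation, here is a small simulation sketch in Python; the skewed lognormal outcome distribution and the sample sizes are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_x, n_y, n_reps = 500, 500, 10_000

# Heavily skewed, decidedly non-normal outcomes in both cells
x = rng.lognormal(mean=0.0, sigma=1.0, size=(n_reps, n_x))
y = rng.lognormal(mean=0.1, sigma=1.0, size=(n_reps, n_y))

# Sampling distribution of the difference in means across replications
delta_hat = y.mean(axis=1) - x.mean(axis=1)
z = (delta_hat - delta_hat.mean()) / delta_hat.std()

# Empirical quantiles of the standardized difference line up with N(0, 1)
for q in (0.025, 0.25, 0.5, 0.75, 0.975):
    print(q, round(np.quantile(z, q), 2), round(norm.ppf(q), 2))
```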

Testing

Suppose we run our experiment and find that $\hat{\Delta} > 0$. This is encouraging, but before breaking out the champagne we want to test the null hypothesis $H_0: \Delta = 0$ to make sure we’ve found more than noise. We would reject our null when $\hat{\Delta}/\sigma_{\hat\Delta} > z_{1-\alpha}$, where $z_{1-\alpha}$ is a quantile of the standard normal distribution: $\Phi(z_{1-\alpha}) = 1 - \alpha$, with $\Phi$ denoting the standard normal cumulative distribution function. The p-value associated with this test is:

$$p = 1 - \Phi\!\left(\hat{\Delta}/\sigma_{\hat\Delta}\right)$$

and the associated power for detecting an effect of size $\Delta = \delta$ is then:

$$\text{power}(\delta) = \Pr\!\left(\hat{\Delta}/\sigma_{\hat\Delta} > z_{1-\alpha} \,\middle|\, \Delta = \delta\right) = 1 - \Phi\!\left(z_{1-\alpha} - \delta/\sigma_{\hat\Delta}\right).$$

Our experience with past experiments can help us choose a reasonable $\delta$ when designing our experiment.
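For concreteness, here is a minimal Python sketch of these calculations under the known-variance, one-sided setup above; the function names and the closed-form sample-size helper are my own additions, not part of the original derivation.

```python
import numpy as np
from scipy.stats import norm

def p_value(delta_hat, se):
    """One-sided p-value for H0: Delta = 0 against Delta > 0."""
    return norm.sf(delta_hat / se)

def power(delta, se, alpha=0.05):
    """Probability of rejecting H0 when the true effect is delta."""
    return norm.sf(norm.ppf(1 - alpha) - delta / se)

def n_per_cell(delta, var_x, var_y, alpha=0.05, target_power=0.80):
    """Per-cell sample size (equal allocation) to hit the target power at delta."""
    z = norm.ppf(1 - alpha) + norm.ppf(target_power)
    return int(np.ceil((var_x + var_y) * z**2 / delta**2))

# Example: unit variances, 1000 users per cell, hoping to detect a lift of 0.1
se = np.sqrt(1.0 / 1000 + 1.0 / 1000)
print(p_value(0.1, se))           # ~0.013
print(power(0.1, se))             # ~0.72
print(n_per_cell(0.1, 1.0, 1.0))  # ~1237 per cell for 80% power
```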

A Bayesian analysis with a non-informative prior

What would a Bayesian analysis look like? A popular choice for a non-informative prior is the Jeffreys prior, which is invariant to monotone reparameterizations of the likelihood. Assuming that we know $\sigma_X^2$ and $\sigma_Y^2$, the Jeffreys prior for $\Delta$ in this problem is particularly simple: it is the (improper) flat prior on the real line that implies every value is equally likely. We’ve successfully avoided choosing a prior that assumes any value is more likely than another.

With this choice of prior, a Bayesian analysis and a frequentist analysis are essentially identical. In particular, a Bayesian credible interval (a set that contains the true parameter with high posterior probability) exactly corresponds to the usual frequentist maximum likelihood confidence interval. This equivalence is not true in general and is a consequence of our choice of a flat prior. Among other things it means that the p-value will be exactly equal to the posterior probability that $\Delta \le 0$, i.e., the posterior probability that the effect is zero or negative!
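To spell the equivalence out under the flat prior, the posterior is just the normalized likelihood:

$$\Delta \mid \hat{\Delta} \sim \mathcal{N}\!\left(\hat{\Delta}, \sigma^2_{\hat\Delta}\right),$$

so the 95% credible interval $\hat{\Delta} \pm 1.96\,\sigma_{\hat\Delta}$ coincides numerically with the usual 95% confidence interval, and $\Pr(\Delta \le 0 \mid \hat{\Delta}) = \Phi(-\hat{\Delta}/\sigma_{\hat\Delta})$ is exactly the one-sided p-value.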

A Bayesian analysis with an informative prior

But suppose that we have some prior information that we would like to use. For example, suppose that based on past experiments we know that the typical effect has a size of approximately $\delta$. For reasons of convenience we could choose a normal distribution to summarize our prior beliefs about $\Delta$:

$$\Delta \sim \mathcal{N}(0, \tau^2).$$

Note that we are not assuming that our data is normally distributed, just our prior beliefs about $\Delta$. The symmetry of the prior distribution about zero reflects our uncertainty about the sign of the effect, while the standard deviation $\tau$ reflects our expectations about its size. In particular, noting that:

$$\Pr(|\Delta| \le 2\tau) \approx 0.95,$$

we could encode our beliefs about the typical effect size by setting $\tau$ to be on the order of $\delta$. You could always entertain other prior distributions (for example, something with heavier tails) but the normal distribution seems a reasonable choice of a symmetric unimodal distribution.

It is also a very convenient choice. The self-conjugacy of the normal distribution means that the posterior distribution of $\Delta$ given the data is also normally distributed:

$$\Delta \mid \hat{\Delta} \sim \mathcal{N}\!\left(\frac{\tau^2}{\tau^2 + \sigma^2_{\hat\Delta}}\,\hat{\Delta},\; \frac{\tau^2 \sigma^2_{\hat\Delta}}{\tau^2 + \sigma^2_{\hat\Delta}}\right).$$

With this posterior distribution, it is then easy to calculate the probability that one cell is doing better than the other:

$$\Pr(\Delta > 0 \mid \hat{\Delta}) = \Phi(\tilde{z}),$$

where

$$\tilde{z} = \frac{\hat{\Delta}/\sigma_{\hat\Delta}}{\sqrt{1 + \sigma^2_{\hat\Delta}/\tau^2}}.$$
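A minimal Python sketch of this conjugate update; the function and variable names, and the example numbers, are illustrative assumptions.

```python
from scipy.stats import norm

def posterior(delta_hat, se, tau):
    """Posterior mean and sd of Delta under the N(0, tau^2) prior."""
    shrinkage = tau**2 / (tau**2 + se**2)
    post_mean = shrinkage * delta_hat
    post_sd = (shrinkage * se**2) ** 0.5   # sqrt of tau^2 * se^2 / (tau^2 + se^2)
    return post_mean, post_sd

def prob_positive(delta_hat, se, tau):
    """P(Delta > 0 | data): the probability the test cell is actually better."""
    post_mean, post_sd = posterior(delta_hat, se, tau)
    return norm.cdf(post_mean / post_sd)

# Example: observed lift of 0.1 with standard error 0.045 and prior sd 0.1
print(prob_positive(0.1, 0.045, 0.1))  # ~0.98
```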
Meeting in the middle

Two things are immediately apparent when comparing the frequentist and Bayesian analyses. The first is that one minus the p-value provides an upper bound on the posterior probability that $\Delta > 0$:

$$\Pr(\Delta > 0 \mid \hat{\Delta}) = \Phi(\tilde{z}) \le \Phi\!\left(\hat{\Delta}/\sigma_{\hat\Delta}\right) = 1 - p$$

because $\sqrt{1 + \sigma^2_{\hat\Delta}/\tau^2} \ge 1$. The second is that if $\sigma^2_{\hat\Delta} \ll \tau^2$, as would happen if $n_X$ and $n_Y$ were both very large, then $\tilde{z} \approx \hat{\Delta}/\sigma_{\hat\Delta}$ and:

$$\Pr(\Delta > 0 \mid \hat{\Delta}) \approx \Phi\!\left(\hat{\Delta}/\sigma_{\hat\Delta}\right) = 1 - p.$$

In this case the p-value and one minus the posterior probability that $\Delta > 0$ are approximately the same!

Connection to power: What does it mean for $\sigma^2_{\hat\Delta} \ll \tau^2$?

Recall that $\sigma^2_{\hat\Delta}$ is the variance of the estimator $\hat{\Delta}$ conditioned on $\Delta$, and that $\sigma_{\hat\Delta}$ is its standard error. The condition $\sigma^2_{\hat\Delta} \ll \tau^2$ requires that the standard error of your estimator be much smaller than the standard deviation of your prior. Obtaining enough data so that $\sigma_{\hat\Delta} \ll \tau$ is then similar to a power calculation: we want the standard error of our estimator to be smaller than the typical effect size we expect to find.

Recall that the power for detecting an effect of size $\delta$ is $1 - \Phi(z_{1-\alpha} - \delta/\sigma_{\hat\Delta})$. This power is large only when $\delta/\sigma_{\hat\Delta}$ is large, and hence when $\sigma_{\hat\Delta}$ is small relative to $\delta$. This is equivalent to the condition $\sigma_{\hat\Delta} \ll \tau$ if $\delta \approx \tau$, that is, if we power our experiment to detect the typical effect size encoded by our prior. This means that if your experiment is sufficiently powered to detect effect sizes that you believe to be plausible, you will likely be able to treat a frequentist confidence interval as a credible interval. And recall from the discussion on FDR that if you don’t have enough power to detect the effects you think are plausible you have other problems.

In particular, for the test to have power of at least $1 - \beta$ against an effect of size $\delta$ we would require:

$$\frac{\delta}{\sigma_{\hat\Delta}} \ge z_{1-\alpha} + z_{1-\beta},$$

which using $\delta \approx \tau$ would give the bound:

$$\frac{\sigma_{\hat\Delta}}{\tau} \le \frac{1}{z_{1-\alpha} + z_{1-\beta}}.$$

This brings us to the punchline: we can bound the posterior probability using a (slightly) modified p-value! In particular:

$$\Pr(\Delta \le 0 \mid \hat{\Delta}) = 1 - \Phi(\tilde{z}) \le 1 - \Phi\!\left(\frac{\hat{\Delta}/\sigma_{\hat\Delta}}{\sqrt{1 + (z_{1-\alpha} + z_{1-\beta})^{-2}}}\right),$$

which is the calculation for a p-value associated with the slightly reduced z-score $\left(\hat{\Delta}/\sigma_{\hat\Delta}\right)\big/\sqrt{1 + (z_{1-\alpha} + z_{1-\beta})^{-2}}$.
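Here is a small Python sketch comparing the three quantities; the inputs are illustrative assumptions, and the final bound only applies when the experiment really is powered against effects of size roughly $\tau$ (i.e., $\sigma_{\hat\Delta}/\tau \le 1/(z_{1-\alpha} + z_{1-\beta})$).

```python
import numpy as np
from scipy.stats import norm

# Illustrative inputs (assumptions, not values from the post)
delta_hat = 2.0            # observed difference in means
se = 1.0                   # standard error of delta_hat
tau = 3.0                  # prior standard deviation of the effect
alpha, beta = 0.05, 0.20   # size and one minus the target power

z = delta_hat / se

# Frequentist one-sided p-value
p_value = norm.sf(z)

# Exact posterior probability of a null (or negative) effect under N(0, tau^2)
post_null = norm.sf(z / np.sqrt(1 + (se / tau) ** 2))

# Power-based bound: uses only alpha and beta, not tau itself
bound = norm.sf(z / np.sqrt(1 + (norm.ppf(1 - alpha) + norm.ppf(1 - beta)) ** -2))

print(p_value, post_null, bound)  # p_value <= post_null <= bound here
```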

Taking liberties with p-values

The consistency between the p-value and the posterior probability for well-powered experiments is not without limitations. But it nonetheless inspires confidence. As Brad Efron put it, “It is always a good sign when a statistical procedure enjoys both frequentist and Bayesian support.” It also has the significant, if regrettably necessary, advantage of making the frequentist results robust to the Bayesian interpretation that is so tempting for statisticians and non-statisticians alike. But this robustness is closely tied to having sufficiently powered your experiment.

Statistical power gives you more than the ability to detect the effects that you are trying to find. As we’ve seen, it plays a key role in determining the false discovery rate of your results, with the risk of specious findings decreasing as power increases. It also provides a bridge between frequentist and Bayesian analyses of experiments, making your findings more robust to interpretation from both perspectives. Taken together, these lead to a simple lesson: When you can, you should run experiments with enough power. And proceed with caution when power is low.

(Much of this content inspired by conversations with Jacob Bien.)