(c) 2017 by Thomas Wiecki
Hierarchical models are underappreciated. Hierarchies exist in many data sets and modeling them appropriately adds a boatload of statistical power (the common metric of statistical power). I provided an introduction to hierarchical models in a previous blog post: “Best Of Both Worlds: Hierarchical Linear Regression in PyMC3”, written with Danne Elbers. See also my interview with FastForwardLabs where I touch on these points.
Here I want to focus on a common but subtle problem when trying to estimate these models and how to solve it with a simple trick. Although I had been somewhat aware of this trick for quite some time, it only recently clicked for me. We will use the same hierarchical linear regression model on the Radon data set from the previous blog post, so if you are not familiar with it, I recommend starting there.
I will then use the intuitions we’ve built up to highlight a subtle point about expectations vs modes (i.e. the MAP). Several talks by Michael Betancourt have really expanded my thinking here.
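Roughly, the setup looks like the following sketch (the file name radon.csv and the variables data, county_idx, and n_counties are assumptions that the later snippets build on):

```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

# Radon data from the previous post; assumed to be a local CSV with
# log_radon, floor, and county columns.
data = pd.read_csv('radon.csv')

# Integer county index for every measurement and the number of counties.
county_idx = data.county.astype('category').cat.codes.values
n_counties = data.county.nunique()
```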
The intuitive specification¶
Usually, hierarchical models are specified in a centered way. In a regression model, individual slopes would be centered around a group mean with a certain group variance, which controls the shrinkage:
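A sketch of the centered model along the lines of the previous post (the exact priors are assumptions):

```python
with pm.Model() as hierarchical_model:
    # Group-level (hyper)priors for intercepts and slopes.
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Centered parameterization: county intercepts and slopes are drawn
    # directly around their group means with group standard deviations.
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Measurement noise and likelihood.
    eps = pm.HalfCauchy('eps', 5)
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps,
                           observed=data.log_radon)
```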
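Sampling with NUTS, initialized via ADVI (the sampler arguments below are assumptions chosen to match the output):

```python
with hierarchical_model:
    # NUTS initialized with ADVI; two chains so we can compute Rhat later.
    hierarchical_trace = pm.sample(5000, njobs=2, n_init=200000)
```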
Auto-assigning NUTS sampler…
Initializing NUTS using advi…
Average ELBO = -1,090.4: 100%|██████████| 200000/200000 [00:37<00:00, 5322.37it/s]
Finished [100%]: Average ELBO = -1,090.4
100%|██████████| 5000/5000 [01:15<00:00, 65.96it/s]
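For a first look at the posterior, a trace plot sketch:

```python
# Trace plot of all model parameters across both chains.
pm.traceplot(hierarchical_trace)
```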
I have seen plenty of traces with terrible convergence but this one might look fine to the unassuming eye. Perhaps sigma_b has some problems, so let’s look at the Rhat:
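A sketch of computing the Gelman-Rubin statistic for sigma_b:

```python
# Gelman-Rubin Rhat across chains; values close to 1 indicate convergence.
print('Rhat(sigma_b) = {}'.format(pm.gelman_rubin(hierarchical_trace)['sigma_b']))
```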
Rhat(sigma_b) = 1.0008132807924583
Not too bad – well below 1.01. I used to think this wasn’t a big deal but Michael Betancourt in his StanCon 2017 talk makes a strong point that it is actually very problematic. To understand what’s going on, let’s take a closer look at the slopes b and their group variance (i.e. how far they are allowed to move from the mean) sigma_b. I’m just plotting a single chain now.
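One way to plot a single chain of sigma_b and the slopes b (a sketch; the chain index and styling are assumptions):

```python
fig, axs = plt.subplots(nrows=2, figsize=(10, 6), sharex=True)

# sigma_b samples from the first chain only.
axs[0].plot(hierarchical_trace.get_values('sigma_b', chains=[0]), alpha=.5)
axs[0].set(ylabel='sigma_b')

# All county slopes b from the same chain, overlaid with low alpha.
axs[1].plot(hierarchical_trace.get_values('b', chains=[0]), alpha=.05)
axs[1].set(ylabel='b', xlabel='iteration')
```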
sigma_b seems to drift into this area of very small values and get stuck there for a while. This is a common pattern and the sampler is trying to tell you that there is a region in space that it can’t quite explore efficiently. While stuck down there, the slopes b_i become all squished together. We’ve entered The Funnel of Hell (it’s just called the funnel; I added the last part for dramatic effect).
The Funnel of Hell (and how to escape it)¶
Let’s look at the joint posterior of a single slope b (I randomly chose the 75th one) and the slope group variance sigma_b.
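A seaborn sketch of that joint plot (axis labels and limits are just choices):

```python
# Joint posterior of the 75th county slope and the slope group variance.
x = pd.Series(hierarchical_trace['b'][:, 75], name='slope b_75')
y = pd.Series(hierarchical_trace['sigma_b'], name='slope group variance sigma_b')
sns.jointplot(x, y, ylim=(0, .7))
```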
This makes sense: as the slope group variance goes to zero (or, said differently, as we apply maximum shrinkage), individual slopes are not allowed to deviate from the slope group mean, so they all collapse onto the group mean.
While this property of the posterior is not problematic in itself, it makes the job extremely difficult for our sampler. Imagine a Metropolis-Hastings sampler exploring this space with a medium step-size (we’re using NUTS here but the intuition carries over): in the wider top region we can comfortably make larger jumps to explore the space efficiently. However, once we move to the narrow bottom region we can change b_75 and sigma_b only by tiny amounts. This causes the sampler to become trapped in that region of space. Most of the proposals will be rejected because our step-size is too large for this narrow part of the space, and exploration will be very inefficient.
You might wonder if we could somehow choose the step-size based on the denseness (or curvature) of the space. Indeed that’s possible and it’s called Riemannian HMC. It works very well but is quite costly to run. Here, we will explore a different, simpler method.
Finally, note that this problem does not exist for the intercept parameters a. Because we can determine individual intercepts a_i with enough confidence, sigma_a is not small enough to be problematic. Thus, the funnel of hell can be a problem in hierarchical models, but it does not have to be. (Thanks to John Hall for pointing this out.)
Reparameterization¶
If we can’t easily make the sampler’s step-size adjust to the region of space, maybe we can adjust the region of space to make it simpler for the sampler? This is indeed possible, and quite simple, with a small reparameterization trick; we will call this the non-centered version.
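Here is a sketch of the non-centered model; the centered definitions are left in as comments for comparison (variable names follow the earlier snippets):

```python
with pm.Model() as hierarchical_model_non_centered:
    mu_a = pm.Normal('mu_a', mu=0., sd=100**2)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=100**2)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Centered (before):
    # a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=n_counties)
    # b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=n_counties)

    # Non-centered: sample standard-normal offsets, then scale and shift them.
    a_offset = pm.Normal('a_offset', mu=0, sd=1, shape=n_counties)
    a = pm.Deterministic('a', mu_a + a_offset * sigma_a)

    b_offset = pm.Normal('b_offset', mu=0, sd=1, shape=n_counties)
    b = pm.Deterministic('b', mu_b + b_offset * sigma_b)

    eps = pm.HalfCauchy('eps', 5)
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps,
                           observed=data.log_radon)
```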
Pay attention to the definitions of a_offset, a, b_offset, and b and compare them to before (commented out). What’s going on here? It’s pretty neat, actually. Instead of saying that our individual slopes b are normally distributed around a group mean (i.e. modeling their absolute values directly), we can say that they are offset from a group mean by a certain value (b_offset; i.e. modeling their values relative to that mean). Now we still have to consider how far from that mean we actually allow things to deviate (i.e. how much shrinkage we apply). This is where sigma_b makes a comeback. We can simply multiply the offset by this scaling factor to get the same effect as before, just under a different parameterization. For a more formal introduction, see e.g. Betancourt & Girolami (2013).
Critically, b_offset and sigma_b are now mostly independent. This will become clearer soon. Let’s first look at whether this transform helped our sampling:
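Sampling the non-centered model the same way (again a sketch; the trace name hierarchical_trace_nc is an assumption):

```python
with hierarchical_model_non_centered:
    hierarchical_trace_nc = pm.sample(5000, njobs=2, n_init=200000)
```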
Auto-assigning NUTS sampler…
Initializing NUTS using advi…
Average ELBO = -1,086.3: 100%|██████████| 200000/200000 [00:42<00:00, 4678.99it/s]
Finished [100%]: Average ELBO = -1,086.3
100%|██████████| 5000/5000 [01:05<00:00, 76.81it/s]
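Again, a trace plot sketch:

```python
# Trace plot of the non-centered model; focus on sigma_b and the slopes b.
pm.traceplot(hierarchical_trace_nc, varnames=['sigma_b', 'b'])
```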
That looks much better, as also confirmed by the joint plot:
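The same joint plot as before, now for the non-centered trace (sketch):

```python
x = pd.Series(hierarchical_trace_nc['b'][:, 75], name='slope b_75')
y = pd.Series(hierarchical_trace_nc['sigma_b'], name='slope group variance sigma_b')
sns.jointplot(x, y, ylim=(0, .7))
```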
To really drive this home, let’s also compare the sigma_b marginal posteriors of the two models:
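One way to overlay the two sigma_b marginals (a sketch; the original figure may have been styled differently):

```python
fig, ax = plt.subplots(figsize=(8, 4))
sns.distplot(hierarchical_trace['sigma_b'], label='Centered', ax=ax)
sns.distplot(hierarchical_trace_nc['sigma_b'], label='Non-centered', ax=ax)
ax.set(xlabel='sigma_b', ylabel='density')
ax.legend()
```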
That’s crazy – there’s a large region of very small sigma_b values that the sampler could not even explore before. In other words, our previous inferences (“Centered”) were severely biased towards higher values of sigma_b. Indeed, if you look at the previous blog post, the sampler never even got stuck in that low region, which led me to believe everything was fine. These issues are hard to detect and very subtle, but they are meaningful, as demonstrated by the sizable difference in posterior means.

But what does this concretely mean for our analysis? Over-estimating sigma_b means that we have a biased (i.e. false) belief that we can tell individual slopes apart better than we actually can. There is less information in the individual slopes than we estimated.
Why does the reparameterized model work better?¶
To more clearly understand why this model works better, let’s look at the joint distribution of b_offset (again the 75th one) and sigma_b:
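A sketch of that plot:

```python
# Joint posterior of the 75th slope offset and the slope group variance.
x = pd.Series(hierarchical_trace_nc['b_offset'][:, 75], name='slope offset b_offset_75')
y = pd.Series(hierarchical_trace_nc['sigma_b'], name='slope group variance sigma_b')
sns.jointplot(x, y, ylim=(0, .7))
```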
This is the space the sampler sees; you can see how the funnel is flattened out. We can freely change the (relative) slope offset parameters even if the slope group variance is tiny as it just acts as a scaling parameter.
Note that the funnel is still there – it’s a perfectly valid property of the model – but the sampler has a much easier time exploring it in this different parameterization.
Why hierarchical models are Bayesian¶
Finally, I want to take the opportunity to make another point that is not directly related to hierarchical models but can be demonstrated quite well here.
Usually when talking about the benefits of Bayesian statistics we talk about priors, uncertainty, and flexibility when coding models using Probabilistic Programming. However, an even more important property is rarely mentioned because it is much harder to communicate. Ross Taylor touched on this point in a tweet.
Michael Betancourt makes a similar point when he says “Expectations are the only thing that make sense.”
But what’s wrong with maxima/modes? Aren’t those really close to the posterior mean (i.e. the expectation)? Unfortunately, that’s only the case for the simple models we teach to build up intuitions. In complex models, like the hierarchical one, the MAP can be far away from the posterior mean and not be interesting or meaningful at all.
Let’s compare the posterior mode (i.e. the MAP) to the posterior mean of our hierarchical linear regression model:
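Finding the MAP with PyMC3’s optimizer (a sketch; here the centered model is used):

```python
with hierarchical_model:
    # Optimize the log posterior; by default this runs scipy's BFGS optimizer.
    map_estimate = pm.find_MAP()
```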
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 61.050344
         Iterations: 271
         Function evaluations: 639
         Gradient evaluations: 628
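Inspecting the county slopes and the slope group variance at the MAP (a sketch; HalfCauchy variables are optimized on the log scale, so the exact key name for sigma_b is an assumption):

```python
# County slopes at the MAP: they all end up essentially identical.
print(map_estimate['b'])

# Slope group variance at the MAP; exponentiate the log-transformed value
# (the key name may differ across PyMC3 versions).
print(np.exp(map_estimate['sigma_b_log_']))
```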
As you can see, the slopes are all identical and the group slope variance is effectively zero. The reason is again related to the funnel: the MAP only cares about the probability density, which is highest at the bottom of the funnel.
But if you could only choose one point in parameter space to summarize the posterior above, would this be the one you’d pick? Probably not.
Let’s instead look at the Expected Value (i.e. the posterior mean), which is computed by integrating probability density over volume to give probability mass – the thing we really care about. Under the hood, that’s the integration performed by the MCMC sampler.
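The corresponding posterior means from the (non-centered) trace, as a sketch:

```python
# Posterior mean of each county slope.
print(hierarchical_trace_nc['b'].mean(axis=0))

# Posterior mean of the slope group variance.
print(hierarchical_trace_nc['sigma_b'].mean())
```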
Quite a difference. This also explains why it can be a bad idea to use the MAP to initialize your sampler: in certain models the MAP is not at all close to the region you want to explore (i.e. the “typical set”).
This strong divergence of the MAP and the posterior mean does not only happen in hierarchical models but also in high-dimensional ones, where our intuitions from low-dimensional spaces get twisted in serious ways. This talk by Michael Betancourt makes the point quite nicely.
So why do people – especially in Machine Learning – still use the MAP/MLE? As we all learned first-hand in high school, integration is much harder than differentiation. This is really the only reason.
Final disclaimer: This might give the impression that this is a property of being in a Bayesian framework, which is not true. Technically, we can talk about expectations vs. modes irrespective of that. Bayesian statistics just happens to provide a very intuitive and flexible framework for expressing and estimating these models.
See here for the underlying notebook of this blog post.
Acknowledgements¶
Thanks to Jon Sedar for helpful comments on an earlier draft.
07/02/2017

CPython 3.5.2
IPython 5.1.0

numpy 1.11.2
scipy 0.18.1
pandas 0.19.1
matplotlib 1.5.3
seaborn 0.7.1
patsy 0.4.1
pymc3 3.0
theano 0.7.0
joblib 0.10.3

compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system     : Linux
release    : 4.4.0-59-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit
Git hash   : 5d2767de3bbae72c7de7965f2afcb533df8e7ad3