When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treating it as a black box essentially. The beauty of probabilistic programming is that you actually don’t have to understand how the inference works in order to build models, but it certainly helps.
When I presented a new Bayesian model to Quantopian’s CEO, Fawce, who wasn’t trained in Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: “Thomas, how does the inference actually work? How do we get these magical samples from the posterior?”.
Now I could have said: “Well that’s easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov-chain that has as its equilibrium distribution the target posterior distribution. Questions?”.
That statement is correct, but is it useful? My pet peeve with how math and stats are taught is that no one ever tells you about the intuition behind the concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught and I had to spend countless hours banging my head against the wall until that euraka moment came about. Usually things weren’t as scary or seemingly complex once I deciphered what it meant.
This blog post is an attempt at trying to explain the intuition behind MCMC sampling (specifically, the random-walk Metropolis algorithm). Critically, we’ll be using code examples rather than formulas or math-speak. Eventually you’ll need that but I personally think it’s better to start with the an example and build the intuition before you move on to the math.
The problem and its unintuitive solution¶
Lets take a look at Bayes formula:
We have $P(\theta | x)$, the probability of our model parameters $\theta$ given the data $x$ and thus our quantity of interest. To compute this we multiply the prior $P(\theta)$ (what we think about $\theta$ before we have seen any data) and the likelihood $P(x | \theta)$, i.e. how we think our data is distributed. This nominator is pretty easy to solve for. |
However, lets take a closer look at the denominator. $P(x)$ which is also called the evidence (i.e. the evidence that the data x was generated by this model). We can compute this quantity by integrating over all possible parameter values:
This is the key difficulty with Bayes formula – while the formula looks innocent enough, for even slightly non-trivial models you just can’t compute the posterior in a closed-form way.
Now we might say “OK, if we can’t solve something, could we try to approximate it? For example, if we could somehow draw samples from that posterior we can Monte Carlo approximate it.” Unfortunately, to directly sample from that distribution you not only have to solve Bayes formula, but also invert it, so that’s even harder.
Then we might say “Well, instead let’s construct an ergodic, reversible Markov chain that has as an equilibrium distribution which matches our posterior distribution”. I’m just kidding, most people wouldn’t say that as it sounds bat-shit crazy. If you can’t compute it, can’t sample from it, then constructing that Markov chain with all these properties must be even harder.
The surprising insight though is that this is actually very easy and there exist a general class of algorithms that do this called Markov chain Monte Carlo (constructing a Markov chain to do Monte Carlo approximation).
Setting up the problem¶
First, lets import our modules.
In [1]:
Lets generate some data: 20 points from a normal centered around zero. Our goal will be to estimate the posterior of the mean mu
(we’ll assume that we know the standard deviation to be 1).
In [2]:
In [3]:
Next, we have to define our model. In this simple case, we will assume that this data is normal distributed, i.e. the likelihood of the model is normal. As you know, a normal distribution has two parameters – mean $\mu$ and standard deviation $\sigma$. For simplicity, we’ll assume we know that $\sigma = 1$ and we’ll want to infer the posterior for $\mu$. For each parameter we want to infer, we have to chose a prior. For simplicity, lets also assume a Normal distribution as a prior for $\mu$. Thus, in stats speak our model is:
What is convenient, is that for this model, we actually can compute the posterior analytically. That’s because for a normal likelihood with known standard deviation, the normal prior for mu
is conjugate (conjugate here means that our posterior will follow the same distribution as the prior), so we know that our posterior for $\mu$ is also normal. We can easily look up on wikipedia how we can compute the parameters of the posterior. For a mathemtical derivation of this, see here.
In [4]:
This shows our quantity of interest, the probability of $\mu$’s values after having seen the data, taking our prior information into account. Lets assume, however, that our prior wasn’t conjugate and we couldn’t solve this by hand which is usually the case.
Explaining MCMC sampling with code¶
Now on to the sampling logic. At first, you find starting parameter position (can be randomly chosen), lets fix it arbitrarily to:
Then, you propose to move (jump) from that position somewhere else (that’s the Markov part). You can be very dumb or very sophisticated about how you come up with that proposal. The Metropolis sampler is very dumb and just takes a sample from a normal distribution (no relationship to the normal we assume for the model) centered around your current mu
value (i.e. mu_current
) with a certain standard deviation (proposal_width
) that will determine how far you propose jumps (here we’re use scipy.stats.norm
):
Next, you evaluate whether that’s a good place to jump to or not. If the resulting normal distribution with that proposed mu
explaines the data better than your old mu
, you’ll definitely want to go there. What does “explains the data better” mean? We quantify fit by computing the probability of the data, given the likelihood (normal) with the proposed parameter values (proposed mu
and a fixed sigma = 1
). This can easily be computed by calculating the probability for each data point using scipy.stats.normal(mu, sigma).pdf(data)
and then multiplying the individual probabilities, i.e. compute the likelihood (usually you would use log probabilities but we omit this here):
Up until now, we essentially have a hill-climbing algorithm that would just propose movements into random directions and only accept a jump if the mu_proposal
has higher likelihood than mu_current
. Eventually we’ll get to mu = 0
(or close to it) from where no more moves will be possible. However, we want to get a posterior so we’ll also have to sometimes accept moves into the other direction. The key trick is by dividing the two probabilities,
we get an acceptance probability. You can already see that if p_proposal
is larger, that probability will be > 1
and we’ll definitely accept. However, if p_current
is larger, say twice as large, there’ll be a 50% chance of moving there:
This simple procedure gives us samples from the posterior.
Why does this make sense?¶
Taking a step back, note that the above acceptance ratio is the reason this whole thing works out and we get around the integration. We can show this by computing the acceptance ratio over the normalized posterior and seeing how it’s equivalent to the acceptance ratio of the unnormalized posterior (lets say $\mu_0$ is our current position, and $\mu$ is our proposal):
In words, dividing the posterior of proposed parameter setting by the posterior of the current parameter setting, $P(x)$ – that nasty quantity we can’t compute – gets canceled out. So you can intuit that we’re actually dividing the full posterior at one position by the full posterior at another position (no magic here). That way, we are visiting regions of high posterior probability relatively more often than those of low posterior probability.
Putting it all together¶
In [5]:
To visualize the sampling, we’ll create plots for some quantities that are computed. Each row below is a single iteration through our Metropolis sampler.
The first columns is our prior distribution – what our belief about $\mu$ is before seeing the data. You can see how the distribution is static and we only plug in our $\mu$ proposals. The vertical lines represent our current $\mu$ in blue and our proposed $\mu$ in either red or green (rejected or accepted, respectively).
The 2nd column is our likelihood and what we are using to evaluate how good our model explains the data. You can see that the likelihood function changes in response to the proposed $\mu$. The blue histogram which is our data. The solid line in green or red is the likelihood with the currently proposed mu
. Intuitively, the more overlap there is between likelihood and data, the better the model explains the data and the higher the resulting probability will be. The dotted line of the same color is the proposed mu
and the dotted blue line is the current mu
.
The 3rd column is our posterior distribution. Here I am displaying the normalized posterior but as we found out above, we can just multiply the prior value for the current and proposed $\mu$’s by the likelihood value for the two $\mu$’s to get the unnormalized posterior values (which we use for the actual computation), and divide one by the other to get our acceptance probability.
The 4th column is our trace (i.e. the posterior samples of $\mu$ we’re generating) where we store each sample irrespective of whether it was accepted or rejected (in which case the line just stays constant).
Note that we always move to relatively more likely $\mu$ values (in terms of their posterior density), but only sometimes to relatively less likely $\mu$ values, as can be seen in iteration 14 (the iteration number can be found at the top center of each row).
In [6]:
Now the magic of MCMC is that you just have to do that for a long time, and the samples that are generated in this way come from the posterior distribution of your model. There is a rigorous mathematical proof that guarantees this which I won’t go into detail here.
To get a sense of what this produces, lets draw a lot of samples and plot them.
In [7]:
This is usually called the trace. To now get an approximation of the posterior (the reason why we’re doing all this), we simply take the histogram of this trace. It’s important to keep in mind that although this looks similar to the data we sampled above to fit the model, the two are completely separate. The below plot represents our belief in mu
. In this case it just happens to also be normal but for a different model, it could have a completely different shape than the likelihood or prior.
In [8]:
As you can see, by following the above procedure, we get samples from the same distribution as what we derived analytically.
Above we set the proposal width to 0.5
. That turned out to be a pretty good value. In general you don’t want the width to be too narrow because your sampling will be inefficient as it takes a long time to explore the whole parameter space and shows the typical random-walk behavior:
In [9]:
But you also don’t want it to be so large that you never accept a jump:
In [10]:
Note, however, that we are still sampling from our target posterior distribution here as guaranteed by the mathemtical proof, just less efficiently:
In [11]:
With more samples this will eventually look like the true posterior. The key is that we want our samples to be independent of each other which cleary isn’t the case here. Thus, one common metric to evaluate the efficiency of our sampler is the autocorrelation – i.e. how correlated a sample i
is to sample i-1
, i-2
, etc:
In [12]:
Obviously we want to have a smart way of figuring out the right step width automatically. One common method is to keep adjusting the proposal width so that roughly 50% proposals are rejected.
Extending to more complex models¶
Now you can easily imagine that we could also add a sigma
parameter for the standard-deviation and follow the same procedure for this second parameter. In that case, we would be generating proposals for mu
and sigma
but the algorithm logic would be nearly identical. Or, we could have data from a very different distribution like a Binomial and still use the same algorithm and get the correct posterior. That’s pretty cool and a huge benefit of probabilistic programming: Just define the model you want and let MCMC take care of the inference.
For example, the below model can be written in PyMC3
quite easily. Below we also use the Metropolis sampler (which automatically tunes the proposal width) and see that we get identical results. Feel free to play around with this and change the distributions. For more information, as well as more complex examples, see the PyMC3 documentation.
In [13]:
[—————–100%—————–] 15000 of 15000 complete in 1.7 sec
Conclusions¶
We glossed over a lot of detail which is certainly important but there are many other posts that deal with that. Here, we really wanted to communicate the idea of MCMC and the Metropolis sampler. Hopefully you will have gathered some intuition which will equip you to read one of the more technical introductions to this topic.
Other, more fancy, MCMC algorithms like Hamiltonian Monte Carlo actually work very similar to this, they are just much more clever in proposing where to jump next.
This blog post was written in a Jupyter Notebook, you can find the underlying NB with all its code here.
Support me on Patreon¶
Finally, if you enjoyed this blog post, consider supporting me on Patreon which allows me to devote more time to writing new blog posts.