Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network

(c) 2016 by Thomas Wiecki

Recently, I blogged about Bayesian Deep Learning with PyMC3, where I built a simple hand-coded Bayesian Neural Network and fit it on a toy data set. Today, we will build a more interesting model using Lasagne, a flexible Theano library for constructing various types of Neural Networks. Since PyMC3 also uses Theano, it should be possible to build the Artificial Neural Network (ANN) in Lasagne, place Bayesian priors on our parameters, and then use variational inference (ADVI) in PyMC3 to estimate the model. To my delight, it is not only possible but also very straightforward.

Below, I will first show how to bridge PyMC3 and Lasagne to build a dense 2-layer ANN. We’ll then use mini-batch ADVI to fit the model on the MNIST handwritten digit data set. Then, we will follow up on another idea expressed in my last blog post – hierarchical ANNs. Finally, due to the power of Lasagne, we can just as easily build a Hierarchical Bayesian Convolutional ANN with max-pooling layers to achieve 98% accuracy on MNIST.

Most of the code used here is borrowed from the Lasagne tutorial.


Data set: MNIST

We will be using the classic MNIST data set of handwritten digits. In contrast to my previous blog post, which was limited to a toy data set, MNIST is an actually challenging ML task (though of course not quite as challenging as e.g. ImageNet) with a reasonable number of dimensions and data points.
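
A minimal loading sketch using scikit-learn’s fetch_openml (an assumption on my part; the resulting X_train, y_train, X_test, y_test arrays are reused in the sketches below):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the 70,000 28x28 grayscale digits and scale pixel values to [0, 1].
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = (mnist.data / 255.).astype(np.float64)
y = mnist.target.astype(np.int64)

# Reshape to (n, channels, rows, columns), the layout Lasagne's layers expect.
X = X.reshape(-1, 1, 28, 28)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=0)
```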


Model specification

I imagined that it should be possible to bridge Lasagne and PyMC3 just because they both rely on Theano. However, it was unclear how difficult it was really going to be. Fortunately, a first experiment worked out very well, but there were some potential ways in which this could be made even easier. I opened a GitHub issue on Lasagne’s repo, and a few days later PR695 was merged, which allowed for an even nicer integration of the two, as I show below. Long live OSS.

First, the Lasagne function to create an ANN with two fully connected hidden layers of 800 neurons each. This is pure Lasagne code, taken almost directly from the tutorial. The trick comes in when creating the layers with lasagne.layers.DenseLayer, where we can pass in a function init which has to return a Theano expression to be used as the weight and bias matrices. This is where we will pass in our PyMC3-created priors, which are also just Theano expressions:
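
A sketch of such a builder, assuming tanh nonlinearities, Theano shared variables for the inputs and targets, and an init callable that takes a shape and returns a Theano expression:

```python
import numpy as np
import theano
import lasagne

# Shared variables holding the current mini-batch; using shared variables lets
# us swap in the test set later without rebuilding the graph (an assumption
# mirroring the setup of the previous post).
input_var = theano.shared(X_train[:500, ...].astype(np.float64))
target_var = theano.shared(y_train[:500].astype(np.int64))

def build_ann(init):
    # Input layer matching the (batch, channel, row, column) layout of MNIST.
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

    # Two fully connected hidden layers with 800 units each. Instead of a
    # standard Lasagne initializer, `init` is a callable that returns a Theano
    # expression (here: a PyMC3 random variable) for the weights and biases.
    l_hid1 = lasagne.layers.DenseLayer(
        l_in, num_units=800,
        nonlinearity=lasagne.nonlinearities.tanh,
        W=init, b=init)
    l_hid2 = lasagne.layers.DenseLayer(
        l_hid1, num_units=800,
        nonlinearity=lasagne.nonlinearities.tanh,
        W=init, b=init)

    # Softmax output over the 10 digit classes.
    l_out = lasagne.layers.DenseLayer(
        l_hid2, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax,
        W=init, b=init)

    # Symbolic class probabilities; the categorical likelihood is attached
    # inside the PyMC3 model further below.
    return lasagne.layers.get_output(l_out)
```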


Next, the function which creates the weights for the ANN. Because PyMC3 requires every random variable to have a different name, we create a class instead which generates uniquely named priors.

The priors act as regularizers here to try and keep the weights of the ANN small. It’s mathematically equivalent to putting an L2 loss term that penalizes large weights into the objective function, as is commonly done.
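
A sketch of such a class, assuming a fixed sd=0.1 Gaussian prior per weight tensor; it has to be called inside a PyMC3 model context:

```python
import numpy as np
import pymc3 as pm

class GaussWeights(object):
    """Callable returning a uniquely named Normal prior for each weight tensor."""
    def __init__(self):
        self.count = 0

    def __call__(self, shape):
        # Must be called inside a `with pm.Model():` context so the random
        # variable gets registered with the model.
        self.count += 1
        return pm.Normal('w%d' % self.count, mu=0., sd=.1,
                         testval=np.random.normal(size=shape),
                         shape=shape)
```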


If you compare what we have done so far to the previous blog post, it’s apparent that using Lasagne is much more comfortable. We don’t have to manually keep track of the shapes of the individual matrices, nor do we have to handle the underlying matrix math to make it all fit together.

Next are some functions to set up mini-batch ADVI; you can find more information in the prior blog post.
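
A sketch along the lines of the previous post: infinite generators that yield random mini-batches (the batch size of 500 is an assumption):

```python
import numpy as np

def create_minibatch(data, batchsize=500):
    """Yield an endless stream of random mini-batches drawn from `data`."""
    rng = np.random.RandomState(0)
    while True:
        # Sample `batchsize` row indices with replacement.
        ixs = rng.randint(len(data), size=batchsize)
        yield data[ixs]

# Both generators use the same fixed seed, so the sampled indices stay
# aligned between inputs and targets.
minibatch_x = create_minibatch(X_train)
minibatch_y = create_minibatch(y_train)
```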


Putting it all together

Let’s run our ANN with mini-batch ADVI:
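
The variational API has changed across PyMC3 versions; a sketch using the PyMC3 3.0-era advi_minibatch interface (the learning rate, epsilon, and number of posterior draws are assumptions) looks roughly like this:

```python
import pymc3 as pm

with pm.Model() as neural_network:
    # Building the Lasagne graph inside the model context registers every
    # prior created by GaussWeights with PyMC3.
    prediction = build_ann(GaussWeights())
    # Categorical likelihood over the 10 digit classes.
    out = pm.Categorical('out', p=prediction, observed=target_var)

with neural_network:
    # Mini-batch ADVI: substitute fresh mini-batches for the shared input and
    # target variables at every iteration.
    v_params = pm.variational.advi_minibatch(
        n=50000,
        minibatch_tensors=[input_var, target_var],
        minibatch_RVs=[out],
        minibatches=zip(minibatch_x, minibatch_y),
        total_size=len(y_train),
        learning_rate=1e-2, epsilon=1.0)

    # Draw samples from the fitted variational posterior for later prediction.
    trace = pm.variational.sample_vp(v_params, draws=500)
```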


Average ELBO = -104,369.28: 100%|██████████| 50000/50000 [4:06:03<00:00, 2.63it/s]
Finished minibatch ADVI: ELBO = -99,915.55

Make sure everything converged:
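
For example, by plotting the ELBO values recorded during optimization (elbo_vals is the attribute name the ADVI result object exposed at the time; treat it as an assumption):

```python
import matplotlib.pyplot as plt

# The per-iteration ELBO recorded by ADVI should flatten out towards the end.
plt.plot(v_params.elbo_vals)
plt.xlabel('iteration')
plt.ylabel('ELBO')
plt.show()
```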

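To compute the test-set accuracy, we can swap the test data into the shared variables, draw posterior predictive samples, and take the most frequent predicted class per test point; a sketch (pm.sample_ppc was the PyMC3 3.x name for posterior predictive sampling):

```python
import numpy as np
import pymc3 as pm
from scipy import stats

# Point the shared variables at the test set.
input_var.set_value(X_test.astype(np.float64))
target_var.set_value(y_test.astype(np.int64))

with neural_network:
    # 100 posterior predictive draws of the categorical output 'out'.
    ppc = pm.sample_ppc(trace, samples=100)

# Use the most frequent sampled class per test point as the point prediction.
y_pred = stats.mode(ppc['out'], axis=0).mode[0]
print('Accuracy on test data = {:.2f}%'.format((y_pred == y_test).mean() * 100))
```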

Accuracy on test data = 89.81%

The performance is not incredibly high, but hey, it seems to actually work.

Hierarchical Neural Network: Learning Regularization from data

The connection between the standard deviation of the weight prior and the strength of the L2 penalization term leads to an interesting idea. Above we just fixed sd=0.1 for all layers, but maybe the first layer should have a different value than the second. And maybe 0.1 is too small or too large to begin with. In Bayesian modeling it is quite common to just place hyperpriors in cases like this and learn the optimal regularization to apply from the data. This saves us from tuning that parameter in a costly hyperparameter optimization. For more information on hierarchical modeling, see my other blog post.
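
A sketch of a hierarchical version of the weight class, where each layer gets its own standard-deviation hyperprior (the HalfNormal scale of 1 is an assumption):

```python
import numpy as np
import pymc3 as pm

class GaussWeightsHierarchical(object):
    """Like GaussWeights, but each layer's prior sd is learned from the data."""
    def __init__(self):
        self.count = 0

    def __call__(self, shape):
        self.count += 1
        # Layer-specific hyperprior on how much regularization to apply.
        regularization = pm.HalfNormal('reg_hyper%d' % self.count, sd=1.)
        return pm.Normal('w%d' % self.count, mu=0., sd=regularization,
                         testval=np.random.normal(size=shape),
                         shape=shape)
```

Passing this class to build_ann instead of GaussWeights gives the hierarchical network, which is then fit with the same mini-batch ADVI call as above.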


Average ELBO = -110,551.60: 100%|██████████| 50000/50000 [4:46:04<00:00, 2.74it/s]
Finished minibatch ADVI: ELBO = -107,691.92


Accuracy on test data = 92.26%

We get a small but nice boost in accuracy. Let’s look at the posteriors of our hyperparameters:
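
For example with a traceplot restricted to the regularization hyperpriors (hier_trace is a hypothetical name for the hierarchical model’s variational trace):

```python
import pymc3 as pm

# hier_trace: samples from the hierarchical model's variational posterior,
# obtained with sample_vp just like `trace` above.
pm.traceplot(hier_trace,
             varnames=[v for v in hier_trace.varnames if 'reg_hyper' in v])
```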


Interestingly, they are all quite different, suggesting that it makes sense to vary the amount of regularization that gets applied at each layer of the network.

Convolutional Neural Network

This is pretty nice, but everything so far would also have been pretty simple to implement directly in PyMC3, as I have shown in my previous post. Where things get really interesting is that we can now build way more complex ANNs, like Convolutional Neural Nets:
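
A sketch of such an architecture, loosely following the CNN from the Lasagne tutorial, with init again supplying PyMC3 priors for the weights (filter counts, filter sizes, and the number of dense units are assumptions):

```python
import lasagne

def build_ann_conv(init):
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                     input_var=input_var)

    # Two convolution + max-pooling stages; the filter weights W get PyMC3
    # priors via `init`, the biases keep Lasagne's default initialization.
    l_conv1 = lasagne.layers.Conv2DLayer(
        l_in, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.tanh, W=init)
    l_pool1 = lasagne.layers.MaxPool2DLayer(l_conv1, pool_size=(2, 2))

    l_conv2 = lasagne.layers.Conv2DLayer(
        l_pool1, num_filters=32, filter_size=(5, 5),
        nonlinearity=lasagne.nonlinearities.tanh, W=init)
    l_pool2 = lasagne.layers.MaxPool2DLayer(l_conv2, pool_size=(2, 2))

    # Fully connected layer followed by a softmax over the 10 classes.
    l_hid = lasagne.layers.DenseLayer(
        l_pool2, num_units=256,
        nonlinearity=lasagne.nonlinearities.tanh, W=init, b=init)
    l_out = lasagne.layers.DenseLayer(
        l_hid, num_units=10,
        nonlinearity=lasagne.nonlinearities.softmax, W=init, b=init)

    return lasagne.layers.get_output(l_out)
```

The model is then wrapped and fit exactly as before, only with build_ann_conv in place of build_ann.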


Average ELBO = -20,160.79: 100%|██████████| 50000/50000 [12:39:53<00:00, 1.21it/s]
Finished minibatch ADVI: ELBO = -18,618.35


Accuracy on test data = 98.03%

Much higher accuracy – nice. I also tried this with the hierarchical model, but it achieved lower accuracy (95%); I assume due to overfitting.

Let’s make more use of the fact that we’re in a Bayesian framework and explore uncertainty in our predictions. As our predictions are categories, we can’t simply compute the posterior predictive standard deviation. Instead, we compute the chi-square statistic of the predicted class counts, which tells us how uniform a sample is: the more uniform, the higher our uncertainty. I’m not quite sure if this is the best way to do this; leave a comment if there’s a more established method that I don’t know about.
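
A sketch of that computation: for every test point, count how often each digit was predicted across the posterior predictive samples and compute scipy’s chi-square statistic of those counts against a uniform expectation (a low statistic means near-uniform predictions, i.e. high uncertainty):

```python
import numpy as np
from scipy.stats import chisquare

def prediction_uncertainty(ppc_samples, n_classes=10):
    """Chi-square statistic of the predicted-class counts per test point.

    ppc_samples: array of shape (n_samples, n_test) with sampled class labels.
    """
    n_samples, n_test = ppc_samples.shape
    stats = np.empty(n_test)
    for i in range(n_test):
        counts = np.bincount(ppc_samples[:, i], minlength=n_classes)
        # chisquare defaults to uniform expected frequencies.
        stats[i] = chisquare(counts).statistic
    return stats

# ppc, y_pred and y_test as computed above (here for the convolutional model).
chis = prediction_uncertainty(ppc['out'])
print('mean statistic, correctly classified:', chis[y_pred == y_test].mean())
print('mean statistic, misclassified:       ', chis[y_pred != y_test].mean())
```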


As we can see, when the model makes an error, it is much more uncertain in its answer (i.e. the answers provided are more uniform). You might argue that you get the same effect with the multinomial prediction from a regular ANN; however, this is not so.

Conclusions

By bridging Lasagne and PyMC3, and by using mini-batch ADVI to train a Bayesian Neural Network on a decently sized and complex data set (MNIST), we took a big step towards practical Bayesian Deep Learning on real-world problems.

Kudos to the Lasagne developers for designing their API to make it trivial to integrate for this unforeseen application. They were also very helpful and forthcoming in getting this to work.

Finally, I also think this shows the benefits of PyMC3. By relying on a commonly used language (Python) and abstracting the computational backend (Theano), we were able to quite easily leverage the power of that ecosystem and use PyMC3 in a manner that was never anticipated when it was created. I look forward to extending it to new domains.

This blog post was written in a Jupyter Notebook. You can access the notebook here and follow me on Twitter to stay up to date.