Denoising Dirty Documents – Part 10

In my last blog, I explained how to take advantage of an information leakage regarding the repeated backgrounds in Kaggle’s Denoising Dirty Documents competition. The result of that process was that we had done a fairly good job of removing the background. But the score from doing this was not good enough to get a good placing. We need to do some processing on the image to improve the score.

Today we will use an approach that does not require me to do any feature engineering – convolutional neural networks, which are neural networks where the first few layers repeatedly apply the same weights across overlapping regions of the input data. One intuitive way of thinking about this is that it is like applying an edge detection filter (much like I described here) where the algorithm finds the appropriate weights for several different edge filters.

I’m told that convolutional neural networks are inspired by how vision works in the natural world. So if I test whether convolutional neural networks work well, am I giving them a robot eye test?

While I am comfortable coding in R, there is little support for convolutional neural networks in R, and I had to code this in Python, using the Theano library. The reason that I chose Theano is because neural network model fitting can be quite time consuming, and Theano supports GPU based processing, which can be orders of magnitude faster than CPU based calculations. To simply the code, I am using Daniel Nouri’s nolearn library, which sits over the lasagne library, which sits over the Theano library. This is the first time I have coded in Python and the first time I have used convolutional neural networks, so it was a good learning experience.

Since my PCs don’t have top of the line graphics cards with GPU processing support, I decided to run my code in a cloud on a virtual machine with GPU support. And since I didn’t want go through the effort of setting up a Linux machine and installing all of the libraries and compilers, I used Domino Data Labs to host my analysis. You can find my project here.

Before I progress to coding the model, I have to set up the environment.

The Python script shown above creates a function that copies the Theano settings file into the appropriate folder in the virtual machine, so that Theano knows to use GPU processing rather than CPU processing. This function gets called from my main script.

In order to reduce the number of calculations, I used a network architecture that inputs an image and outputs an image. The suggestion for this architecture comes from ironbar, a great guy who placed third in the competition. This is unlike all of the examples I found online, which have just one or two outputs, usually because the online example is a classification problem identifying objects appearing within the image. But there are two issues with this architecture:

it doesn’t allow for fully connected layers before the output, and
the target images are different sizes.

I chose to ignore the first problem, although if I had time I would have tried out a more traditional architecture that included fully connected layers but which only models one target pixel at a time.

For the second problem, I used the script shown above to split the larger images into two smaller images that were the same size as the other small images in the data, thereby standardising the output dimensions.

Theano uses tensors (multi-dimensional matrices) to store the training data and outputs. The first dimension is the index of the training data item. The second dimension is the colourspace information e.g. RGB would be 3 dimensions. Since our images are greyscale, this dimension has a size of only 1. The remaining dimensions are the dimensions of the input data / output data. The script shown above reshapes the data to meet this criteria.The nolearn library simplifes the process of defining the architecture of a neural network. I used 3 hidden convolutional layers, each with 25 image filters. The script below shows how this was achieved.

Due to the unique nature of the problem versus the online examples, my first attempt at this script was not successful. Here are the key changes that I needed to make:

set y_tensor_type=T.tensor4 because the target is 2 dimensional
your graphics card almost certainly doesn’t have enough RAM to process all of the images at once, so you need to use a batch iterator and experiment to find a suitable batch size e.g. batch_iterator_train=BatchIterator(batch_size=25)

I also wanted to plot the loss across the iterations. So I added the two lines above, giving me the graph below.

During the first several iterations, the neural network is balancing out the weights so that the pixels are the correct magnitude, and after that the serious work of image processing begins.

I wanted to see what some of the convolutional filters looked like. So I added the two lines shown above, giving me the set of images below.

These filters look like small parts of images of letters, which makes some sense because we are trying to identify whether a pixel sits on the stroke of a letter.

The output looks reasonable, although not as good as what I achieve using a combination of image processing techniques and deep learning.

In the output image above you can see some imperfections.

I think that if I had changed the network architecture to include fully connected layers then I would have achieved a better result. Maybe one day when I have enough time I will experiment with that architecture.

The full Python script is shown below:

Like this:

Like Loading…

Related