An Even Dozen – Denoising Dirty Documents： Part 12

Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.

The final blog in this series brings the count to an even dozen, and will achieve two aims:

ensemble the models that we have built
take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if each model has statistically independent errors, and each model performs with similar accuracy, then the average prediction across the 4 models will have half the RMSE score of the individual models. One way to increase the statistical independence of the models is to use different feature sets and / or types of models on each. I therefore chose the following combination of models:

deep learning – thresholding based features
deep learning – edge based features
deep learning – median based features
images with backgrounds removed using information leakage
xgboost – wide selection of features
convolutional neural network – using raw images without background removal pre-processing
convolutional neural network – using images with backgrounds removed using information leakage
deep convolutional neural network – using raw images without background removal pre-processing
deep convolutional neural network – using images with backgrounds removed using information leakage

It turned out that some of these models had errors that weren’t strongly independent to other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However I did test xgboost versus a simple average or a least square linear regression, and it outperformed both. Maybe an elastic net could have done a good job.

Here is the R code for my ensemble:

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned images to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.

Like this:

Like Loading…

Related