Retrospective review of my first deep learning competition

The Kaggle competition Planet: Understanding the Amazon from Space has just ended. After teaming up with Dmitry, we were lucky to reach 100th place among 959 teams. I find the result good enough: it was the first deep learning project I built from scratch. It was also a real gift of fortune: before the last submissions we were placed 200+ on the public leaderboard with almost no chance of getting close to the medal tier. Obviously, I will not be that lucky every time, so it's time to understand my mistakes for the future.

The competition was a classic example of multilabel image classification: given a 256×256 satellite image, one should annotate it with one or more labels. The labels were not independent: each sample had exactly one of four weather labels and any number of non-weather labels (e.g. roads, water, mines, etc.). The two types of labels were also interconnected: if an image was tagged as cloudy, it had no other labels for an obvious reason: nothing else can be seen through the clouds.

Some numbers for better context: the metric of this competition was the F2 score, the best result on the public leaderboard was 0.93449, and our final result was 0.93042 (0.93318 and 0.92915 on the private leaderboard, respectively).
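For reference, F2 is the F-beta score with beta = 2, so recall is weighted more heavily than precision: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). A minimal sketch of computing it with scikit-learn (the per-sample averaging is my assumption about how the competition aggregated the score):

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy multilabel data: 3 samples, 17 binary labels each.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=(3, 17))
y_pred = rng.randint(0, 2, size=(3, 17))

# beta=2 weighs recall twice as much as precision; 'samples' averages
# the per-image F2 scores over the dataset.
print(fbeta_score(y_true, y_pred, beta=2, average='samples'))
```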

Initial pipeline

From the very beginning, I decided to take a complicated path. I was sure a lot of participants would base their solutions on the most obvious scheme: take an ImageNet-pretrained network, add a Dense layer with 17 units and a sigmoid activation, fine-tune it a little, and get your result. So I decided to reinvent the wheel, and it was a mistake.
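For the record, that obvious baseline is only a few lines of Keras. A minimal sketch (the ResNet50 backbone, input size and optimizer are illustrative assumptions, not a reconstruction of anyone's actual solution):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# ImageNet-pretrained backbone without its classification head.
base = ResNet50(weights='imagenet', include_top=False, input_shape=(256, 256, 3))

# 17 non-exclusive labels -> one sigmoid head with binary crossentropy.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(17, activation='sigmoid')(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
```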

Instead, I tried to build a complicated pipeline with two networks: one for the weather labels (softmax activation, categorical crossentropy loss) and another one for the rest of the labels (sigmoid activation, binary crossentropy loss). Both networks were based on VGG and trained independently, which took a lot of time even on my GeForce 1080 Ti.
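Schematically, the split looked like this (a sketch, assuming VGG16 as the backbone and made-up head sizes; 4 exclusive weather labels vs. 13 independent non-weather labels):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

def build_net(num_labels, activation, loss):
    # One VGG-based network per label group, trained independently.
    base = VGG16(weights='imagenet', include_top=False, input_shape=(256, 256, 3))
    x = Flatten()(base.output)
    x = Dense(256, activation='relu')(x)
    out = Dense(num_labels, activation=activation)(x)
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer='adam', loss=loss)
    return model

weather_net = build_net(4, 'softmax', 'categorical_crossentropy')  # exclusive labels
other_net = build_net(13, 'sigmoid', 'binary_crossentropy')        # independent labels
```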

To make things faster, I created both naive and overly sophisticated caching utils to read the images once, re-save them in a more efficient format, and feed the generators with less reading from the filesystem. It was a worthy goal, but the implementation was poor for two reasons:

  • the dataset was not shuffled properly → mini-batches were imbalanced;

  • the files were saved as np.int8 instead of np.uint8 → the dataset was corrupted due to integer overflow.
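The second bug is easy to reproduce: pixel values live in [0, 255], while np.int8 only holds [-128, 127], so everything above 127 silently wraps around:

```python
import numpy as np

pixels = np.array([10, 127, 128, 200, 255], dtype=np.uint8)
print(pixels.astype(np.int8))   # [  10  127 -128  -56   -1]  <- silent overflow
print(pixels)                   # [ 10 127 128 200 255]       <- what was intended
```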

No surprise this solution performed badly: the score was something like 0.8.

Wrong turn

Instead of looking for bugs and fixing them, I decided to resort to black magic, a.k.a. some less traditional approaches.

I had read a little about autoencoders in “Machine Learning with TensorFlow” and decided to build embeddings with an autoencoder and feed them to shallow machine learning algorithms. The simplest autoencoder was built quickly enough, and its representations were extracted and mixed with traditional CV features (e.g. per-channel histograms, mean, variance, etc.). For each class I trained a LightGBM classifier, chose a threshold independently, and sent a submission. The result was not that bad, around 0.88, so I improved my score while still sitting at the bottom of the leaderboard.
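Roughly, that stage looked like the sketch below: one binary LightGBM model and one probability threshold per label. Feature shapes, hyperparameters and the fact that the thresholds are searched on a held-out split are all illustrative assumptions (as the next paragraph explains, my actual threshold search was less clean):

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import fbeta_score

def fit_per_label(X_train, Y_train, X_val, Y_val, n_labels=17):
    """X_*: autoencoder embeddings concatenated with handcrafted stats,
    Y_*: binary label matrix of shape (n_samples, n_labels)."""
    models, thresholds = [], []
    for i in range(n_labels):
        clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
        clf.fit(X_train, Y_train[:, i])
        proba = clf.predict_proba(X_val)[:, 1]
        # Independent grid search of the decision threshold for this label.
        best_t = max(np.arange(0.05, 0.95, 0.01),
                     key=lambda t: fbeta_score(Y_val[:, i], proba > t, beta=2))
        models.append(clf)
        thresholds.append(best_t)
    return models, thresholds
```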

This was the moment when I broke my validation scheme: the thresholds were optimized on the same data the validation score was computed on, so I could not trust the validation numbers anymore. I didn't care: there were plenty of free submissions every day, so I could validate myself against the leaderboard alone.

After mixing the autoencoder features, the classic features, and the predictions of my weak classifiers and feeding them into LightGBM, I achieved 0.919, roughly top 30% at the time.

Common sense is partially back

I realized that people were reaching 0.92 with public kernels and minor changes (e.g. increasing the number of epochs or replacing Flatten layers with GlobalMaxPooling). That's when I had to admit it: my pipeline was buggy and needed to be rebuilt ASAP.
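That GlobalMaxPooling tweak is literally a one-line change in the model definition; a sketch of the swap (VGG16 is just an example backbone):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalMaxPooling2D
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet', include_top=False, input_shape=(256, 256, 3))

# GlobalMaxPooling2D instead of Flatten: far fewer parameters feed the Dense
# head, which tends to overfit less on a dataset of this size.
x = GlobalMaxPooling2D()(base.output)
outputs = Dense(17, activation='sigmoid')(x)
model = Model(inputs=base.input, outputs=outputs)
```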

Instead of building something from scratch again, I decided to dedicate some time to reading code from other deep learning competitions: people there had to deal with the same problems. I looked at Vladimir Iglovikov and Sergey Mushinskiy's solution for the recent DSTL competition and realized that I should use h5py. It took some time to recode the pipeline almost from scratch, but it paid off: two fine-tuned ResNet50 networks (one for weather and another one for the rest) gave me a score near 0.923.
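The h5py idea boils down to decoding the JPEGs once, storing them in a single HDF5 file, and reading batches from it during training. A minimal sketch (the file layout, names and normalization are my assumptions, not the DSTL code):

```python
import h5py
import numpy as np

def build_cache(images, labels, path='train_cache.h5'):
    # One-off conversion: store the decoded images as a single uint8 array.
    with h5py.File(path, 'w') as f:
        f.create_dataset('images', data=np.asarray(images, dtype=np.uint8))
        f.create_dataset('labels', data=np.asarray(labels, dtype=np.uint8))

def batch_generator(path, batch_size=32):
    # Slices are read lazily from disk; no JPEG decoding at training time.
    with h5py.File(path, 'r') as f:
        images, labels = f['images'], f['labels']
        n = images.shape[0]
        while True:
            for start in range(0, n, batch_size):
                x = images[start:start + batch_size].astype(np.float32) / 255.0
                y = labels[start:start + batch_size]
                yield x, y
```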

The pipeline still had a lot of comments like FixMe: add this and that. Augmentation was very straightforward and needed much more variance. If only I had had this pipeline from the very beginning, instead of the square-wheeled bike I built first…
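More variance could have come almost for free: satellite tiles have no preferred orientation, so flips and rotations are safe. A sketch with Keras' ImageDataGenerator (the exact ranges are guesses, not what we actually used):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=90,      # satellite tiles look valid at any rotation
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.1,
    rescale=1.0 / 255,
)
# train_flow = augmenter.flow(x_train, y_train, batch_size=32)
```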

There was not much time left, so I hacked together some experiments without proper rigor: rare commits, no context in file names; that's how some of them were wasted. E.g. I trained a network with preprocessing A and made a submission based on preprocessing B, which was an obvious fail. Some simplified pieces of this code are available as a Kaggle kernel.

Teaming up

Two hours before the merge deadline, Dmitry and I decided to form a team. Both of us were newbies in DL; however, we had done some work and reached scores we were not ashamed of (about 0.923 for me and 0.926 for Dmitry). We made a deal to blend our solutions at the very end, but we didn't discuss what exactly each of us should bring to the final melting pot.

Dmitry kept fitting single ResNets and VGGs with test-time augmentation, while I stuck to my two-network line and also tried a double-output network (it was mentioned in the “Deep Learning Book” as an interesting regularization tool). So in the end we had several different solutions, and none of them was good enough to approach the top 100: we were sitting at 200+ place.
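The double-output idea is a single shared backbone with two heads trained jointly, so the weather and non-weather tasks regularize each other. A sketch of what such a model can look like (the backbone choice and loss weights are assumptions):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = ResNet50(weights='imagenet', include_top=False, input_shape=(256, 256, 3))
features = GlobalAveragePooling2D()(base.output)

# Two heads over the same features: exclusive weather labels vs. the rest.
weather = Dense(4, activation='softmax', name='weather')(features)
other = Dense(13, activation='sigmoid', name='other')(features)

model = Model(inputs=base.input, outputs=[weather, other])
model.compile(optimizer='adam',
              loss={'weather': 'categorical_crossentropy',
                    'other': 'binary_crossentropy'},
              loss_weights={'weather': 1.0, 'other': 1.0})
```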

On the very last day, we discussed potential ensembling. Blending and stacking require accurate K-fold predictions for the validation data, and we did not have them for some of the models. Another potential problem was validation itself: it's too easy to fuck up the validation while stacking, and the local score may not match the leaderboard. That's a significant risk for the last call: we only had 3 submissions left before the end of the competition. That's why we chose conservative voting.
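The voting itself is trivial: a tag gets into the final answer for an image if enough of the models predicted it. A sketch of the intended logic (the 5-of-9 threshold is an assumption):

```python
from collections import Counter

def vote(tag_lists, min_votes=5):
    """tag_lists: one list of predicted tags per model, for a single image."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return sorted(tag for tag, n in counts.items() if n >= min_votes)

# Example with 3 models and a 2-vote threshold:
print(vote([['clear', 'primary'], ['clear', 'road'], ['clear', 'primary']],
           min_votes=2))   # ['clear', 'primary']
```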

In the end, 9 models were mixed: seven different convolutional networks + one linear blend + one gradient boosting model. I coded the voting function with a funny bug and almost corrupted one of the submissions; luckily for me, Kaggle didn't accept a result file with no tags provided.

The bug was hiding in the row = row.to_dict() line: the merged dataset had 9 prediction columns (one per model), but some of them ended up with the same name, which is how pd.merge resolves collisions. Converting the pd.Series to a dict therefore silently dropped the columns with duplicate names (only image_tags, image_tags_x, and image_tags_y were left), so no tag could ever collect more than 3 votes.
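The effect is easy to reproduce once the duplicate column names are there (constructed directly below instead of via repeated merges): a pd.Series with repeated index labels silently collapses them in to_dict().

```python
import pandas as pd

# Ten columns, but only four distinct names, mimicking the merged predictions.
merged = pd.DataFrame(
    [['img_0'] + [f'model_{i}' for i in range(9)]],
    columns=['image_name'] + ['image_tags', 'image_tags_x', 'image_tags_y'] * 3,
)

row = merged.iloc[0]
print(len(row))            # 10 values in the Series
print(len(row.to_dict()))  # 4 keys: duplicate names were silently collapsed
```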

However, the bug was fixed, the ensemble voted, and the diversity showed a good result: 128th place on the public leaderboard. Several hours later the private leaderboard was revealed, and we were lucky enough to enter the medal tier. Not a huge achievement, but good enough for newbies like us.

Lessons learned

I have made a lot of mistakes, and now I feel much more familiar with image classification. Some reminders for the future:

  1. Do not start with black magic. Build a classic solution first, the way the textbooks for dummies describe it. You're not a DL rockstar; you're just putting well-known approaches together.

  2. Treat your Kaggle code as well as you treat your production code. Even more carefully, in fact: if you push some crap into a work repository, your colleagues may catch it in code review, but there are no code reviewers on Kaggle.

  3. If you team up, discuss the plan in detail from the beginning. The last day is not a good time for that.

  4. Deep learning competitions are a great source of both fun and knowledge (much better than other competitions, where people do stacking over stacking on top of some more stacking). Keep participating!