Understanding how Deep Learning learns to play SET®

In the past few years, deep learning has seen incredible success in image recognition applications. In this post I examine how to train a convolutional neural network to recognize playing card images from a game called SET®, explore the structure of the model to get some insight into what it is “seeing”, and present a webcam application that uses the deployed model in a near-realtime setting.

SET is a card game where the objective is to find triples of cards, called SETs, which meet certain matching criteria. Each card has four attributes: number, color, shading, and shape. And each attribute comes in three varieties: 1, 2, or 3 (number); green, red, or purple (color); empty, striped, or solid (shading); and diamond, oval, or squiggle (shape). A pack contains one of each of the 81 (3×3×3×3) possible cards.

A SET is a collection of three cards where each attribute is either the same on all three cards, or different on all three. For example, the following is a SET, because the shading and shapes are all the same, while the colors and numbers are all different. But these three cards do not form a SET, since the cards are neither all the same color, nor all different colors:

To play the game, one player lays out 12 cards face up on a table. Players then try to find a SET in the layout. The first player to find one takes the SET and replaces it with three cards from the deck. The winner is the player with the most SETs when the cards run out and no more SETs can be found. It’s also possible for only one player to play the game, patience-style.

The Challenge

How easy is it to get a computer to play a game of SET? A simulated SET computer game is actually no computational challenge at all. This is because the program simply needs to evaluate each triple in the 12 cards, and check if it’s a SET. There are (12 choose 3) = 12!/(3!9!) = 220 triples in 12 cards, so it is trivial for a computer to check them all in a fraction of a second. A human stands no chance of beating the computer. (By the way, for more on the mathematics of SET, have a look at The Joy of SET by McMahon et al.)
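As an illustration, the whole brute-force check fits in a few lines of Python (the card encoding and helper names here are my own, not taken from the project code):

  from itertools import combinations

  def is_set(a, b, c):
      # For each attribute, the three values must be all equal or all distinct.
      return all(len({x, y, z}) in (1, 3) for x, y, z in zip(a, b, c))

  def find_sets(layout):
      # layout: a list of cards, each a (number, color, shading, shape) tuple.
      return [triple for triple in combinations(layout, 3) if is_set(*triple)]

  layout = [(1, "green", "empty", "diamond"),
            (2, "green", "striped", "diamond"),
            (3, "green", "solid", "diamond")]
  print(find_sets(layout))  # one SET: same color and shape, all-different number and shading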

If you want the computer to play SET with real cards, using a camera to capture the scene, the problem is a lot more challenging. This is a computer vision and machine learning problem, since the program needs to detect all the cards in the image, then classify each one with one of the 81 labels corresponding to all possible cards. It needs to do this in an accurate and timely manner too, since human opponents are capable of finding SETs relatively quickly—sometimes in under a second (especially if they are children, in my experience).

So the challenge is to train a machine learning model to correctly classify SET card images. In particular, how well does Deep Learning do on the problem compared to more traditional machine learning approaches? This is the question I attempted to answer.

Card Detection vs. Card Prediction

There are two steps in finding the cards in an image. The first is finding the locations of the cards in the layout by viewing them as white rectangles (detection); the second is assigning a class label to each card image by looking at the shapes on the card (prediction).

To simplify the first step, I used a piece of matte black cardboard for the playing surface. Then it was straightforward to use a computer vision library to detect the card edges (by converting the image to black and white and finding contours; see this example) and extract the individual card images. The second step used machine learning to classify each individual card image.
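A rough sketch of that first, contour-based step is shown below. This is for illustration only—the project itself used a Java vision library, and the threshold and area values here are arbitrary:

  import cv2

  def detect_cards(image_bgr, min_area=5000):
      # Cards are white on a matte black background, so Otsu thresholding separates
      # them cleanly; keep the large, roughly four-cornered contours as candidates.
      gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
      _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
      contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x
      cards = []
      for c in contours:
          if cv2.contourArea(c) < min_area:
              continue
          approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
          if len(approx) == 4:  # quadrilateral: likely a card
              cards.append(approx.reshape(4, 2))
      return cards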

Training Data

Supervised learning requires lots of data, so I produced a series of images of all the SET cards under various lighting conditions, from different angles, and using different cameras. This was actually the second dataset I collected, since initially I underestimated the variation due to lighting conditions, which meant that detecting the card color was very unreliable.

You can see the effect that lighting has on color in the image below, which is made up of all the training images of the same card (1 purple striped squiggle).

In total, the training set consisted of 13,149 card images, which, given there are 81 different cards, works out to an average of about 162 images per class. It is often said that to effectively train a deep learning model you need thousands of images per class. In this case, however, there is less variation between images in the same class than in traditional image classification tasks (think of the variety of images labeled “cat”, for example), so having over a hundred images per class was sufficient.

Training the Model

Using Keras, I trained a convolutional neural network (CNN) on the training data. The data was randomly split 70:20:10 into training, validation, and test sets: 70% of the data was used for training the network, 20% for validation during training, and 10% was held out to measure the test accuracy (the percentage of cards correctly identified) after the best model had been selected.
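For concreteness, a 70:20:10 split might be produced like this (a sketch with random dummy data standing in for the card images; the actual training code is linked later in the post):

  import numpy as np
  from sklearn.model_selection import train_test_split

  # Dummy stand-ins: X holds 150x150 RGB card images, y holds integer labels 0..80.
  X = np.random.rand(810, 150, 150, 3).astype("float32")
  y = np.repeat(np.arange(81), 10)

  # 70% for training, then the remaining 30% split into 20% validation and 10% test.
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)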

The architecture of the CNN was based on one from chapter 5 of Deep Learning with Python by François Chollet. This code snippet shows how the model was built using Keras:

  from keras import layers, models

  model = models.Sequential()
  model.add(layers.Conv2D(32, (3, 3), activation='relu',
                          input_shape=(150, 150, 3)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(128, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(128, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Flatten())
  model.add(layers.Dropout(0.5))
  model.add(layers.Dense(512, activation='relu'))
  model.add(layers.Dense(81, activation='softmax'))

The input images were resized to 150 by 150 (with three color channels). In the network there are four convolutional layers with max pooling, followed by two fully-connected (dense) layers. The final layer uses a softmax activation function to give a probability distribution over the 81 possible card labels.
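Compiling and fitting the model then looks something like the sketch below, reusing the split from the earlier sketch (the optimizer, loss formulation, batch size, and epoch count are illustrative choices, not necessarily the exact settings used):

  model.compile(optimizer="rmsprop",
                loss="sparse_categorical_crossentropy",  # integer labels 0..80
                metrics=["accuracy"])

  history = model.fit(X_train, y_train,
                      validation_data=(X_val, y_val),
                      epochs=100, batch_size=32)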

A dropout layer was used to prevent overfitting. Without dropout, and using a smaller training set, the training accuracy was observed to increase until it hit 1.0 (blue dots in the plot below), while the validation accuracy stopped increasing after a while (blue line, around epoch 34), and then started decreasing—a telltale sign of overfitting. (In the plots below, the x-axis shows the epoch number, and the y-axis the accuracy.)

Still, this run showed that the network could fit the data, and after adding the dropout layer the validation accuracy no longer decreased in the long run, as seen in this 100-epoch run (note the different axis scales from the previous plot):

Training was carried out using Cloudera Data Science Workbench on a shared cluster with a pool of GPUs, although only one GPU was needed for training. The process of training involves lots of trial and error, so having an environment that makes it easy to tweak code and run a highlighted snippet of code is very productive. You can see the Python code used for training here.

Visualizing the Neural Network

One of the nice things about CNNs is that it is possible to visualize some aspects of the network. For example, you can see which parts of a particular image trigger the highest activation in the network. This gives intuition about what the network is doing, and can be useful to see why the network is making classification errors on particular images.

Consider the image below, which was generated using keras-vis in a Cloudera Data Science Workbench session. It shows a heatmap of pixel activation for a simpler network trained only on the three card colors (green, red, purple). For the given card (3 green solid diamonds), it’s interesting that it’s predominantly pixels on the edges of the shapes that activate the network most strongly.

The reason for this is that the edges are a reliable way to determine color whatever the shading of the shape. In the case of an empty shape, you can’t tell what its color is by looking at the interior pixels, only the edge pixels will do. And so the network only uses edge pixels, even when looking at solid shapes.
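For readers who want to produce this kind of heatmap without keras-vis, a minimal Grad-CAM-style sketch in plain TensorFlow/Keras looks roughly like this (the layer name and class index are placeholders):

  import numpy as np
  import tensorflow as tf

  def grad_cam(model, image, last_conv_layer="conv2d_3", class_index=0):
      # Map the input image to the last conv layer's feature maps and the predictions.
      grad_model = tf.keras.Model(model.inputs,
                                  [model.get_layer(last_conv_layer).output, model.output])
      with tf.GradientTape() as tape:
          conv_out, preds = grad_model(image[np.newaxis].astype("float32"))
          class_score = preds[:, class_index]
      grads = tape.gradient(class_score, conv_out)
      weights = tf.reduce_mean(grads, axis=(1, 2))               # average gradients per channel
      cam = tf.nn.relu(tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1))[0]
      return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()         # heatmap normalized to [0, 1]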

What’s impressive is that the network learned this behavior by itself, simply by being trained on lots of image data. Indeed, this demonstrates one of the promises of Deep Learning to extract features automatically, rather than relying on hand-engineered features.

Comparison with Traditional Computer Vision Approaches

In fact, before training a CNN on this dataset I tried a more traditional computer vision approach for card prediction. For predicting the shading of a card, for example, the code would look at a small patch near the center of the card and find the mean pixel value of the binarized image. The idea was that a small value would correspond to empty shading, a medium value to striped shading, and a high value to solid shading. A machine learning algorithm like k-nearest neighbors (k-NN) or Support Vector Machines (SVM) would learn the decision boundaries for each class. This approach worked in practice and gave an accuracy of 93% on held out test data (there was no significant difference between k-NN and SVM). However, the CNN approach—which did not require any feature engineering, and required substantially less code—was far superior, giving over 99% accuracy.
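As an illustration of that hand-engineered shading feature (the patch size and thresholding choices here are arbitrary, not the project's exact values):

  import cv2
  from sklearn.neighbors import KNeighborsClassifier

  def shading_feature(card_bgr, patch=40):
      # Binarize the card image (ink -> 255, white card -> 0), then take the mean of a
      # small patch near the center: low for empty, intermediate for striped, high for solid.
      gray = cv2.cvtColor(card_bgr, cv2.COLOR_BGR2GRAY)
      _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
      h, w = binary.shape
      center = binary[h // 2 - patch // 2:h // 2 + patch // 2,
                      w // 2 - patch // 2:w // 2 + patch // 2]
      return center.mean() / 255.0

  # features: one value per card, shape (n_cards, 1); labels: "empty"/"striped"/"solid".
  # clf = KNeighborsClassifier(n_neighbors=5).fit(features, labels)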

The overall accuracy of the model with hand-engineered features for each attribute (number, color, shading, shape), using k-NN, was 73%. With this level of accuracy the computer doesn’t play a very good game of SET, since the triples it picks out are often not SETs.

Using the CNN model the accuracy was over 99%. What’s more, the CNN predictions come with a confidence level (due to the softmax unit), so when selecting a SET from a layout we can favor the cards with higher confidence levels. This is not possible with k-NN or SVM, so using a CNN results in an even more robust game.
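Reusing the is_set helper from the earlier sketch, favoring higher-confidence cards could look something like this:

  from itertools import combinations

  def best_set(cards, confidences):
      # cards: predicted (number, color, shading, shape) tuples for the layout;
      # confidences: the softmax probability of each card's predicted label.
      candidates = [t for t in combinations(range(len(cards)), 3)
                    if is_set(cards[t[0]], cards[t[1]], cards[t[2]])]
      # Prefer the SET whose least-confident card is the most confident.
      return max(candidates, key=lambda t: min(confidences[i] for i in t), default=None)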

Deploying the Model

Once the model was performing well on test data it was time to incorporate it into an application. In this case I had a Java vision application (using the BoofCV library) so I chose to convert the Keras model to TensorFlow, then use the TensorFlow Java API to get predictions for new card images.
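As an illustration (the directory name is arbitrary, and the original conversion may well have used a different route), with current TensorFlow a trained Keras model can be exported as a SavedModel, which the TensorFlow Java API can then load:

  import tensorflow as tf

  # Export the trained Keras model as a TensorFlow SavedModel for the Java application.
  tf.saved_model.save(model, "export/set_card_classifier")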

The application takes a webcam feed (which can be streamed from a phone using a suitable app) and highlights a SET that it finds in images from the feed. Card detection and prediction takes roughly one third of a second on a CPU, so, while not truly realtime, the performance is good enough in practice to find SETs faster than people can.

In the screen capture below you can see that the program finds SETs (highlighted in blue) even while the cards are being dealt.

Future Improvements

The two step card detection and prediction could be combined into a single step if the neural network were to perform object detection as well as classification. (One complication is that there may be more than 12 cards on the table, since the rules of SET permit more cards to be placed in the layout if there is no SET in the first 12.) There is promising research in this area, such as YOLO9000.

Finally, it would be very nice to have an augmented reality app to play SET.

Acknowledgements: thank you to my friends Moriah Ulinskas and Dominic Willsdon for introducing me and my family to SET on New Year’s Eve 2010.

The code used in this blog post can be found **here**.

Cloudera Data Science Workbench is available to Cloudera customers who license Cloudera Enterprise Data Hub and Cloudera Data Engineering editions. Learn more about Keras here. Cloudera does not distribute or support Keras.