A Neural Network for predicting Restaurant Reservations

In this blog post series, we will use a neural network for predicting restaurant reservations. This first post will describe how we can use a neural network for predicting the number of days between the reservation and the actual visit given a number of visitors.

The Data

I joined this Kaggle competition. This competition provides you with the data. In my notebook, I just focused on the visit_datetime (the time the actual visit took place) and the reserve_datetime (the time the reservation was made) columns.

It is straightforward to examine the difference between visit_datetime and reserve_datetime using Pandas! You can do so using the following line of code:

df_joined['lag_days'] = (df_joined['visit_datetime'] - df_joined['reserve_datetime']).apply(lambda x: x.total_seconds() / 3600. / 24.)

It is then possible to make a histogram out of it. I skip outliers of more than 30 visitors, since there are only a few datapoints available for reservations of large groups.

We will start with a simple script using Pandas for plotting the distribution of the “lag” times (where lag is defined as the number of days of visit_datetime – reserve_datetime).

df_joined['lag_days'][df_joined['lag_days'] < 30].hist(normed=True)
plt.title('Reservation lag')
plt.xlabel('No. days')
plt.ylabel('Frequency')
plt.show()

This results into the following histogram:

This histogram makes sense! It is a long-tail distribution which tells us that most of the visits happen a one up to a few days after the reservation. It is very unlikely that a visit happens more than a month after a reservation was made. Is there an underlying cause for the long-term reservations (in which the lag is more than 30 days)? Let’s find out!

The dataset also provides us the number of visitors for which a reservations is made. We can now zoom in on the relationship between the “lag” of the reservation and the number of visitors:

df_joined['reserve_visitors'] = df_joined['reserve_visitors'].astype(int)
visitors_lag = df_joined.groupby('reserve_visitors').apply(lambda x: x['lag_days'].mean())
visitors_lag.keys = ['visitors', 'lag_days']
visitors_lag = visitors_lag.to_frame('Lag days')
visitors_lag[visitors_lag.index < 60].plot()

This results into the following plot:

Sorry for the quick and dirty plot! However, it does signal some kind of relation between the number of visitors and the lag. The more visitors, the longer the time between the reservation and the actual visit. Why is the plot so wobbly for a large number of visitors? That is because there is less and less data available if the number of visitors increases. From this plot we can conclude that there probably is a relation between the number of visitors and the lag. Now it is time to figure out what this relation is!

The goal

Figure out the relation between the number of visitors and the lag. That’s all for now!

Why Neural Networks?

Neural networks are general function approximators. Now, I don’t know the exact underlying function. The best I can do is come up with an as-close-as-possible approximation for the underlying function. The disadvantage is that it is hard to know why it works if we get it to work, but at least we can give it a try.

The results

The first thing to notice is that our values are not normalized. The number of visitors is a number and gets larger and larger. To normalize it, we simply divide it by 100, since all numbers are below 1. The same holds for the lag. Most of the lags are lower than 30. Therefore, I will divide the lag size by 30.

Notice that there are many more approaches for normalizing the data! This is just a quick normalization on the data, but feel free to use your own normalization method. My normalization process is closely related to the MinMaxScalar normalization which can be found in sklearn (scikit-learn).

With just a few lines of Python code we can create a Multi-Layer Perceptron (MLP):

from sklearn.neural_network import MLPRegressor

# Grab some samples (otherwise it will take a long time to train)
samples = df_joined.sample(25000)

X = samples['reserve_visitors'].values.reshape(-1, 1) / 100.
y = samples['lag_days'] / 30.

# Fit our model on the data
model = MLPRegressor(hidden_layer_sizes=(100, 100,), activation='relu')
model.fit(X, y)

# Make the predictions for any number of visitors between 1 and 100
X = np.arange(1, 100).reshape(-1, 1)
y = model.predict(X / 100.) * 30.

# Make the plot
fig, ax = plt.subplots(1, 1)
ax.plot(X, y)

visitors_lag.plot(ax=ax)
plt.show()

And this results into the following plot:

Great! See how close the blue line is to the observed data! Our Neural Network did a great job :-).

Conclusion (TL;DR)

A Neural Network is perfect for approximating unknown functions. In this blog post, we used Python to approximate the reservation lag given a number of visitors. Feel free to post any questions/comments below!

Another post about another Kaggle competition on house price prediction can be found here.