House Price Prediction using a Random Forest Regressor

We will now zoom in on the heatmap we produced earlier by showing only the variables of interest. This could reveal some underlying relationships!

# Build the correlation matrix
matrix = df_train[cols].corr()
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(matrix, vmax=1.0, square=True)

I definitely see some clusters here! GarageCars and GarageArea are strongly correlated, which makes a lot of sense. TotalBsmtSF and 1stFlrSF are also correlated, which is no surprise either. And, as intended, all of the selected variables correlate with SalePrice, which is also visible in this plot. Great! Now we will start with some Machine Learning and try to predict the SalePrice!
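To double-check this numerically, we can sort the correlations with SalePrice directly. A minimal sketch, reusing the matrix variable computed in the block above:

# Sort the variables by their correlation with the target
print(matrix['SalePrice'].sort_values(ascending=False))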

Machine Learning (Random Forest regression)

In this chapter, I will use Random Forest regression, since the target variable (SalePrice) is a continuous real number. I will split the given train set into my own train and test set, since the competition's test set cannot be used for evaluation here. Let's find out how well the model works!

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Use every interesting variable except the target itself as a predictor
pred_vars = [v for v in interesting_variables.index.values if v != 'SalePrice']
target_var = 'SalePrice'

X = df_train[pred_vars]
y = df_train[target_var]

# Hold out 10% of the training data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# SalePrice is continuous, so we need a regressor rather than a classifier
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

import numpy as np

# Build a scatter plot of predictions versus real values
plt.scatter(y_pred, y_test)
plt.xlabel('Prediction')
plt.ylabel('Real value')

# Add the perfect prediction line
diagonal = np.linspace(0, np.max(y_test), 100)
plt.plot(diagonal, diagonal, '-r')
plt.show()

That is great! The red line shows the perfect predictions: if a prediction equaled the real value, its point would lie exactly on the red line. Here you can see that there are some deviations and a few outliers, but that is mainly the case for extremely high prices. There are also some outliers in the low range, and it would be interesting to find out what causes them. To conclude, we can compute two error metrics, the Mean Absolute Error (MAE) and the Mean Squared Log Error (MSLE):

from sklearn.metrics import mean_squared_log_error, mean_absolute_error

print('MAE:\t$%.2f' % mean_absolute_error(y_test, y_pred))
print('MSLE:\t%.5f' % mean_squared_log_error(y_test, y_pred))

MAE:	$23552.62
MSLE:	0.03613

A deviation of roughly $23K, mainly caused by the extreme outliers, is not too bad for a quick try! This is definitely something to keep in mind when buying a house!
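The low-range outliers mentioned above deserve a closer look. Here is a minimal sketch, assuming the X_test, y_test and y_pred variables from the model above, that lists the houses with the largest absolute prediction errors:

import pandas as pd

# Collect real values, predictions and absolute errors per test-set house
errors = pd.DataFrame({'real': y_test.values, 'predicted': y_pred}, index=y_test.index)
errors['abs_error'] = (errors['real'] - errors['predicted']).abs()

# Show the five worst predictions together with their predictor values
worst = errors.sort_values('abs_error', ascending=False).head(5)
print(worst)
print(X_test.loc[worst.index])

As a side note: the competition itself is scored on the root mean squared error between the logarithms of the predicted and real prices, so the square root of the MSLE above (roughly 0.19) gives a rough indication of the corresponding leaderboard score.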

Conclusion (TL;DR)

With the help of just a Random Forest regressor, it is possible to predict house prices fairly well! So, if you are about to buy a house, please contact me! Oh, and if you are interested in learning more about Pandas, definitely check out this article. If you are interested in writing an article on Data Blogger, please do so here! Yes, you will get paid :-).

The description of the competition can be found on Kaggle and my final notebook can be found here. Interested in predicting the value of your car? Then definitely read this article, which uses a Neural Network for price prediction. An article on another Kaggle competition, about restaurant reservations, can be found here.