XGBoost has become incredibly popular on Kaggle in the last year for problems dealing with structured data. I was already familiar with sklearn's version of gradient boosting and had used it before, but I hadn't really considered trying XGBoost until I became more familiar with it; I was perfectly happy with sklearn's version and didn't think much of switching. So why have so many people started using it? There are a variety of answers I am sure others have come up with, but the main reason for me can be summarized in one word:
Speed.
You can see from the paper on XGBoost's debut in the Higgs Boson Kaggle competition that XGBoost has a nice speed advantage over sklearn; Figure 3 of that paper shows the comparison.
In addition, Szilard Pafka has done an excellent benchmark of a variety of machine learning libraries and he also found XGBoost to be much faster.
What about XGBoost makes it faster? Gradient boosted trees, as you may be aware, have to be built in series so that a step of gradient descent can be taken in order to minimize a loss function. Unlike Random Forests, you can’t simply build the trees in parallel. XGBoost, however, builds the tree itself in a parallel fashion. You can see more details about this in section 4.1 of the XGBoost paper by Chen and Guestrin here, but essentially the information contained in each feature column can have statistics calculated on it in parallel (along with an initial sort of the columns).
What you get in return for this speed is a much faster grid search for optimizing hyperparameters in model tuning.
To show what the library can do, including some of its more advanced features, I am going to walk through an example classification problem with the library.
Getting and Preprocessing the Data
I decided a nice dataset to use for this example comes, yet again, from the UC-Irvine Machine Learning Repository: the classic "Adult Data Set". An older set from 1996, it contains census data on income. Our job is to predict whether an individual had an income greater than $50,000 based on their demographic information. In the dataset description found here, we can see that the best model they came up with at the time had an accuracy of 85.95% (a 14.05% error rate on the test set). Let's see if we can beat that, shall we?
First, load in numpy/pandas and download the data, which is already split into train/test sets for us. Pass header=None so pandas doesn't treat the first data row as a header, and make sure to skip the first row of the test file, which isn't data.
import numpy as np
import pandas as pd

train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', skiprows = 1, header = None)  # Make sure to skip a row for the test set
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels
Let's check the training set for missing values:

train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
wage_class        32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Hmm. There don't seem to be any missing values. According to the accompanying notes, however, the original datasets had 32561 rows in train and 16281 in test with unknowns included, and unknown values are labeled with a question mark (?). The documented test results were obtained after removing all unknowns, so to see if we can beat their best result, we need to remove the same unknown rows.
If we do so, we should have 30162 in train and 15060 in test. Let’s see if we can remove all of these unknown rows.
It turns out the question mark was actually entered with a space first. Let’s do a simple test to see what would happen if we dropped all rows that contain an unknown marked with a ‘ ?’.
test_set.replace(' ?', np.nan).dropna().shape
If that returns the expected 15060 rows, we can safely drop the unknowns from both sets:

train_nomissing = train_set.replace(' ?', np.nan).dropna()
test_nomissing = test_set.replace(' ?', np.nan).dropna()
One more quirk: in the test file, the wage_class labels carry a trailing period (' <=50K.' and ' >50K.'), so normalize them to match the training labels:

test_nomissing['wage_class'] = test_nomissing.wage_class.replace({' <=50K.': ' <=50K', ' >50K.': ' >50K'})
test_nomissing.wage_class.unique()

train_nomissing.wage_class.unique()
1 |
|
Now these look the same. There is one thing we still need to do before applying XGBoost, however: every feature stored as a string needs to be turned into a numeric value.
Applying Ordinal Encoding to Categoricals
Our next step is to apply ordinal encoding to the features with string categories, since XGBoost (like most other machine learning libraries in Python) requires every feature to be numeric. To do this properly, we will join the two sets together and encode the combined data, so that the same category codes exist in both datasets when we do our train/test evaluation. There is some debate as to whether you are technically supposed to use one-hot encoding for categorical features (since there isn't supposed to be any inherent ordering), but I found this discussion on the subject interesting. It seems that one-hot encoding isn't necessary for tree-based algorithms, although you may run into problems using ordinal encoding with linear models. I personally find ordinal encoding more than sufficient, and it significantly reduces your memory footprint if you have quite a few categorical features.
First, combine the two into a single dataset.
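The concatenation itself can be a one-line pd.concat; a minimal sketch (the combined_set name is what the encoding loop below expects):

combined_set = pd.concat([train_nomissing, test_nomissing], axis = 0)  # Stack train on top of test, keeping the original row order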
combined_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45222 entries, 0 to 16280
Data columns (total 15 columns):
age               45222 non-null int64
workclass         45222 non-null object
fnlwgt            45222 non-null int64
education         45222 non-null object
education_num     45222 non-null int64
marital_status    45222 non-null object
occupation        45222 non-null object
relationship      45222 non-null object
race              45222 non-null object
sex               45222 non-null object
capital_gain      45222 non-null int64
capital_loss      45222 non-null int64
hours_per_week    45222 non-null int64
native_country    45222 non-null object
wage_class        45222 non-null object
dtypes: int64(6), object(9)
memory usage: 5.5+ MB
Now replace every string-valued column with integer category codes:

for feature in combined_set.columns:                                          # Loop through all columns in the dataframe
    if combined_set[feature].dtype == 'object':                               # Only apply to columns with categorical strings
        combined_set[feature] = pd.Categorical(combined_set[feature]).codes   # Replace strings with an integer
Now that we have all of our features encoded, we need to split the combined data back into its original train/test pieces. Since the rows haven't been shuffled, we just need to slice by the original row counts.
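A minimal sketch of that split, assuming the concatenation above (final_test reappears later when we build the test DMatrix; final_train is the matching name used here for the training half):

final_train = combined_set[:train_nomissing.shape[0]].copy()  # First 30162 rows are the training examples
final_test = combined_set[train_nomissing.shape[0]:].copy()   # Remaining 15060 rows are the test examples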
We can now finally start playing around with XGBoost.
Initial Model Setup and Grid Search
We still have our target value inside our train and test frames; it needs to be separated from the feature vectors we will be feeding into XGBoost. Let's do that now.
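A sketch of that separation (y_test is used again when we score the final predictions; y_train is an assumed name):

y_train = final_train.pop('wage_class')  # Remove the target from the training features and keep it separately
y_test = final_test.pop('wage_class')    # Same for the test set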
Now import the libraries we will need for the grid search. Fortunately for those of us used to sklearn's API, XGBoost provides a compatible wrapper, so we can still use the traditional GridSearchCV.
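Something along these lines should work (the GridSearchCV import below uses the current sklearn location; older versions exposed it under sklearn.grid_search):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV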
To see all of the available parameters that can be tuned in XGBoost, have a look at the parameter documentation. This should help you better understand the choices I am making to start off our first grid search. I am going to start by tuning the maximum depth of the trees along with min_child_weight, which sets the minimum amount of instance weight required in a child node and plays a role roughly analogous to min_samples_leaf in sklearn's gradient boosted trees. We set the objective to 'binary:logistic' since this is a binary classification problem (although you can specify your own custom objective function if you wish: an example is available here).
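A sketch of the setup, reconstructed from the GridSearchCV output shown below (the cv_params and ind_params names are illustrative):

cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}     # Grid to search over
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed': 0, 'subsample': 0.8,
              'colsample_bytree': 0.8, 'objective': 'binary:logistic'}  # Fixed parameters
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params),
                             cv_params,
                             scoring = 'accuracy', cv = 5, n_jobs = -1)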
Now let’s run our grid search with 5-fold cross-validation and see which parameters perform the best.
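Fitting follows the usual sklearn pattern (final_train and y_train are the names assumed earlier):

optimized_GBM.fit(final_train, y_train)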
GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=1000, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.8),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'min_child_weight': [1, 3, 5], 'max_depth': [3, 5, 7]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)
optimized_GBM.grid_scores_
We can see that the first hyperparameter combination performed best and we already beat our target of 85.95% accuracy in our cross-validation! Let’s try optimizing some other hyperparameters now to see if we can beat a mean of 86.78% accuracy. This time, we will play around with subsampling along with lowering the learning rate to see if that helps.
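A sketch of the second search, again reconstructed from the GridSearchCV output below:

cv_params = {'learning_rate': [0.1, 0.01], 'subsample': [0.7, 0.8, 0.9]}
ind_params = {'n_estimators': 1000, 'seed': 0, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic', 'max_depth': 3, 'min_child_weight': 1}
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params),
                             cv_params,
                             scoring = 'accuracy', cv = 5, n_jobs = -1)
optimized_GBM.fit(final_train, y_train)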
GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=1000, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'subsample': [0.7, 0.8, 0.9], 'learning_rate': [0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)
optimized_GBM.grid_scores_
No, it doesn't look like we can improve on this. However, we may be able to optimize a little further by using XGBoost's built-in cv, which supports early stopping to prevent overfitting.
Early stopping CV
Based on the CV testing performed earlier, we want to utilize the following parameters:
- learning_rate (eta) = 0.1
- subsample, colsample_bytree = 0.8
- max_depth = 3
- min_child_weight = 1
There are a few other parameters we could tune in theory to squeeze out further performance, but this is a good enough starting point.
To speed XGBoost up over the many boosting iterations on the training set, and since we are now using XGBoost's native API rather than the sklearn wrapper, we can create a DMatrix. This sorts the data up front into the structure XGBoost uses when building trees, which makes training more efficient and is especially helpful when you have a very large number of training examples. To create a DMatrix:
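A sketch of that step, using the training features and labels from above (xgdmat is the name the training call below expects):

xgdmat = xgb.DMatrix(final_train, y_train)  # XGBoost's internal data structure; pre-sorts the columns for faster tree building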
Now let's specify our parameters (with slightly different names in some places for the XGBoost native API) and set our stopping criteria. For now, let's say we will stop only if the error fails to improve for 100 new trees.
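A sketch of the native-API parameters and the cross-validated run (the 3000-round cap, the 5 folds, and the cv_xgb name are illustrative choices; the parameter dictionary mirrors the one used for the final model below):

our_params = {'eta': 0.1, 'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic', 'max_depth': 3, 'min_child_weight': 1}

cv_xgb = xgb.cv(params = our_params, dtrain = xgdmat,
                num_boost_round = 3000, nfold = 5,
                metrics = ['error'],             # Classification error rate
                early_stopping_rounds = 100)     # Stop if the CV error hasn't improved in 100 rounds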
Will train until cv error hasn't decreased in 100 rounds.
Stopping. Best iteration: 432
The best iteration came in at 432 trees, so let's train our final model on the full training set with that number of boosting rounds:

our_params = {'eta': 0.1, 'seed': 0, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic', 'max_depth': 3, 'min_child_weight': 1}

final_gb = xgb.train(our_params, xgdmat, num_boost_round = 432)
Next, let's take a look at which features the model found most important. Set up plotting first:

%matplotlib inline
import seaborn as sns
sns.set(font_scale = 1.5)
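XGBoost ships with a plotting helper for feature importances; a one-line sketch:

xgb.plot_importance(final_gb)  # Bar chart of features ranked by how often they are used to split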
[Feature importance bar plot]
The raw importance scores behind that plot are available as a dictionary via get_fscore():

importances = final_gb.get_fscore()
importances
Given this dictionary, we can now make a plot of it ourselves.
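One way to plot the dictionary ourselves with pandas (seaborn just supplies the styling we set earlier); the importance_frame name and the styling choices are illustrative:

importance_frame = pd.DataFrame({'Feature': list(importances.keys()),
                                 'Importance': list(importances.values())})
importance_frame.sort_values(by = 'Importance', inplace = True)
importance_frame.plot(kind = 'barh', x = 'Feature', figsize = (8, 8), color = 'orange')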
[Custom feature importance bar plot]
Finally, let's see how well the model does on the held-out test set. Wrap the test features in a DMatrix first:

testdmat = xgb.DMatrix(final_test)
from sklearn.metrics import accuracy_score

y_pred = final_gb.predict(testdmat)  # Predict using our testdmat
y_pred
You can see that the predict function for XGBoost outputs probabilities by default and not actual class labels. To calculate accuracy we need to convert these to a 0/1 label. We will set 0.5 probability as our threshold.
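A simple in-place conversion (0.5 is the threshold mentioned above):

y_pred[y_pred > 0.5] = 1   # Probabilities above the threshold become the positive class
y_pred[y_pred <= 0.5] = 0  # Everything else becomes the negative class
y_pred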
array([ 0., 0., 0., ..., 1., 0., 1.], dtype=float32)
accuracy_score(y_pred, y_test), 1-accuracy_score(y_pred, y_test)
Our final accuracy is 86.94%, or a 13.05% error rate. We beat our goal by a whole percentage point!
There is probably more we could do to improve this model, such as additional hyperparameter tuning or even some extra feature engineering. However, according to the documentation accompanying the original data, we beat their best reported performance. It has been 20 years since the data was originally submitted, so we shouldn't be too surprised!
Ideas for Improvement and Summary
In this post, we explored some of the basic functionality involving the XGBoost library. We also learned how it works and why it performs faster than other gradient boosting libraries do. As a test, we used it on an example dataset of US incomes, beating the performance of other documented models for the dataset with very little effort. We were also able to investigate feature importances to see which features were influencing the model most.
Possible ideas for using XGBoost in the future:
- Make sure to read the documentation thoroughly if you want to utilize some of its more advanced features.
- This is a rapidly evolving library, so additional features may be introduced that affect its functionality.
- XGBoost is available in most of the other common languages for data science as well, such as Julia and R, if those are preferable.
- XGBoost can also now be utilized at scale, running on Spark and Flink with the recently released XGBoost4J.
Some other helpful information:
- This YouTube video by Tong He covers XGBoost in detail. It describes the R version, but the API is very similar and there is a lot of great information on how XGBoost actually works.
- Here is his slide deck, which covers the same material.
- Another tutorial guide on hyperparameter tuning from Aarshay Jain is available here.
Personally, I wanted to start using XGBoost because of how fast it is and the great success many Kaggle competition entrants have had with the library so far. I've begun using it in my own work and have been very pleased with the speed increase. If you have a modeling problem involving structured data, I'd encourage you to give the library a try! Thanks to all of the XGBoost developers and contributors for such a wonderful library.
If you would like the Jupyter Notebook for this blog post, you can find it here.