Introduction to XGBoost

Before diving deep into XGBoost, let us first understand Gradient Boosting.Boosting is just taking random samples of data from our dataset and learning a weak learner (a predictor with not so great accuracy) for it. There is one catch though, we give more weights to those data points which were misclassified by the previous learners.I know that this is making a little sense to you. Let us clear it with the help of an example and a visualization.

Example:

Suppose that you are given a binary classification problem. Also let us assume that we are using Decision tree stumps (small decision trees) as our weak learner. There are weights assigned to each data point (all equal initially) and the error is the sum of weights of misclassified examples.

Step:1

Here, we will have a weak predictor (h_1) which classifies the data points.

Step:2

Now, we will increase the weights of the points misclassified by (h_1) and learn another predictor (h_2) on the data points (can be a sample of data points also). In another words, we are trying to minimize the (residual) = (y)–(h_1).

Step:3

Current Hypothesis: (H) = (h_1) + (h_2) Now, we will learn another predictor (h_3) on the data points and try to minimize the (residual) = (y)-((h_1) + (h_2)).

We will perform these steps upto say (m) times, each time giving boosted importance to misclasified points and trying reducing the residual.Our final hypothesis will be a weighted sum of these weak learners. (H) = (w_1)(h_1) + (w_2)(h_2) + … + (w_m)(h_m)

Visualization

Below, there are three weak predictors and they are combining to become a strong one. Also notice that each subsequent predictor tries to classify correctly what the last predictor misclassified.

Ques: How do I set m?

Ques: How do I set m?

Well, (m) is a hyperparameter and is tuned using cross-validation.

What kind of predictors can ‘h’ be?

There are mathematical restrictions on it (diffrentiablity requirements). Genreally, decision tree stumps are used.

I highly recommend you to see Abhishek Ghose’s answer for a Mathematical-cum-intuitive explanation and this for visualization of the process.